
WIP: KubeVirtRelieveAndMigrate: add stable scoring pipeline and eviction cooldown to reduce descheduling churn #1744

Open
tiraboschi wants to merge 2 commits into openshift:main from tiraboschi:kv_cooldown

@tiraboschi
Contributor

This PR introduces two improvements to the Prometheus-based node scoring used by the descheduler, aimed at reducing instability and spurious eviction loops.

Stable per-dimension noise filtering (p66/5m)

Rather than computing the actuation priority from instantaneous 1-minute averages, each positive-deviation dimension (CPU utilization, CPU pressure, memory utilization, memory pressure) is now filtered independently through quantile_over_time(0.66, ...[5m]) before being combined into the Euclidean distance. Applying the quantile per-dimension, rather than to the final score, is important because the Euclidean distance squares each input: filtering after squaring leaves transient spikes partially visible in the combined result. Per-dimension noise is also often asynchronous (a CPU spike in one interval, a memory spike in another); filtering independently prevents each spike from contributing to the distance at all. The existing :avg1m chain is preserved for comparison and debugging.
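
A minimal numeric sketch of why the order of filtering matters. It is not the recording rules themselves: the sample values are hypothetical, and the `quantile` helper uses linear interpolation between ranks, which is assumed here to approximate what `quantile_over_time` does.

```python
import math


def quantile(samples, q):
    """Linear-interpolation quantile over one window of samples (assumed rule)."""
    ordered = sorted(samples)
    rank = q * (len(ordered) - 1)
    lower = int(math.floor(rank))
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def distance(point):
    """Euclidean distance of a positive-deviation vector from the origin."""
    return math.sqrt(sum(v * v for v in point))


# Hypothetical positive deviations, sampled five times within one 5m window.
window = {
    "cpu_utilization":    [0.0, 0.0, 0.8, 0.0, 0.0],  # spike in the third sample
    "cpu_pressure":       [0.0, 0.0, 0.0, 0.0, 0.0],
    "memory_utilization": [0.0, 0.0, 0.0, 0.0, 0.8],  # spike in the fifth sample
    "memory_pressure":    [0.0, 0.0, 0.0, 0.0, 0.0],
}

# Per-dimension filtering: take p66 of each dimension, then combine.
filtered_first = distance(quantile(s, 0.66) for s in window.values())

# After-the-fact filtering: combine every sample, then take p66 of the score.
per_sample = [distance(point) for point in zip(*window.values())]
filtered_last = quantile(per_sample, 0.66)

print(filtered_first, filtered_last)
```

With these inputs the per-dimension result is 0.0, while filtering the combined score still reports roughly 0.51: the two asynchronous spikes occupy two of the five combined samples, so p66 of the score lands between a quiet sample and a spike.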

Eviction cooldown multiplier

A cooldown mechanism suppresses the actuation priority of nodes that have been recently evicted from, to prevent the descheduler from repeatedly targeting the same node before the effects of prior evictions have settled. The suppression is proportional to the number of successful LowNodeUtilization evictions in the past 15 minutes (linear decay, 25% per eviction), and decays naturally as old evictions leave the sliding window.
Known limitation: the cooldown is applied to the eviction source node. Ideally it would also dampen the score of the receiving node, but the descheduler is only involved in the eviction act itself while pod placement is entirely up to the scheduler, which the descheduler has no visibility into.
Requires: kubernetes-sigs/descheduler#1856
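
The linear decay described above can be sketched as follows. This is an illustration rather than the actual rule: the function name is hypothetical, and the `floor` of 0.01 is borrowed from the clamping mentioned in the commit message and may not match the final rule.

```python
def cooldown_multiplier(evictions, penalty=0.25, floor=0.01):
    """Each eviction in the window removes `penalty` from the multiplier.

    The result is clamped to [floor, 1.0]; `floor` is an assumed value.
    """
    return min(1.0, max(floor, 1.0 - penalty * evictions))


# Multiplier for 0..5 successful evictions inside the sliding window.
multipliers = [cooldown_multiplier(n) for n in range(6)]
print(multipliers)
```

With a 25% penalty the multiplier reaches the floor at the fourth eviction, and it recovers step by step as older evictions leave the window.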

@tiraboschi tiraboschi changed the title KubeVirtRelieveAndMigrate: add stable scoring pipeline and eviction cooldown to reduce descheduling churn WIP: KubeVirtRelieveAndMigrate: add stable scoring pipeline and eviction cooldown to reduce descheduling churn Apr 17, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 17, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 17, 2026

@tiraboschi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/security | 1dd8e29 | link | false | /test security |

Full PR test history. Your PR dashboard.


1.0
)

# Stable per-dimension deviations: p66 over 5m

how about p80? 80% seems to be a common threshold

Contributor Author

This is a quantile: the value that was greater than or equal to 66% of the data points recorded over the last 5-minute window.
p50 is the median, p66 is the "worse-than-average but not outlier" value, and p80 will capture shorter bursts of pressure or utilization that p66 would ignore. p99 will capture all the short-lived spikes.
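
To make the difference between these quantiles concrete, here is a small sketch on a hypothetical window in which a burst covers 20% of the samples. The `quantile` helper interpolates linearly between ranks, which is assumed to approximate `quantile_over_time`; the sample values are made up.

```python
import math


def quantile(samples, q):
    """Linear-interpolation quantile over one window of samples (assumed rule)."""
    ordered = sorted(samples)
    rank = q * (len(ordered) - 1)
    lower = int(math.floor(rank))
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Hypothetical 5m window: eight quiet samples, then a burst in the last two.
samples = [0.1] * 8 + [0.9] * 2

results = {q: quantile(samples, q) for q in (0.50, 0.66, 0.80, 0.99)}
for q, value in results.items():
    print(f"p{round(q * 100)}: {value:.2f}")
```

Here p50 and p66 stay at the quiet level (0.10), p80 starts to pick up the burst (about 0.26), and p99 reports the burst value (0.90).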

descheduler:nodeutilization:cpu:avg1m * 0

# Calculate the Dampening Factor (Multiplier)
# We use a linear decay: each eviction reduces the score by 25%.

So it is not reducing by 25%, because this would never converge, so are we saying that we allow a budget of 4?


Ah, I see the clamping is leading to the convergence.

Contributor Author

It's an eviction penalty, not a simple boolean gate based on a budget of evictions over time.
Even the first eviction reduces the score of the node by 25%. So the second eviction is triggered only if, even with the 25% penalty, the score of the node is still high enough to classify it as overutilized, and so on.
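
A rough sketch of that feedback loop, under simplifying assumptions: the raw score is held constant (in practice each eviction should also lower it), all evictions fall inside the sliding window, and the overutilization `threshold`, the `floor`, and the function name are hypothetical.

```python
def evictions_before_cooldown(raw_score, threshold, penalty=0.25,
                              floor=0.01, max_evictions=100):
    """Count evictions until the penalized score no longer exceeds the threshold."""
    for evictions in range(max_evictions):
        multiplier = max(floor, 1.0 - penalty * evictions)
        if raw_score * multiplier <= threshold:
            return evictions
    return max_evictions  # guard: the floor alone never brought the score down


# Hypothetical overutilization threshold of 0.5.
heavy = evictions_before_cooldown(0.9, 0.5)     # 0.9 -> 0.675 -> 0.45
moderate = evictions_before_cooldown(0.6, 0.5)  # 0.6 -> 0.45
idle = evictions_before_cooldown(0.4, 0.5)      # already below the threshold
print(heavy, moderate, idle)
```

The number of evictions a node receives therefore depends on how far above the threshold it sits, rather than on a fixed budget.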

@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign p0lyn0mial for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tiraboschi and others added 2 commits May 6, 2026 10:25
…ering before Euclidean distance

Introduces a parallel chain of recording rules that apply
quantile_over_time(0.66, ...[5m]) to each positive-deviation dimension
independently, before they are combined into the Euclidean distance:

  descheduler:nodeutilization:cpu:p66_5m:positivedeviation
  descheduler:nodepressure:cpu:p66_5m:positivedeviation
  descheduler:nodeutilization:memory:p66_5m:positivedeviation
  descheduler:nodepressure:memory:p66_5m:positivedeviation

From these, a stable Euclidean distance and its k=3 amplified form are
computed:

  descheduler:node:ideal_point_positive_distance:p66_5m
  descheduler:node:linear_amplified_ideal_point_positive_distance:k3:p66_5m

The quantile is applied per-dimension rather than to the final score because
the Euclidean distance squares each input: filtering after squaring would
leave transient spikes partially visible in the combined result. More
importantly, per-dimension noise is often asynchronous (CPU spikes in one
interval, memory in another); filtering independently prevents each spike
from contributing to the distance at all, whereas filtering the combined
score after the fact cannot undo the amplification from squaring.

The existing :avg1m chain is kept for comparison and debugging.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Simone Tiraboschi <stirabos@redhat.com>
…epeated evictions from the same node

Introduces three new Prometheus recording rules to implement a cooldown
mechanism for the LowNodeUtilization descheduling strategy:

- descheduler:node:eviction_count:10m: counts successful evictions per node
  over a 10-minute sliding window. Uses label_replace to map the metric's
  `node` label to `instance` for joining with utilization metrics. The `or`
  with `nodeutilization:cpu:avg1m * 0` ensures all nodes appear in the result
  even when they have had no recent evictions, preventing them from being
  dropped in downstream joins.

- descheduler:node:cooldown_multiplier:10m: linear decay factor derived from
  the eviction count. Each eviction reduces the multiplier by 10%, so 10
  evictions in 10 minutes effectively mute the node (clamped to 0.01 to keep
  it visible in dashboards). The cooldown naturally decays as old evictions
  leave the sliding window.

- descheduler:node:final_actuation_priority:p66_5m: final actuation score
  combining the stable noise-filtered distance with the cooldown multiplier,
  suppressing nodes that were recently targeted to reduce churn and improve
  stability.

Known limitations:
- The cooldown is applied to the node from which pods were evicted. Ideally
  it would also dampen the score of the node where the evicted workload
  eventually lands, to prevent it from becoming over-loaded. However, the
  descheduler is only responsible for the eviction act itself; pod placement
  after eviction is entirely up to the scheduler, and the descheduler has no
  visibility into the destination node.
- increase() over a 10m window can return fractional values in sparse data
  due to extrapolation at window boundaries, making the per-eviction
  thresholds slightly fuzzy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Simone Tiraboschi <stirabos@redhat.com>
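
The three rules in the second commit message can be sketched in plain Python as below. This mirrors the description only, not the PromQL: node names and input values are hypothetical, and combining the distance and the multiplier by multiplication is an assumption based on the term "multiplier".

```python
def eviction_count(all_nodes, recent_evictions):
    """Give every node a count, defaulting to 0.

    This plays the role of the `or ... * 0` fallback: nodes without recent
    evictions stay in the result instead of vanishing from later joins.
    """
    return {node: recent_evictions.get(node, 0.0) for node in all_nodes}


def cooldown_multiplier(count, penalty=0.10, floor=0.01):
    """Linear decay of 10% per eviction, clamped to [floor, 1.0]."""
    return min(1.0, max(floor, 1.0 - penalty * count))


# Hypothetical stable (p66/5m) distances for three nodes.
stable_distance = {"node-a": 0.80, "node-b": 0.60, "node-c": 0.20}

# Hypothetical eviction counts over 10m; 1.2 models a fractional increase().
recent_evictions = {"node-a": 10.0, "node-b": 1.2}

counts = eviction_count(stable_distance, recent_evictions)
final_priority = {
    node: stable_distance[node] * cooldown_multiplier(counts[node])
    for node in stable_distance
}
print(final_priority)
```

node-a is effectively muted by the 0.01 floor, node-b keeps about 88% of its score because of the fractional count, and node-c passes through unchanged because the zero default keeps it in the join.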

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

2 participants