WIP: KubeVirtRelieveAndMigrate: add stable scoring pipeline and eviction cooldown to reduce descheduling churn by tiraboschi · Pull Request #1744 · openshift/cluster-kube-descheduler-operator

tiraboschi · 2026-04-17T14:02:29Z

This PR introduces two improvements to the Prometheus-based node scoring used by the descheduler, aimed at reducing instability and spurious eviction loops.

Stable per-dimension noise filtering (p66/5m)

Rather than computing the actuation priority from instantaneous 1-minute averages, each positive-deviation dimension (CPU utilization, CPU pressure, memory utilization, memory pressure) is now filtered independently through quantile_over_time(0.66, ...[5m]) before being combined into the Euclidean distance. Applying the quantile per-dimension, rather than to the final score, is important because the Euclidean distance squares each input: filtering after squaring leaves transient spikes partially visible in the combined result. Per-dimension noise is also often asynchronous (a CPU spike in one interval, a memory spike in another); filtering independently prevents each spike from contributing to the distance at all. The existing :avg1m chain is preserved for comparison and debugging.
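To illustrate the ordering argument above, here is a minimal sketch in Python rather than PromQL. The `prom_quantile` helper is a hypothetical stand-in that approximates the linear interpolation `quantile_over_time` is understood to use; the sample values and the two-dimension setup are invented for illustration and are not taken from the PR's rules.

```python
import math

def prom_quantile(q, samples):
    """Linear-interpolated quantile, approximating PromQL's quantile_over_time."""
    values = sorted(samples)
    rank = q * (len(values) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(values) - 1)
    weight = rank - lower
    return values[lower] * (1 - weight) + values[upper] * weight

def euclidean(*dims):
    """Euclidean distance from the origin across the given dimensions."""
    return math.sqrt(sum(d * d for d in dims))

# Hypothetical positive deviations within one 5-minute window:
# a CPU spike in one sample, a memory spike in a different one.
cpu = [0.0, 0.0, 0.5, 0.0, 0.0, 0.0]
mem = [0.0, 0.0, 0.0, 0.0, 0.5, 0.0]

# Filter each dimension first, then combine (the approach described above).
per_dimension = euclidean(prom_quantile(0.66, cpu), prom_quantile(0.66, mem))

# Combine first, then filter the resulting score.
combined_first = prom_quantile(0.66, [euclidean(c, m) for c, m in zip(cpu, mem)])

print(per_dimension, combined_first)
```

With these made-up samples each isolated spike falls below the p66 of its own dimension, so the per-dimension result is zero, while the combined-first series contains two elevated samples and its p66 stays above zero.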

Eviction cooldown multiplier

A cooldown mechanism suppresses the actuation priority of nodes from which pods were recently evicted, so that the descheduler does not repeatedly target the same node before the effects of prior evictions have settled. The suppression is proportional to the number of successful LowNodeUtilization evictions in the past 15 minutes (linear decay, 25% per eviction), and decays naturally as old evictions leave the sliding window.
Known limitation: the cooldown is applied to the eviction source node. Ideally it would also dampen the score of the receiving node, but the descheduler is only involved in the eviction act itself while pod placement is entirely up to the scheduler, which the descheduler has no visibility into.
Requires: kubernetes-sigs/descheduler#1856
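
The linear decay described above can be sketched as a small Python function. This is only an illustration of the arithmetic, not the PR's recording rule: the function name is hypothetical, the 25% step comes from this description, and the 0.01 clamp floor is an assumption borrowed from the commit message in this PR.

```python
def cooldown_multiplier(evictions, step=0.25, floor=0.01):
    """Linear decay: each recent eviction lowers the multiplier by `step`,
    clamped so that it never drops below `floor`."""
    return max(floor, 1.0 - step * evictions)

# Multiplier for 0..5 recent evictions inside the sliding window.
for n in range(6):
    print(n, cooldown_multiplier(n))
```

Under these assumptions the multiplier reaches the clamp after four evictions, and it rises again as evictions age out of the window and the count falls.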

openshift-ci · 2026-04-17T15:25:34Z

@tiraboschi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/security | `1dd8e29` | link | false | `/test security` |


fabiand · 2026-04-17T15:35:53Z

              1.0
            )
+
+        # Stable per-dimension deviations: p66 over 5m


how about p80? 80% seems to be a common threshold

This is a quantile, meaning the value that is greater than or equal to 66% of the data points recorded over the last 5-minute window.
p50 is the median; p66 is the "worse-than-average but not an outlier" value; p80 will capture shorter bursts of pressure or utilization that p66 would ignore; p99 will capture all the short-lived spikes.
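
As a rough illustration of how the quantile level decides which bursts are visible, the sketch below evaluates several quantiles over one invented window in which a burst covers 30% of the samples. The `prom_quantile` helper is a hypothetical approximation of the interpolation `quantile_over_time` is understood to perform, and the sample values are made up.

```python
import math

def prom_quantile(q, samples):
    """Linear-interpolated quantile, approximating PromQL's quantile_over_time."""
    values = sorted(samples)
    rank = q * (len(values) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(values) - 1)
    weight = rank - lower
    return values[lower] * (1 - weight) + values[upper] * weight

# Hypothetical window: 7 quiet samples and a burst of 3 high samples.
window = [0.1] * 7 + [0.9] * 3

results = {q: prom_quantile(q, window) for q in (0.50, 0.66, 0.80, 0.99)}
for q, value in results.items():
    print(f"p{round(q * 100)}: {value:.2f}")
```

In this example p50 and p66 stay at the quiet level while p80 and p99 report the burst, which is the trade-off being discussed: a higher quantile reacts to shorter bursts, a lower one filters them out.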

fabiand · 2026-04-17T15:37:07Z

+            descheduler:nodeutilization:cpu:avg1m * 0
+
+        # Calculate the Dampening Factor (Multiplier)
+        # We use a linear decay: each eviction reduces the score by 25%.


So it is not reducing by 25%, because this would never converge, so we are saying it is a budget of 4 that we allow?

Ah, I see the clamping is leading to the convergence.

It's an eviction penalty, not a simple boolean gate driven by a budget of evictions over time.
Even the first eviction reduces the node's score by 25%, so a second eviction is triggered only if the score, even with the 25% penalty applied, is still high enough to classify the node as overutilized, and so on.
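
The difference between a penalty and a fixed budget can be sketched as follows. This is a simplified model, not the descheduler's logic: the function name, the `threshold` above which a node counts as overutilized, and the assumption that the raw score stays constant between evictions are all hypothetical.

```python
def evictions_until_below(raw_score, threshold, step=0.25, floor=0.01, limit=100):
    """Count how many evictions it takes before the penalized score
    no longer exceeds the (hypothetical) overutilization threshold."""
    for evictions in range(limit + 1):
        multiplier = max(floor, 1.0 - step * evictions)
        if raw_score * multiplier <= threshold:
            return evictions
    return limit  # safety cap so the sketch always terminates

# A mildly loaded node stops being targeted after one eviction,
# a heavily loaded one tolerates more before the penalty mutes it.
print(evictions_until_below(0.5, threshold=0.4))
print(evictions_until_below(0.9, threshold=0.4))
print(evictions_until_below(0.3, threshold=0.4))
```

Under these assumptions the number of evictions depends on how far the node sits above the threshold, rather than being a flat allowance of four.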

openshift-ci · 2026-05-05T14:03:51Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign p0lyn0mial for approval. For more information see the Code Review Process.


Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…ering before Euclidean distance

Introduces a parallel chain of recording rules that apply quantile_over_time(0.66, ...[5m]) to each positive-deviation dimension independently, before they are combined into the Euclidean distance:

  descheduler:nodeutilization:cpu:p66_5m:positivedeviation
  descheduler:nodepressure:cpu:p66_5m:positivedeviation
  descheduler:nodeutilization:memory:p66_5m:positivedeviation
  descheduler:nodepressure:memory:p66_5m:positivedeviation

From these, a stable Euclidean distance and its k=3 amplified form are computed:

  descheduler:node:ideal_point_positive_distance:p66_5m
  descheduler:node:linear_amplified_ideal_point_positive_distance:k3:p66_5m

The quantile is applied per-dimension rather than to the final score because the Euclidean distance squares each input: filtering after squaring would leave transient spikes partially visible in the combined result. More importantly, per-dimension noise is often asynchronous (CPU spikes in one interval, memory in another); filtering independently prevents each spike from contributing to the distance at all, whereas filtering the combined score after the fact cannot undo the amplification from squaring.

The existing :avg1m chain is kept for comparison and debugging.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Simone Tiraboschi <stirabos@redhat.com>

…epeated evictions from the same node

Introduces three new Prometheus recording rules to implement a cooldown mechanism for the LowNodeUtilization descheduling strategy:

- descheduler:node:eviction_count:10m: counts successful evictions per node over a 10-minute sliding window. Uses label_replace to map the metric's `node` label to `instance` for joining with utilization metrics. The `or` with `nodeutilization:cpu:avg1m * 0` ensures all nodes appear in the result even when they have had no recent evictions, preventing them from being dropped in downstream joins.
- descheduler:node:cooldown_multiplier:10m: linear decay factor derived from the eviction count. Each eviction reduces the multiplier by 10%, so 10 evictions in 10 minutes effectively mute the node (clamped to 0.01 to keep it visible in dashboards). The cooldown naturally decays as old evictions leave the sliding window.
- descheduler:node:final_actuation_priority:p66_5m: final actuation score combining the stable noise-filtered distance with the cooldown multiplier, suppressing nodes that were recently targeted to reduce churn and improve stability.

Known limitations:

- The cooldown is applied to the node from which pods were evicted. Ideally it would also dampen the score of the node where the evicted workload eventually lands, to prevent it from becoming over-loaded. However, the descheduler is only responsible for the eviction act itself; pod placement after eviction is entirely up to the scheduler, and the descheduler has no visibility into the destination node.
- increase() over a 10m window can return fractional values in sparse data due to extrapolation at window boundaries, making the per-eviction thresholds slightly fuzzy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Simone Tiraboschi <stirabos@redhat.com>
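
The zero-fill join described in this commit message can be mimicked with plain dictionaries. The sketch below is an assumption-laden model, not the recording rules themselves: the function and node names are hypothetical, and it assumes the final priority is a simple product of the stable distance and the multiplier, using the 10% step and 0.01 clamp stated above.

```python
def final_priority(stable_distance, eviction_counts, step=0.10, floor=0.01):
    """Combine per-node distances with a cooldown multiplier.
    Nodes with no recorded evictions fall back to a count of zero,
    mirroring the effect of the `or ... * 0` clause."""
    result = {}
    for node, distance in stable_distance.items():
        evictions = eviction_counts.get(node, 0.0)  # zero-fill missing nodes
        multiplier = max(floor, 1.0 - step * evictions)
        result[node] = distance * multiplier
    return result

# Hypothetical inputs: only node-a has recent evictions.
distance = {"node-a": 0.8, "node-b": 0.6}
counts = {"node-a": 5.0}

priorities = final_priority(distance, counts)

# Without the zero-fill, a strict join would silently drop node-b.
strict_join = {n: d for n, d in distance.items() if n in counts}

print(priorities)
print(sorted(strict_join))
```

Here node-b keeps its full score because it has no evictions, whereas a join without the fallback would omit it from the result entirely.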

tiraboschi changed the title ~~KubeVirtRelieveAndMigrate: add stable scoring pipeline and eviction cooldown to reduce descheduling churn~~ WIP: KubeVirtRelieveAndMigrate: add stable scoring pipeline and eviction cooldown to reduce descheduling churn Apr 17, 2026

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 17, 2026

openshift-ci Bot requested review from ingvagabund and ricardomaraschini April 17, 2026 14:03

fabiand reviewed Apr 17, 2026


tiraboschi force-pushed the kv_cooldown branch from 1dd8e29 to f585aa2 Compare May 5, 2026 14:03

tiraboschi and others added 2 commits May 6, 2026 10:25

tiraboschi force-pushed the kv_cooldown branch from f585aa2 to 76f7b27 Compare May 6, 2026 08:25


tiraboschi wants to merge 2 commits into openshift:main from tiraboschi:kv_cooldown
