WIP: KubeVirtRelieveAndMigrate: add stable scoring pipeline and eviction cooldown to reduce descheduling churn #1744
tiraboschi wants to merge 2 commits into
1.0
)

# Stable per-dimension deviations: p66 over 5m
How about p80? 80% seems to be a common threshold.
This is a quantile: p66 is the value that is greater than or equal to 66% of the data points recorded over the last 5-minute window.
p50 is the median, p66 is the "worse-than-average but not outlier" value, and p80 would capture shorter bursts of pressure or utilization that p66 would ignore. p99 would capture all the short-lived spikes.
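To make the trade-off between quantiles concrete, here is a minimal numerical sketch. It is not the recording rule itself: the window contents, the 10-sample window size, and the `quantile` helper (linear interpolation between neighbouring ranks, assumed here to approximate how `quantile_over_time` behaves) are illustrative assumptions.

```python
import math

def quantile(samples, q):
    """Quantile with linear interpolation between neighbouring ranks (assumed behaviour)."""
    ordered = sorted(samples)
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    # Interpolate between the two samples that bracket the rank.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

# Hypothetical 5-minute window: 10 samples, with a burst covering 3 of them.
window = [0.2, 0.2, 0.9, 0.9, 0.9, 0.2, 0.2, 0.2, 0.2, 0.2]

for q in (0.50, 0.66, 0.80, 0.99):
    print(f"p{round(q * 100)}: {quantile(window, q):.2f}")
```

In this made-up window the burst covers 30% of the samples, so p50 and p66 stay at the 0.2 baseline while p80 and p99 report the 0.9 burst value.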
descheduler:nodeutilization:cpu:avg1m * 0

# Calculate the Dampening Factor (Multiplier)
# We use a linear decay: each eviction reduces the score by 25%.
So it is not reducing by 25%, because this would never converge; are we saying that we allow a budget of 4 evictions?
Ah, I see the clamping is leading to the convergence.
It's an eviction penalty, not a simple boolean gate driven by a budget of evictions over time.
Even the first eviction reduces the node's score by 25%. So a second eviction is triggered only if, even with the 25% penalty, the node's score is still high enough to classify it as overutilized, and so on.
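The penalty logic discussed in this thread can be sketched numerically as follows. This is an illustration only: the 25% step comes from the code comment under review, the 0.01 floor is borrowed from the commit message, and the raw score, the threshold, and both function names are hypothetical.

```python
def cooldown_multiplier(evictions, penalty=0.25, floor=0.01):
    """Linear decay per eviction, clamped to [floor, 1.0]."""
    return min(1.0, max(floor, 1.0 - penalty * evictions))

def still_overutilized(raw_score, evictions, threshold):
    """True if the penalized score still exceeds the (hypothetical) threshold."""
    return raw_score * cooldown_multiplier(evictions) > threshold

raw_score, threshold = 1.5, 1.0  # made-up values for illustration
for evictions in range(5):
    penalized = raw_score * cooldown_multiplier(evictions)
    print(evictions, penalized, still_overutilized(raw_score, evictions, threshold))
```

With these made-up numbers the node remains eligible after one eviction (1.5 × 0.75 = 1.125) but not after two (1.5 × 0.5 = 0.75), which is the "penalty rather than budget" behaviour described above. The clamp only matters from the fourth eviction on, where the multiplier would otherwise reach zero.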
…ering before Euclidean distance

Introduces a parallel chain of recording rules that apply quantile_over_time(0.66, ...[5m]) to each positive-deviation dimension independently, before they are combined into the Euclidean distance:

  descheduler:nodeutilization:cpu:p66_5m:positivedeviation
  descheduler:nodepressure:cpu:p66_5m:positivedeviation
  descheduler:nodeutilization:memory:p66_5m:positivedeviation
  descheduler:nodepressure:memory:p66_5m:positivedeviation

From these, a stable Euclidean distance and its k=3 amplified form are computed:

  descheduler:node:ideal_point_positive_distance:p66_5m
  descheduler:node:linear_amplified_ideal_point_positive_distance:k3:p66_5m

The quantile is applied per-dimension rather than to the final score because the Euclidean distance squares each input: filtering after squaring would leave transient spikes partially visible in the combined result. More importantly, per-dimension noise is often asynchronous (CPU spikes in one interval, memory in another); filtering independently prevents each spike from contributing to the distance at all, whereas filtering the combined score after the fact cannot undo the amplification from squaring.

The existing :avg1m chain is kept for comparison and debugging.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Simone Tiraboschi <stirabos@redhat.com>
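The asynchronous-noise argument can be checked with a small numerical sketch. It uses only two of the four dimensions, a plain Euclidean norm, a hypothetical 10-sample window, and a `quantile` helper with linear interpolation between ranks; all of these are simplifying assumptions, not the actual recording rules.

```python
import math

def quantile(samples, q):
    """Quantile with linear interpolation between neighbouring ranks (assumed behaviour)."""
    ordered = sorted(samples)
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

# Hypothetical positive deviations: a CPU spike early in the window,
# a memory spike later, never both at the same time.
cpu = [0.6, 0.6, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mem = [0.0, 0.0, 0.0, 0.0, 0.0, 0.6, 0.6, 0.6, 0.0, 0.0]

# Filter each dimension first, then combine.
filter_then_combine = math.hypot(quantile(cpu, 0.66), quantile(mem, 0.66))

# Combine per sample first, then filter the resulting distance.
per_sample_distance = [math.hypot(c, m) for c, m in zip(cpu, mem)]
combine_then_filter = quantile(per_sample_distance, 0.66)

print(filter_then_combine, combine_then_filter)
```

Each spike covers only 30% of its own series, so per-dimension p66 removes both and the distance is zero. Combined per sample, the two spikes together cover 60% of the window, so p66 of the combined distance still reports the spike level.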
…epeated evictions from the same node

Introduces three new Prometheus recording rules to implement a cooldown mechanism for the LowNodeUtilization descheduling strategy:

- descheduler:node:eviction_count:10m: counts successful evictions per node over a 10-minute sliding window. Uses label_replace to map the metric's `node` label to `instance` for joining with utilization metrics. The `or` with `nodeutilization:cpu:avg1m * 0` ensures all nodes appear in the result even when they have had no recent evictions, preventing them from being dropped in downstream joins.
- descheduler:node:cooldown_multiplier:10m: linear decay factor derived from the eviction count. Each eviction reduces the multiplier by 10%, so 10 evictions in 10 minutes effectively mute the node (clamped to 0.01 to keep it visible in dashboards). The cooldown naturally decays as old evictions leave the sliding window.
- descheduler:node:final_actuation_priority:p66_5m: final actuation score combining the stable noise-filtered distance with the cooldown multiplier, suppressing nodes that were recently targeted to reduce churn and improve stability.

Known limitations:

- The cooldown is applied to the node from which pods were evicted. Ideally it would also dampen the score of the node where the evicted workload eventually lands, to prevent it from becoming over-loaded. However, the descheduler is only responsible for the eviction act itself; pod placement after eviction is entirely up to the scheduler, and the descheduler has no visibility into the destination node.
- increase() over a 10m window can return fractional values in sparse data due to extrapolation at window boundaries, making the per-eviction thresholds slightly fuzzy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Simone Tiraboschi <stirabos@redhat.com>
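The three rules in this commit message chain together roughly as in the sketch below. It models the pipeline with plain dictionaries rather than PromQL; the node names, distances, and eviction counts are made up, and the function names are hypothetical. The 10% step and the 0.01 floor are the values stated in this commit message.

```python
def zero_fill(eviction_counts, all_nodes):
    """Mimic the `or ... * 0` trick: nodes without evictions get a count of 0."""
    return {node: eviction_counts.get(node, 0.0) for node in all_nodes}

def cooldown_multiplier(count, step=0.10, floor=0.01):
    """Linear decay per eviction, clamped to [floor, 1.0]."""
    return min(1.0, max(floor, 1.0 - step * count))

def final_priority(stable_distance, eviction_counts):
    """Stable distance scaled by the cooldown multiplier, per node."""
    counts = zero_fill(eviction_counts, stable_distance.keys())
    return {node: dist * cooldown_multiplier(counts[node])
            for node, dist in stable_distance.items()}

# Hypothetical inputs: node-c has no eviction series at all.
stable_distance = {"node-a": 2.0, "node-b": 1.2, "node-c": 0.8}
eviction_counts = {"node-a": 5.0, "node-b": 10.0}

priorities = final_priority(stable_distance, eviction_counts)
print(priorities)
```

Without the zero fill, `node-c` would be missing from the join and would silently disappear from the final priority. Counts are floats on purpose, since `increase()` can return fractional values as noted under the known limitations.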
This PR introduces two improvements to the Prometheus-based node scoring used by the descheduler, aimed at reducing instability and spurious eviction loops.
Stable per-dimension noise filtering (p66/5m)
Rather than computing the actuation priority from instantaneous 1-minute averages, each positive-deviation dimension (CPU utilization, CPU pressure, memory utilization, memory pressure) is now filtered independently through quantile_over_time(0.66, ...[5m]) before being combined into the Euclidean distance. Applying the quantile per-dimension, rather than to the final score, is important because the Euclidean distance squares each input: filtering after squaring leaves transient spikes partially visible in the combined result. Per-dimension noise is also often asynchronous (a CPU spike in one interval, a memory spike in another); filtering independently prevents each spike from contributing to the distance at all. The existing :avg1m chain is preserved for comparison and debugging.

Eviction cooldown multiplier
A cooldown mechanism suppresses the actuation priority of nodes from which pods were recently evicted, to prevent the descheduler from repeatedly targeting the same node before the effects of prior evictions have settled. The suppression is proportional to the number of successful LowNodeUtilization evictions in the past 15 minutes (linear decay, 25% per eviction), and decays naturally as old evictions leave the sliding window.

Known limitation: the cooldown is applied to the eviction source node. Ideally it would also dampen the score of the receiving node, but the descheduler is only involved in the eviction act itself; pod placement is entirely up to the scheduler, and the descheduler has no visibility into it.
Requires: kubernetes-sigs/descheduler#1856