62 changes: 62 additions & 0 deletions bindata/assets/kube-descheduler/prometheusrule.yaml
@@ -127,3 +127,65 @@ spec:
3 * descheduler:node:ideal_point_positive_distance:avg1m,
1.0
)

# Stable per-dimension deviations: p66 over 5m

how about p80? 80% seems to be a common threshold

Contributor Author

p66 is a quantile: the value that is greater than or equal to 66% of the data points recorded over the last 5-minute window.
p50 is the median; p66 is a "worse-than-median but not outlier" value; p80 would capture shorter bursts of pressure or utilization that p66 ignores; p99 would capture nearly all the short-lived spikes.
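
To make the trade-off between quantile levels concrete, here is a minimal Python sketch. The `quantile` helper is hypothetical and uses linear interpolation between ranked samples, which to my understanding is how `quantile_over_time` behaves; the ten sample values are invented for illustration and are not taken from a real cluster.

```python
import math


def quantile(q, values):
    """Quantile with linear interpolation between ranked samples.

    Assumption: this approximates what quantile_over_time computes over
    the samples in a range vector; it is an illustration, not Prometheus code.
    """
    if not values:
        return math.nan
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = int(math.floor(rank))
    upper = min(len(ordered) - 1, lower + 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Hypothetical 5-minute window of a positive-deviation series:
# mostly calm, a short burst (two samples at 0.50) and one spike (0.90).
window = [0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.50, 0.50, 0.90]

for q in (0.50, 0.66, 0.80, 0.99):
    print(f"p{round(q * 100)} = {quantile(q, window):.3f}")
```

With this made-up window, p50 and p66 both stay at the calm level, p80 lands on the burst, and p99 sits close to the spike, which is the behavior described above.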

# quantile_over_time applied per-dimension BEFORE the Euclidean distance so that
# the squaring step does not amplify transient single-dimension spikes.
# Asynchronous per-dimension noise (CPU spike in minute 1, memory spike in minute 3)
# is filtered independently; applying the quantile only to the final distance would
# leave such spikes partially visible after squaring.
- record: descheduler:nodeutilization:cpu:p66_5m:positivedeviation
expr: quantile_over_time(0.66, descheduler:nodeutilization:cpu:avg1m:positivedeviation[5m])

- record: descheduler:nodepressure:cpu:p66_5m:positivedeviation
expr: quantile_over_time(0.66, descheduler:nodepressure:cpu:avg1m:positivedeviation[5m])

- record: descheduler:nodeutilization:memory:p66_5m:positivedeviation
expr: quantile_over_time(0.66, descheduler:nodeutilization:memory:avg1m:positivedeviation[5m])

- record: descheduler:nodepressure:memory:p66_5m:positivedeviation
expr: quantile_over_time(0.66, descheduler:nodepressure:memory:avg1m:positivedeviation[5m])

# Stable Euclidean distance using noise-filtered per-dimension deviations
- record: descheduler:node:ideal_point_positive_distance:p66_5m
expr: |-
sqrt(
descheduler:nodeutilization:cpu:p66_5m:positivedeviation ^ 2 +
descheduler:nodepressure:cpu:p66_5m:positivedeviation ^ 2 +
descheduler:nodeutilization:memory:p66_5m:positivedeviation ^ 2 +
descheduler:nodepressure:memory:p66_5m:positivedeviation ^ 2
)
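
The ordering argument in the comment above (quantile per dimension first, distance second) can be checked with a small Python sketch. It assumes only two dimensions instead of the four used in the rule, one sample per minute, and a hypothetical `quantile` helper with linear interpolation; all values are invented.

```python
import math


def quantile(q, values):
    """Linear-interpolation quantile (illustrative stand-in for quantile_over_time)."""
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = int(math.floor(rank))
    upper = min(len(ordered) - 1, lower + 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Asynchronous noise: CPU spikes in minute 1, memory spikes in minute 3.
cpu = [0.6, 0.0, 0.0, 0.0, 0.0]
mem = [0.0, 0.0, 0.6, 0.0, 0.0]

# Approach used by the rule: filter each dimension, then take the distance.
per_dimension_first = math.hypot(quantile(0.66, cpu), quantile(0.66, mem))

# Alternative: take the distance per sample, then filter the distance.
distances = [math.hypot(c, m) for c, m in zip(cpu, mem)]
distance_first = quantile(0.66, distances)

print(f"quantile per dimension, then distance: {per_dimension_first:.3f}")
print(f"distance, then quantile:               {distance_first:.3f}")
```

In this toy case each dimension has only one spike in five samples, so the per-dimension p66 removes both. The distance series, however, has two non-zero samples out of five, so its p66 still reports part of the spike.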

# Stable Linear Amplified Ideal Point Positive Distance (k=3.0)
- record: descheduler:node:linear_amplified_ideal_point_positive_distance:k3:p66_5m
expr: |-
clamp_max(
3 * descheduler:node:ideal_point_positive_distance:p66_5m,
1.0
)

# Count successful evictions by the LowNodeUtilization strategy per node over the last 10 minutes.
# Nodes with no matching evictions fall back to 0 through the "or" clause.
- record: descheduler:node:eviction_count:10m
expr: |-
label_replace(
sum by (node) (increase(descheduler_pods_evicted_total{strategy="LowNodeUtilization", result="success"}[10m])),
'instance', "$1", 'node', '(.+)'
) or on (instance)
descheduler:nodeutilization:cpu:avg1m * 0

# Calculate the Dampening Factor (Multiplier)
# We use a linear decay: each eviction in the window lowers the multiplier by 0.10.
# 10 or more evictions in 10m effectively "mute" the node (multiplier clamped to 0.01).
- record: descheduler:node:cooldown_multiplier:10m
expr: |-
clamp_min(
1 - (descheduler:node:eviction_count:10m * 0.10),
0.01
)

# Actuation Priority: Stable Distance x Cooldown
# If the node was recently touched, the distance is suppressed.
- record: descheduler:node:actuation_priority:p66_5m
expr: |-
descheduler:node:linear_amplified_ideal_point_positive_distance:k3:p66_5m
* on(instance) descheduler:node:cooldown_multiplier:10m
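
For reviewers who want to sanity-check the arithmetic of the last three rules, here is a rough Python sketch of the same pipeline. The function names are hypothetical; the constants (k = 3, 0.10 per eviction, floor of 0.01, cap of 1.0) are copied from the expressions above, and the inputs are invented.

```python
def cooldown_multiplier(evictions_10m):
    """Mirror of clamp_min(1 - evictions * 0.10, 0.01)."""
    return max(1 - evictions_10m * 0.10, 0.01)


def actuation_priority(stable_distance, evictions_10m, k=3.0):
    """Mirror of clamp_max(k * distance, 1.0) * cooldown multiplier."""
    amplified = min(k * stable_distance, 1.0)
    return amplified * cooldown_multiplier(evictions_10m)


# Hypothetical nodes: (stable p66_5m distance, evictions in the last 10 minutes)
nodes = {
    "node-a": (0.20, 0),   # moderately off the ideal point, untouched
    "node-b": (0.20, 3),   # same distance, recently acted on
    "node-c": (0.50, 10),  # far off, but heavily evicted already
}

for name, (distance, evictions) in nodes.items():
    priority = actuation_priority(distance, evictions)
    print(f"{name}: priority = {priority:.3f}")
```

Because `increase()` extrapolates, the real eviction count can be fractional, so the multiplier in Prometheus may take values between the 0.10 steps shown here.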