Skip nodes without metrics in nodeutilization plugins by agentydragon · Pull Request #1852 · kubernetes-sigs/descheduler

agentydragon · 2026-03-13T19:41:36Z

Description

When using metrics-based node utilization (KubernetesMetrics or Prometheus), the LowNodeUtilization and HighNodeUtilization plugins fail entirely if any Ready node is missing from the collected metrics. This happens when a node is Ready but unreachable by metrics-server.

The metrics collector already handles missing node metrics gracefully by logging an error and continuing. However, the downstream actualUsageClient.sync() and prometheusUsageClient.sync() methods treat a missing node as a hard error, aborting the entire balance cycle for all nodes.

Change the usageClient.sync() interface to return the subset of nodes for which metrics are available. Nodes without metrics are logged at V(1) and excluded from the balance computation rather than causing a fatal error. Both LowNodeUtilization and HighNodeUtilization now operate on the filtered node list.

Checklist

Please ensure your pull request meets the following criteria before submitting
for review, these items will be used by reviewers to assess the quality and
completeness of your changes:

k8s-ci-robot · 2026-03-13T19:41:46Z

Welcome @agentydragon!

It looks like this is your first PR to kubernetes-sigs/descheduler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/descheduler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2026-03-13T19:41:47Z

Hi @agentydragon. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

linux-foundation-easycla · 2026-03-13T19:41:48Z

The committers listed above are authorized under a signed CLA.

✅ login: agentydragon / name: Rai (6e16efc)

k8s-ci-robot · 2026-03-13T19:41:54Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ingvagabund for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

This PR adjusts the nodeutilization plugins’ metrics-based usage clients so that nodes missing usage metrics are skipped (instead of aborting the entire balance cycle), improving resilience when metrics-server/Prometheus cannot provide data for all Ready nodes.

Changes:

Updated the internal usageClient.sync() contract to return the subset of nodes with available usage data.
Modified Low/HighNodeUtilization Balance flows to operate on the filtered node list returned by sync().
Added unit tests ensuring Actual and Prometheus usage clients skip nodes that have no collected metrics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
pkg/framework/plugins/nodeutilization/usageclients.go	Changes `sync()` to return filtered nodes; updates actual/prometheus clients to skip nodes missing metrics instead of failing.
pkg/framework/plugins/nodeutilization/lownodeutilization.go	Uses `syncedNodes` for snapshot/capacity computations and “all nodes underutilized” checks.
pkg/framework/plugins/nodeutilization/highnodeutilization.go	Uses `syncedNodes` for snapshot/capacity computations and “all nodes underutilized” checks.
pkg/framework/plugins/nodeutilization/usageclients_test.go	Updates existing tests for new `sync()` signature and adds skip-without-metrics test coverage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

This PR fixes metrics-based node utilization balancing (KubernetesMetrics / Prometheus) so that a single Ready node missing from collected metrics no longer aborts the entire LowNodeUtilization / HighNodeUtilization balance cycle. Instead, usage clients return only the subset of nodes with available metrics, and the plugins operate on that filtered list.

Changes:

Update the internal usageClient.sync() contract to return ([]*v1.Node, error) (nodes with metrics) instead of just error.
Make actualUsageClient and prometheusUsageClient skip nodes missing metrics (log at V(1)) rather than failing the whole cycle.
Add unit tests verifying both usage clients exclude nodes without metrics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
`pkg/framework/plugins/nodeutilization/usageclients.go`	Changes `sync()` to return a filtered node list; implements “skip missing metrics” behavior for actual and Prometheus usage clients.
`pkg/framework/plugins/nodeutilization/lownodeutilization.go`	Uses the filtered `syncedNodes` list for snapshot/capacity and related logic.
`pkg/framework/plugins/nodeutilization/highnodeutilization.go`	Uses the filtered `syncedNodes` list for snapshot/capacity and related logic.
`pkg/framework/plugins/nodeutilization/usageclients_test.go`	Updates existing tests for new `sync()` signature and adds coverage for skipping nodes without metrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

When using metrics-based node utilization (KubernetesMetrics or Prometheus), actualUsageClient.sync() and prometheusUsageClient.sync() treat a missing node as a hard error, aborting the entire balance cycle for all nodes. This means a single unreachable node (e.g. a roaming node with intermittent connectivity) prevents the descheduler from performing any load balancing. Change usageClient.sync() to return ([]*v1.Node, error) — the returned slice is the subset of input nodes for which usage data is available. Nodes without metrics are logged at V(1) and skipped. When all nodes are missing metrics, a V(0) message is logged indicating the balance cycle will be a no-op.

k8s-ci-robot · 2026-05-05T11:48:22Z

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ricardomaraschini · 2026-05-06T11:12:53Z

Before reviewing this, please rebase.

ingvagabund · 2026-05-11T10:59:28Z

When using metrics-based node utilization (KubernetesMetrics or Prometheus), the LowNodeUtilization and HighNodeUtilization plugins fail entirely if any Ready node is missing from the collected metrics. This happens when a node is Ready but unreachable by metrics-server.

This works by design. If either of the ready nodes is not reachable from the metrics server either of the descheduler plugins does not have the full resource view of the nodes. Thus, the decision making logic can not guarantee the utilization among ready nodes is computed properly.

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2026

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 13, 2026

k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Mar 13, 2026

k8s-ci-robot requested review from jklaw90 and knelasevero March 13, 2026 19:41

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 13, 2026

agentydragon marked this pull request as ready for review March 13, 2026 19:52

Copilot AI review requested due to automatic review settings March 13, 2026 19:52

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2026

k8s-ci-robot requested review from a7i and damemi March 13, 2026 19:52

Copilot started reviewing on behalf of agentydragon March 13, 2026 19:52 View session

Copilot AI reviewed Mar 13, 2026

View reviewed changes

Comment thread pkg/framework/plugins/nodeutilization/usageclients.go Outdated

Comment thread pkg/framework/plugins/nodeutilization/usageclients.go

Comment thread pkg/framework/plugins/nodeutilization/usageclients.go

Comment thread pkg/framework/plugins/nodeutilization/highnodeutilization.go

Copilot AI review requested due to automatic review settings March 13, 2026 21:11

k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 13, 2026

Copilot started reviewing on behalf of agentydragon March 13, 2026 21:11 View session

Copilot AI reviewed Mar 13, 2026

View reviewed changes

agentydragon force-pushed the skip-nodes-without-metrics branch from abcad73 to 6e16efc Compare March 14, 2026 00:48

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 14, 2026

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Skip nodes without metrics in nodeutilization plugins#1852

Skip nodes without metrics in nodeutilization plugins#1852
agentydragon wants to merge 1 commit into
kubernetes-sigs:masterfrom
agentydragon:skip-nodes-without-metrics

agentydragon commented Mar 13, 2026 •

edited

Loading

Uh oh!

k8s-ci-robot commented Mar 13, 2026

Uh oh!

k8s-ci-robot commented Mar 13, 2026

Uh oh!

linux-foundation-easycla Bot commented Mar 13, 2026 •

edited

Loading

Uh oh!

k8s-ci-robot commented Mar 13, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

k8s-ci-robot commented May 5, 2026

Uh oh!

ricardomaraschini commented May 6, 2026

Uh oh!

ingvagabund commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

agentydragon commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Uh oh!

k8s-ci-robot commented Mar 13, 2026

Uh oh!

k8s-ci-robot commented Mar 13, 2026

Uh oh!

linux-foundation-easycla Bot commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Mar 13, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

k8s-ci-robot commented May 5, 2026

Uh oh!

ricardomaraschini commented May 6, 2026

Uh oh!

ingvagabund commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

agentydragon commented Mar 13, 2026 •

edited

Loading

linux-foundation-easycla Bot commented Mar 13, 2026 •

edited

Loading