Skip to content

Skip nodes without metrics in nodeutilization plugins#1852

Open
agentydragon wants to merge 1 commit into
kubernetes-sigs:masterfrom
agentydragon:skip-nodes-without-metrics
Open

Skip nodes without metrics in nodeutilization plugins#1852
agentydragon wants to merge 1 commit into
kubernetes-sigs:masterfrom
agentydragon:skip-nodes-without-metrics

Conversation

@agentydragon
Copy link
Copy Markdown

@agentydragon agentydragon commented Mar 13, 2026

Description

When using metrics-based node utilization (KubernetesMetrics or Prometheus), the LowNodeUtilization and HighNodeUtilization plugins fail entirely if any Ready node is missing from the collected metrics. This happens when a node is Ready but unreachable by metrics-server.

The metrics collector already handles missing node metrics gracefully by logging an error and continuing. However, the downstream actualUsageClient.sync() and prometheusUsageClient.sync() methods treat a missing node as a hard error, aborting the entire balance cycle for all nodes.

Change the usageClient.sync() interface to return the subset of nodes for which metrics are available. Nodes without metrics are logged at V(1) and excluded from the balance computation rather than causing a fatal error. Both LowNodeUtilization and HighNodeUtilization now operate on the filtered node list.

Checklist

Please ensure your pull request meets the following criteria before submitting
for review, these items will be used by reviewers to assess the quality and
completeness of your changes:

  • Code Readability: Is the code easy to understand, well-structured, and consistent with project conventions?
  • Naming Conventions: Are variable, function, and structs descriptive and consistent?
  • Code Duplication: Is there any repeated code that should be refactored?
    • Some duplicated shapes between Balance implementations from KubernetesMetrics and Prometheus; pre-existing
  • Function/Method Size: Are functions/methods short and focused on a single task?
  • Comments & Documentation: Are comments clear, useful, and not excessive? Were comments updated where necessary?
  • Error Handling: Are errors handled appropriately ?
  • Testing: Are there sufficient unit/integration tests?
  • Performance: Are there any obvious performance issues or unnecessary computations?
  • Dependencies: Are new dependencies justified ?
  • Logging & Monitoring: Is logging used appropriately (not too verbose, not too silent)?
  • Backward Compatibility: Does this change break any existing functionality or APIs?
    • The usageClient interface is unexported, so I think this should be fine. This does change visible behavior in the "metrics server can't reach node" case but I think that's fixing a bug.
  • Resource Management: Are resources (files, connections, memory) managed and released properly?
  • PR Description: Is the PR description clear, providing enough context and explaining the motivation for the change?
  • Documentation & Changelog: Are README and docs updated if necessary?
    • README documents behavior of Low/HighNodeUtilization. I think this behavior is a minor implementation detail, but we can update the description if desired

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Welcome @agentydragon!

It looks like this is your first PR to kubernetes-sigs/descheduler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/descheduler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 13, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @agentydragon. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented Mar 13, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: agentydragon / name: Rai (6e16efc)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Mar 13, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ingvagabund for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 13, 2026
@agentydragon agentydragon marked this pull request as ready for review March 13, 2026 19:52
Copilot AI review requested due to automatic review settings March 13, 2026 19:52
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2026
@k8s-ci-robot k8s-ci-robot requested review from a7i and damemi March 13, 2026 19:52
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts the nodeutilization plugins’ metrics-based usage clients so that nodes missing usage metrics are skipped (instead of aborting the entire balance cycle), improving resilience when metrics-server/Prometheus cannot provide data for all Ready nodes.

Changes:

  • Updated the internal usageClient.sync() contract to return the subset of nodes with available usage data.
  • Modified Low/HighNodeUtilization Balance flows to operate on the filtered node list returned by sync().
  • Added unit tests ensuring Actual and Prometheus usage clients skip nodes that have no collected metrics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
pkg/framework/plugins/nodeutilization/usageclients.go Changes sync() to return filtered nodes; updates actual/prometheus clients to skip nodes missing metrics instead of failing.
pkg/framework/plugins/nodeutilization/lownodeutilization.go Uses syncedNodes for snapshot/capacity computations and “all nodes underutilized” checks.
pkg/framework/plugins/nodeutilization/highnodeutilization.go Uses syncedNodes for snapshot/capacity computations and “all nodes underutilized” checks.
pkg/framework/plugins/nodeutilization/usageclients_test.go Updates existing tests for new sync() signature and adds skip-without-metrics test coverage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/framework/plugins/nodeutilization/usageclients.go Outdated
Comment thread pkg/framework/plugins/nodeutilization/usageclients.go
Comment thread pkg/framework/plugins/nodeutilization/usageclients.go
Comment thread pkg/framework/plugins/nodeutilization/highnodeutilization.go
Copilot AI review requested due to automatic review settings March 13, 2026 21:11
@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 13, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes metrics-based node utilization balancing (KubernetesMetrics / Prometheus) so that a single Ready node missing from collected metrics no longer aborts the entire LowNodeUtilization / HighNodeUtilization balance cycle. Instead, usage clients return only the subset of nodes with available metrics, and the plugins operate on that filtered list.

Changes:

  • Update the internal usageClient.sync() contract to return ([]*v1.Node, error) (nodes with metrics) instead of just error.
  • Make actualUsageClient and prometheusUsageClient skip nodes missing metrics (log at V(1)) rather than failing the whole cycle.
  • Add unit tests verifying both usage clients exclude nodes without metrics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
pkg/framework/plugins/nodeutilization/usageclients.go Changes sync() to return a filtered node list; implements “skip missing metrics” behavior for actual and Prometheus usage clients.
pkg/framework/plugins/nodeutilization/lownodeutilization.go Uses the filtered syncedNodes list for snapshot/capacity and related logic.
pkg/framework/plugins/nodeutilization/highnodeutilization.go Uses the filtered syncedNodes list for snapshot/capacity and related logic.
pkg/framework/plugins/nodeutilization/usageclients_test.go Updates existing tests for new sync() signature and adds coverage for skipping nodes without metrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

When using metrics-based node utilization (KubernetesMetrics or
Prometheus), actualUsageClient.sync() and prometheusUsageClient.sync()
treat a missing node as a hard error, aborting the entire balance cycle
for all nodes. This means a single unreachable node (e.g. a roaming
node with intermittent connectivity) prevents the descheduler from
performing any load balancing.

Change usageClient.sync() to return ([]*v1.Node, error) — the returned
slice is the subset of input nodes for which usage data is available.
Nodes without metrics are logged at V(1) and skipped. When all nodes
are missing metrics, a V(0) message is logged indicating the balance
cycle will be a no-op.
@agentydragon agentydragon force-pushed the skip-nodes-without-metrics branch from abcad73 to 6e16efc Compare March 14, 2026 00:48
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 14, 2026
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 5, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ricardomaraschini
Copy link
Copy Markdown
Contributor

Before reviewing this, please rebase.

@ingvagabund
Copy link
Copy Markdown
Contributor

When using metrics-based node utilization (KubernetesMetrics or Prometheus), the LowNodeUtilization and HighNodeUtilization plugins fail entirely if any Ready node is missing from the collected metrics. This happens when a node is Ready but unreachable by metrics-server.

This works by design. If either of the ready nodes is not reachable from the metrics server either of the descheduler plugins does not have the full resource view of the nodes. Thus, the decision making logic can not guarantee the utilization among ready nodes is computed properly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants