Skip nodes without metrics in nodeutilization plugins #1852
Welcome @agentydragon!
Hi @agentydragon. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Pull request overview
This PR adjusts the nodeutilization plugins’ metrics-based usage clients so that nodes missing usage metrics are skipped (instead of aborting the entire balance cycle), improving resilience when metrics-server/Prometheus cannot provide data for all Ready nodes.
Changes:
- Updated the internal usageClient.sync() contract to return the subset of nodes with available usage data.
- Modified Low/HighNodeUtilization Balance flows to operate on the filtered node list returned by sync().
- Added unit tests ensuring Actual and Prometheus usage clients skip nodes that have no collected metrics.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/framework/plugins/nodeutilization/usageclients.go | Changes sync() to return filtered nodes; updates actual/prometheus clients to skip nodes missing metrics instead of failing. |
| pkg/framework/plugins/nodeutilization/lownodeutilization.go | Uses syncedNodes for snapshot/capacity computations and “all nodes underutilized” checks. |
| pkg/framework/plugins/nodeutilization/highnodeutilization.go | Uses syncedNodes for snapshot/capacity computations and “all nodes underutilized” checks. |
| pkg/framework/plugins/nodeutilization/usageclients_test.go | Updates existing tests for new sync() signature and adds skip-without-metrics test coverage. |
Pull request overview
This PR fixes metrics-based node utilization balancing (KubernetesMetrics / Prometheus) so that a single Ready node missing from collected metrics no longer aborts the entire LowNodeUtilization / HighNodeUtilization balance cycle. Instead, usage clients return only the subset of nodes with available metrics, and the plugins operate on that filtered list.
Changes:
- Update the internal usageClient.sync() contract to return ([]*v1.Node, error) (nodes with metrics) instead of just error.
- Make actualUsageClient and prometheusUsageClient skip nodes missing metrics (log at V(1)) rather than failing the whole cycle.
- Add unit tests verifying both usage clients exclude nodes without metrics.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/framework/plugins/nodeutilization/usageclients.go | Changes sync() to return a filtered node list; implements “skip missing metrics” behavior for actual and Prometheus usage clients. |
| pkg/framework/plugins/nodeutilization/lownodeutilization.go | Uses the filtered syncedNodes list for snapshot/capacity and related logic. |
| pkg/framework/plugins/nodeutilization/highnodeutilization.go | Uses the filtered syncedNodes list for snapshot/capacity and related logic. |
| pkg/framework/plugins/nodeutilization/usageclients_test.go | Updates existing tests for new sync() signature and adds coverage for skipping nodes without metrics. |
When using metrics-based node utilization (KubernetesMetrics or Prometheus), actualUsageClient.sync() and prometheusUsageClient.sync() treat a missing node as a hard error, aborting the entire balance cycle for all nodes. This means a single unreachable node (e.g. a roaming node with intermittent connectivity) prevents the descheduler from performing any load balancing.

Change usageClient.sync() to return ([]*v1.Node, error), where the returned slice is the subset of input nodes for which usage data is available. Nodes without metrics are logged at V(1) and skipped. When all nodes are missing metrics, a V(0) message is logged indicating the balance cycle will be a no-op.
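The filtering contract described in this commit message can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the code from the PR: `Node`, `syncNodes`, and the `map[string]int64` usage shape are hypothetical stand-ins for `*v1.Node` and the real usage clients, and the standard `log` package stands in for the V(1)/V(0) log levels.

```go
package main

import (
	"fmt"
	"log"
)

// Node is a hypothetical stand-in for *v1.Node; only the name matters here.
type Node struct {
	Name string
}

// syncNodes returns the subset of nodes that have an entry in usage,
// preserving input order. usage maps node name to CPU usage in millicores
// (an assumed shape, not the descheduler's actual data structure).
func syncNodes(nodes []*Node, usage map[string]int64) ([]*Node, error) {
	synced := make([]*Node, 0, len(nodes))
	for _, n := range nodes {
		if _, ok := usage[n.Name]; !ok {
			// The PR describes logging this at V(1); plain log keeps the sketch stdlib-only.
			log.Printf("skipping node %q: no usage metrics", n.Name)
			continue
		}
		synced = append(synced, n)
	}
	if len(synced) == 0 {
		// The PR describes logging this case at V(0).
		log.Print("no nodes have usage metrics; balance cycle will be a no-op")
	}
	return synced, nil
}

func main() {
	nodes := []*Node{{Name: "a"}, {Name: "roaming"}, {Name: "b"}}
	usage := map[string]int64{"a": 500, "b": 1200}

	synced, err := syncNodes(nodes, usage)
	if err != nil {
		log.Fatal(err)
	}
	for _, n := range synced {
		fmt.Println(n.Name)
	}
}
```

In this sketch a missing entry is never an error, so the error return is reserved for failures that affect every node, such as a failed query.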
abcad73 to 6e16efc
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Before reviewing this, please rebase.
This works by design. If any Ready node is not reachable from the metrics server, neither descheduler plugin has a full resource view of the nodes. Thus, the decision-making logic cannot guarantee that utilization among Ready nodes is computed properly.
Description
When using metrics-based node utilization (KubernetesMetrics or Prometheus), the LowNodeUtilization and HighNodeUtilization plugins fail entirely if any Ready node is missing from the collected metrics. This happens when a node is Ready but unreachable by metrics-server.
The metrics collector already handles missing node metrics gracefully by logging an error and continuing. However, the downstream actualUsageClient.sync() and prometheusUsageClient.sync() methods treat a missing node as a hard error, aborting the entire balance cycle for all nodes.
Change the usageClient.sync() interface to return the subset of nodes for which metrics are available. Nodes without metrics are logged at V(1) and excluded from the balance computation rather than causing a fatal error. Both LowNodeUtilization and HighNodeUtilization now operate on the filtered node list.
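How a Balance flow might consume the filtered list can be sketched as below. The names `usageClient`, `sync`, and `syncedNodes` come from the PR discussion; everything else (`Node`, `fakeClient`, `balance`, and the early return on an empty list) is a hypothetical illustration and not the plugins' actual code.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for *v1.Node.
type Node struct{ Name string }

// usageClient mirrors the contract described in the PR: sync returns the
// nodes for which usage data is available.
type usageClient interface {
	sync(nodes []*Node) ([]*Node, error)
}

// fakeClient is an illustrative test double that reports metrics only for
// the nodes listed in withMetrics.
type fakeClient struct{ withMetrics map[string]bool }

func (c fakeClient) sync(nodes []*Node) ([]*Node, error) {
	var out []*Node
	for _, n := range nodes {
		if c.withMetrics[n.Name] {
			out = append(out, n)
		}
	}
	return out, nil
}

// balance returns the names of the nodes a cycle would consider. It works
// only on the filtered list, so nodes without metrics never reach the
// utilization computation.
func balance(c usageClient, nodes []*Node) ([]string, error) {
	syncedNodes, err := c.sync(nodes)
	if err != nil {
		return nil, fmt.Errorf("syncing usage: %w", err)
	}
	if len(syncedNodes) == 0 {
		return nil, nil // nothing to balance this cycle
	}
	names := make([]string, 0, len(syncedNodes))
	for _, n := range syncedNodes {
		names = append(names, n.Name)
	}
	return names, nil
}

func main() {
	nodes := []*Node{{Name: "a"}, {Name: "roaming"}, {Name: "b"}}
	client := fakeClient{withMetrics: map[string]bool{"a": true, "b": true}}

	names, err := balance(client, nodes)
	fmt.Println(names, err)
}
```

One consequence of this design is that thresholds are evaluated against a partial view of the cluster, which is the trade-off raised in the review discussion.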
Checklist
Please ensure your pull request meets the following criteria before submitting
for review. These items will be used by reviewers to assess the quality and
completeness of your changes: