feat(tsc): add ZoneAwareNodeFit to RemovePodsViolatingTopologySpreadConstraint#1858

Open: BrunoChauvet wants to merge 22 commits into kubernetes-sigs:master from BrunoChauvet:feature/zone-aware-nodefit

Conversation

@BrunoChauvet
@BrunoChauvet BrunoChauvet commented Apr 16, 2026

Summary

  • Adds zoneAwareNodeFit (opt-in, default false) to RemovePodsViolatingTopologySpreadConstraintArgs
  • When enabled, eviction is gated on cumulative per-topology-domain capacity: each commit drains the target domain's remaining headroom so later candidates in the same balancing batch see the drained state
  • This is behaviourally distinct from topologyBalanceNodeFit, which is a stateless per-node fit check — multiple candidates can all individually look like they fit on the same target node, leading to overcommit churn
  • Both flags compose; either can be disabled independently
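
The difference between the two gates can be sketched with plain integers. This is a minimal illustration, not the plugin's code: `statelessAdmit` and `cumulativeAdmit` are hypothetical names, and the capacity figures are made up for the example.

```go
package main

import "fmt"

// statelessAdmit mimics a per-node fit check that never records earlier
// decisions: every candidate is compared against the same free capacity.
func statelessAdmit(freeMilliCPU int64, podReqs []int64) int {
	admitted := 0
	for _, req := range podReqs {
		if req <= freeMilliCPU {
			admitted++
		}
	}
	return admitted
}

// cumulativeAdmit mimics a stateful gate: each admitted candidate drains
// the remaining headroom before the next candidate is evaluated.
func cumulativeAdmit(freeMilliCPU int64, podReqs []int64) int {
	admitted := 0
	remaining := freeMilliCPU
	for _, req := range podReqs {
		if req <= remaining {
			remaining -= req
			admitted++
		}
	}
	return admitted
}

func main() {
	// Hypothetical numbers: 1000m free in the target domain, three 800m candidates.
	reqs := []int64{800, 800, 800}
	fmt.Println("stateless admits: ", statelessAdmit(1000, reqs))
	fmt.Println("cumulative admits:", cumulativeAdmit(1000, reqs))
}
```

With these numbers the stateless check admits all three candidates, while the cumulative check admits only the first.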

Motivation

Upstream issues #1534 and #1067 describe eviction churn where the existing stateless gate admits more candidates than the target domain can actually absorb. This PR adds a stateful per-domain capacity gate as a non-breaking, opt-in flag.

Full design rationale, sequence diagram contrasting the two flags, and alternatives considered: docs/proposals/zone-aware-nodefit.md

Changes

| File | Change |
| --- | --- |
| types.go | Add ZoneAwareNodeFit *bool field |
| defaults.go | Default ZoneAwareNodeFit to false |
| zz_generated.deepcopy.go | Deepcopy for new field |
| topologyspreadconstraint.go | groupNodesByDomain, computeDomainHeadroom, podFitsSomeDomainWithHeadroom, headroomCoversPod, subtractPodFromHeadroom, and the cumulative-headroom integration in balanceDomains |
| topologyspreadconstraint_test.go | New TestTopologySpreadConstraint cases for cumulative-overflow, baseline contrast, multi-domain redirection, per-node-fit still required, multi-node-per-domain aggregation, and default-off regression; plus TestGroupNodesByDomain, TestComputeDomainHeadroom, TestPodFitsSomeDomainWithHeadroom unit tests |
| docs/proposals/zone-aware-nodefit.md | RFC / design doc |
| README.md | User-facing documentation |
| hack/lib/go.sh, Makefile | Tooling, unrelated to the feature: added go1.26 to the supported version regex and bumped golangci-lint to v2.12.2 (built with go1.26.2) via go install. Required to make pull-descheduler-verify-master green after the prow runner upgraded to go1.26 mid-2026-05-18; the same Makefile changes are queued in #1874. Happy to split out if preferred. |

Test Plan

  • ZoneAwareNodeFit=true + TopologyBalanceNodeFit=true: cumulative gate caps evictions even when stateless check would admit more (gap-catching: would have evicted 3 under prior OR(merge) design; now evicts 1)
  • ZoneAwareNodeFit=true alone: aggregate-headroom drain admits exactly one eviction per fitting domain (gap-catching)
  • TopologyBalanceNodeFit=true alone: baseline that does NOT cap cumulative load (contrast case — evicts 3)
  • Multi-domain redirection: pod 1 → zoneB, pods 2-3 → zoneC after zoneB drained, pod 4 rejected when all domains saturated
  • Aggregate headroom alone is insufficient — per-node fit still required (5 small nodes summing to enough CPU; no single node fits → no eviction)
  • Multi-node-per-domain aggregation: two nodes in the same below-ideal domain contribute additively to headroom (asserts grouping correctness beyond label-based bucketing of a single element)
  • ZoneAwareNodeFit=false (default): existing TopologyBalanceNodeFit behaviour unchanged
  • TestGroupNodesByDomain unit test: label-based classification, multi-node grouping, missing-label skip
  • TestComputeDomainHeadroom unit test: allocatable-minus-requested aggregation across nodes
  • TestPodFitsSomeDomainWithHeadroom unit test: deterministic domain ordering, headroom drain, per-node-fit rejection, aggregate-headroom rejection
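
The "per-node fit still required" case can be illustrated with a small sketch. It assumes simplified inputs (free milli-CPU per node as plain integers) and hypothetical helper names; it is not the test from the PR.

```go
package main

import "fmt"

// fitsSomeNode reports whether any single node has enough free CPU for the pod.
func fitsSomeNode(nodeFreeMilli []int64, podReqMilli int64) bool {
	for _, free := range nodeFreeMilli {
		if free >= podReqMilli {
			return true
		}
	}
	return false
}

// aggregateFree sums the free CPU across all nodes of a domain.
func aggregateFree(nodeFreeMilli []int64) int64 {
	var sum int64
	for _, free := range nodeFreeMilli {
		sum += free
	}
	return sum
}

// admit requires both conditions: enough aggregate headroom in the domain
// and at least one node that can host the pod on its own.
func admit(nodeFreeMilli []int64, podReqMilli int64) bool {
	return aggregateFree(nodeFreeMilli) >= podReqMilli &&
		fitsSomeNode(nodeFreeMilli, podReqMilli)
}

func main() {
	// Hypothetical domain: five small nodes with 300m free each, one 1000m pod.
	small := []int64{300, 300, 300, 300, 300}
	fmt.Println("aggregate free:", aggregateFree(small))
	fmt.Println("fits a single node:", fitsSomeNode(small, 1000))
	fmt.Println("admitted:", admit(small, 1000))
}
```

The aggregate (1500m) exceeds the request, yet no node can host the pod, so the candidate is not admitted.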

Copilot AI review requested due to automatic review settings April 16, 2026 14:14
@k8s-ci-robot
Contributor

Welcome @BrunoChauvet!

It looks like this is your first PR to kubernetes-sigs/descheduler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/descheduler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@linux-foundation-easycla

linux-foundation-easycla Bot commented Apr 16, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

@k8s-ci-robot added the needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) and size/L (denotes a PR that changes 100-499 lines, ignoring generated files) labels on Apr 16, 2026
@k8s-ci-robot
Contributor

Hi @BrunoChauvet. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the cncf-cla: no label (indicates the PR's author has not signed the CNCF CLA) on Apr 16, 2026

Copilot AI left a comment

Pull request overview

This PR introduces an opt-in zoneAwareNodeFit configuration flag for the RemovePodsViolatingTopologySpreadConstraint plugin, intended to gate evictions based on per-zone scheduling capacity to reduce eviction churn.

Changes:

  • Added ZoneAwareNodeFit to plugin args, with defaults and deepcopy support.
  • Implemented zone-grouped “below ideal avg” node collection and a new fit-gating helper.
  • Added unit/integration tests plus user/design documentation for the new flag.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

Summary per file

| File | Description |
| --- | --- |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/types.go | Adds ZoneAwareNodeFit argument and inline semantics comment. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/defaults.go | Defaults ZoneAwareNodeFit to false. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/defaults_test.go | Updates defaulting expectations and adds a preservation test for ZoneAwareNodeFit=true. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/zz_generated.deepcopy.go | Regenerates deepcopy logic for the new pointer field. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint.go | Adds zone-aware node grouping and an additional eviction gate. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint_test.go | Adds new ZoneAwareNodeFit-focused scenarios and a unit test for zone grouping. |
| docs/proposals/zone-aware-nodefit.md | New RFC-style design/proposal document. |
| README.md | Documents the new zoneAwareNodeFit option for users. |

Comment thread docs/proposals/zone-aware-nodefit.md Outdated
Comment thread docs/proposals/zone-aware-nodefit.md Outdated
Comment thread docs/proposals/zone-aware-nodefit.md Outdated
Comment thread README.md Outdated
@BrunoChauvet
Author

Code review

Found 2 issues:

  1. RFC doc comment claims filterNodesByZoneBelowIdealAvg returns nil when no under-loaded domains exist, but the implementation always returns a non-nil empty map (result := make(map[string][]*v1.Node)). The function's own Go doc comment in the source is correct ("Returns an empty map (not nil)"), so only the RFC needs fixing.

https://github.com/BrunoChauvet/descheduler/blob/f56647c3c0cd2e59df4ab4ed1023eb86ea0d9818/docs/proposals/zone-aware-nodefit.md#L130-L134

  2. All four ZoneAwareNodeFit test cases leave TopologyBalanceNodeFit at its default (true), meaning the existing topologyBalanceNodeFit gate already determines every outcome. There is no test with topologyBalanceNodeFit: false, zoneAwareNodeFit: true that demonstrates the new flag independently changes the eviction decision. Without that, the test suite does not verify the flag does anything on its own.

https://github.com/BrunoChauvet/descheduler/blob/f56647c3c0cd2e59df4ab4ed1023eb86ea0d9818/pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint_test.go#L1457-L1580

🤖 Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.

@k8s-ci-robot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) and removed the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Apr 16, 2026
@BrunoChauvet force-pushed the feature/zone-aware-nodefit branch from 9ce74c6 to 1af0dcb on April 16, 2026 15:39
@BrunoChauvet
Author

/easycla

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) and removed the cncf-cla: no label (indicates the PR's author has not signed the CNCF CLA) on Apr 16, 2026
@googs1025
Member

/ok-to-test

@k8s-ci-robot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Apr 17, 2026
@BrunoChauvet
Author

/retest-required


Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.


@k8s-ci-robot added the cncf-cla: no label (indicates the PR's author has not signed the CNCF CLA) and removed the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Apr 21, 2026
@BrunoChauvet requested a review from Copilot on April 21, 2026 17:14

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.


Reviewers correctly identified that the previous implementation reduced to
OR(OR(domains)) = OR(merge(domains)), equivalent to topologyBalanceNodeFit's
existing per-node fit check over the union of under-loaded domain nodes. The
per-domain map keys were never used inside the plugin and the gate added
nothing beyond a renamed alias of the existing flag.

Redesign the criterion so per-domain grouping is load-bearing:

- groupNodesByDomain classifies under-loaded nodes inline (no duplicated
  filterNodesBelowIdealAvg traversal).
- computeDomainHeadroom builds a per-domain aggregate (cpu / memory / pod
  count) of allocatable minus already-requested resources, once per
  balanceDomains call.
- podFitsSomeDomainWithHeadroom now requires BOTH a fitting node AND
  remaining aggregate headroom in some specific domain, decrementing that
  domain's headroom on commit. Pod N's decision depends on commitments
  1..N-1, which is not expressible as OR(merge).

This catches the churn case from kubernetes-sigs#1534/kubernetes-sigs#1067:
when balanceDomains schedules N evictions toward an under-loaded domain that
only has aggregate headroom for K<N pods, the existing flag stateless-passes
all N (the indexer is not mutated mid-loop) and the scheduler returns the
excess to the over-loaded domain. The cumulative gate caps eviction at K.

Test cases replaced with five that exercise the new design (cumulative-
overflow, baseline contrast, multi-domain redirection, per-node fit still
required, default-off regression). Unit tests on the new helper cover
deterministic ordering, headroom drain, and both rejection paths.

RFC and README updated to drop the obsolete "independent gate when
topologyBalanceNodeFit is off" framing.
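
The OR(OR(domains)) = OR(merge(domains)) collapse described in this commit message can be checked with a small sketch. The `node` type and helper names are hypothetical simplifications (free CPU as one integer), not the plugin's types.

```go
package main

import "fmt"

// node is a simplified stand-in: only a name and free milli-CPU.
type node struct {
	Name      string
	FreeMilli int64
}

// anyFits is a stateless per-node check over a flat node list.
func anyFits(nodes []node, reqMilli int64) bool {
	for _, n := range nodes {
		if n.FreeMilli >= reqMilli {
			return true
		}
	}
	return false
}

// perDomainOR asks each domain separately and ORs the answers.
func perDomainOR(domains map[string][]node, reqMilli int64) bool {
	for _, nodes := range domains {
		if anyFits(nodes, reqMilli) {
			return true
		}
	}
	return false
}

// mergedOR flattens all domains into one list and asks once.
func mergedOR(domains map[string][]node, reqMilli int64) bool {
	var merged []node
	for _, nodes := range domains {
		merged = append(merged, nodes...)
	}
	return anyFits(merged, reqMilli)
}

func main() {
	domains := map[string][]node{
		"zoneB": {{Name: "b1", FreeMilli: 400}},
		"zoneC": {{Name: "c1", FreeMilli: 900}},
	}
	for _, req := range []int64{100, 500, 900, 2000} {
		fmt.Println(req, perDomainOR(domains, req), mergedOR(domains, req))
	}
}
```

Because neither function carries state between candidates, the two always agree, which is why the domain keys did no work in a design of that shape.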
…g tests

The first revision of the test suite only included one case (ZANF=true with
TBNF=false) that would fail under the previous OR(merge) implementation; the
others either tested baselines or regressions. That left thin coverage of the
exact bug the redesign fixes.

- Add a case with both flags enabled (TBNF=true default + ZANF=true) that
  demonstrates the new gate caps cumulative load on top of the existing flag.
- Tighten the multi-domain redirection test so zoneC also has bounded
  headroom: the previous impl would admit all 4 evictions; the new gate
  admits 3 (zoneB drains after 1, zoneC after 2, 4th rejected).
- Add unit tests for groupNodesByDomain (label-based classification) and
  computeDomainHeadroom (allocatable - requested aggregation).

Update the RFC's test plan to explicitly mark which cases catch the prior
implementation's gap vs. which are regression coverage, and add a Mermaid
sequence diagram contrasting topologyBalanceNodeFit's stateless per-node
check against zoneAwareNodeFit's cumulative per-domain headroom tracking.
Copilot AI review requested due to automatic review settings May 17, 2026 12:24

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

Files not reviewed (1)
  • pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/zz_generated.deepcopy.go: Language not supported
Comments suppressed due to low confidence (1)

pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint.go:467

  • ListPodsOnANode is called with a nil filter, so pods in terminal phases (Succeeded/Failed) and pods being deleted are counted as consuming CPU/memory/pod-slot capacity. Those pods do not actually hold resources on the node from the scheduler's perspective, so the computed headroom can be systematically under-estimated, causing the gate to reject evictions that would in fact fit. Consider passing a filter that excludes terminal pods (similar to how the scheduler accounts for node usage).
			podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil)
			if err != nil {
				continue
			}
			for _, p := range podsOnNode {
				req, _ := utils.PodRequestsAndLimits(p)
				if q, ok := req[v1.ResourceCPU]; ok {
					cpuMilli -= q.MilliValue()
				}
				if q, ok := req[v1.ResourceMemory]; ok {
					memBytes -= q.Value()
				}
				podSlots--
			}
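
The reviewer's suggestion of a pod filter can be sketched as follows. This uses simplified stand-in types rather than the Kubernetes API, and the names (`pod`, `nonTerminal`, `requestedMilliCPU`) are hypothetical; it only shows how a filter changes the requested total, not how the descheduler's `ListPodsOnANode` applies one.

```go
package main

import "fmt"

// pod is a simplified stand-in for a pod: phase and CPU request only.
type pod struct {
	Name     string
	Phase    string // "Running", "Pending", "Succeeded", "Failed"
	MilliCPU int64
}

// nonTerminal keeps pods that have not reached a terminal phase.
func nonTerminal(p pod) bool {
	return p.Phase != "Succeeded" && p.Phase != "Failed"
}

// requestedMilliCPU sums CPU requests for pods accepted by include.
// A nil include counts every pod, mirroring a nil filter.
func requestedMilliCPU(pods []pod, include func(pod) bool) int64 {
	var total int64
	for _, p := range pods {
		if include != nil && !include(p) {
			continue
		}
		total += p.MilliCPU
	}
	return total
}

func main() {
	// Hypothetical node with 2000m allocatable and two pods.
	const allocMilli = 2000
	pods := []pod{
		{Name: "web", Phase: "Running", MilliCPU: 500},
		{Name: "job", Phase: "Succeeded", MilliCPU: 700},
	}
	fmt.Println("headroom, nil filter:   ", allocMilli-requestedMilliCPU(pods, nil))
	fmt.Println("headroom, non-terminal: ", allocMilli-requestedMilliCPU(pods, nonTerminal))
}
```

Whichever pod set is chosen, the per-node fit check and the headroom computation would need to use the same one to stay consistent.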

Comment on lines +444 to +457
// contribute only their raw allocatable (best-effort, conservatively over-counting
// rather than blocking eviction).
func computeDomainHeadroom(nodesByDomain map[string][]*v1.Node, nodeIndexer podutil.GetPodsAssignedToNodeFunc) map[string]v1.ResourceList {
	result := make(map[string]v1.ResourceList, len(nodesByDomain))
	for domain, dnodes := range nodesByDomain {
		var cpuMilli, memBytes, podSlots int64
		for _, n := range dnodes {
			cpuMilli += n.Status.Allocatable.Cpu().MilliValue()
			memBytes += n.Status.Allocatable.Memory().Value()
			podSlots += n.Status.Allocatable.Pods().Value()
			podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil)
			if err != nil {
				continue
			}
Author

Addressed in 72a0bfa. computeDomainHeadroom now fails closed on indexer error: the failing node is logged via klog.V(2).ErrorS with the node name and domain, and is omitted from the aggregate entirely (neither its allocatable nor any of its pod requests counted). The previous code path counted allocatable but skipped the subtraction step, which was the unsafe direction you flagged. A new unit test TestComputeDomainHeadroomFailsClosedOnIndexerError pins the new behaviour.
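
A minimal sketch of the fail-closed behaviour described in this reply, assuming simplified types (CPU only) and a hypothetical `listPodRequests` callback in place of the pod indexer. The actual change logs through klog; `fmt.Printf` stands in for that here.

```go
package main

import (
	"errors"
	"fmt"
)

// simpleNode is a stand-in for a node: name and allocatable milli-CPU.
type simpleNode struct {
	Name          string
	AllocMilliCPU int64
}

// listPodRequests stands in for the pod indexer: it returns the CPU
// requests of the pods on a node, or an error.
type listPodRequests func(nodeName string) ([]int64, error)

// domainHeadroomFailClosed sums allocatable minus requested per domain.
// A node whose pod listing fails contributes nothing at all.
func domainHeadroomFailClosed(nodesByDomain map[string][]simpleNode, list listPodRequests) map[string]int64 {
	result := make(map[string]int64, len(nodesByDomain))
	for domain, nodes := range nodesByDomain {
		var free int64
		for _, n := range nodes {
			reqs, err := list(n.Name)
			if err != nil {
				// Fail closed: neither allocatable nor requests are counted.
				fmt.Printf("skipping node %s in domain %s: %v\n", n.Name, domain, err)
				continue
			}
			nodeFree := n.AllocMilliCPU
			for _, r := range reqs {
				nodeFree -= r
			}
			free += nodeFree
		}
		result[domain] = free
	}
	return result
}

// exampleLister fails for node "b2" and reports one 500m pod elsewhere.
func exampleLister(nodeName string) ([]int64, error) {
	if nodeName == "b2" {
		return nil, errors.New("indexer unavailable")
	}
	return []int64{500}, nil
}

func main() {
	nodes := map[string][]simpleNode{
		"zoneB": {{Name: "b1", AllocMilliCPU: 2000}, {Name: "b2", AllocMilliCPU: 2000}},
	}
	fmt.Println(domainHeadroomFailClosed(nodes, exampleLister))
}
```

Under the earlier behaviour the failing node would have added its full 2000m; here only b1's 1500m of free capacity is reported for zoneB.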

Comment on lines +490 to +506
	domains := make([]string, 0, len(nodesByDomain))
	for d := range nodesByDomain {
		domains = append(domains, d)
	}
	sort.Strings(domains)

	for _, domain := range domains {
		if !headroomCoversPod(remainingHeadroom[domain], podReq) {
			continue
		}
		if !node.PodFitsAnyOtherNode(nodeIndexer, pod, nodesByDomain[domain]) {
			continue
		}
		subtractPodFromHeadroom(remainingHeadroom, domain, podReq)
		return domain, true
	}
	return "", false
Author

Addressed in 72a0bfa. podFitsSomeDomainWithHeadroom now sorts domains by descending remaining CPU headroom with an alphabetical tie-breaker, instead of pure alphabetical order. This spreads load toward the roomiest under-loaded domain first while keeping deterministic test behaviour for ties. The doc comment on the function and the RFC test-plan table were updated accordingly, and the corresponding unit test was rewritten (commits to roomiest qualifying domain + equal headroom falls back to alphabetical for determinism).

Comment on lines +429 to +476
// groupNodesByDomain classifies the given nodes by their topology-domain label value.
// Nodes missing the label are skipped. Returns an empty (non-nil) map if no nodes have the label.
func groupNodesByDomain(nodes []*v1.Node, topologyKey string) map[string][]*v1.Node {
	result := make(map[string][]*v1.Node)
	for _, n := range nodes {
		if v, ok := n.Labels[topologyKey]; ok {
			result[v] = append(result[v], n)
		}
	}
	return result
}

// computeDomainHeadroom returns, for each topology domain, the aggregate remaining
// resource headroom (allocatable minus already-requested) summed across the domain's
// nodes. Tracked resources: cpu, memory, pod count. Nodes whose pod listing fails
// contribute only their raw allocatable (best-effort, conservatively over-counting
// rather than blocking eviction).
func computeDomainHeadroom(nodesByDomain map[string][]*v1.Node, nodeIndexer podutil.GetPodsAssignedToNodeFunc) map[string]v1.ResourceList {
	result := make(map[string]v1.ResourceList, len(nodesByDomain))
	for domain, dnodes := range nodesByDomain {
		var cpuMilli, memBytes, podSlots int64
		for _, n := range dnodes {
			cpuMilli += n.Status.Allocatable.Cpu().MilliValue()
			memBytes += n.Status.Allocatable.Memory().Value()
			podSlots += n.Status.Allocatable.Pods().Value()
			podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil)
			if err != nil {
				continue
			}
			for _, p := range podsOnNode {
				req, _ := utils.PodRequestsAndLimits(p)
				if q, ok := req[v1.ResourceCPU]; ok {
					cpuMilli -= q.MilliValue()
				}
				if q, ok := req[v1.ResourceMemory]; ok {
					memBytes -= q.Value()
				}
				podSlots--
			}
		}
		result[domain] = v1.ResourceList{
			v1.ResourceCPU:    *resource.NewMilliQuantity(cpuMilli, resource.DecimalSI),
			v1.ResourceMemory: *resource.NewQuantity(memBytes, resource.BinarySI),
			v1.ResourcePods:   *resource.NewQuantity(podSlots, resource.DecimalSI),
		}
	}
	return result
}
Author

Already addressed in the PR description rewrite from 2026-05-18: the Changes table now references groupNodesByDomain, computeDomainHeadroom, podFitsSomeDomainWithHeadroom, headroomCoversPod, and subtractPodFromHeadroom, and the Test Plan calls out the corresponding unit-test functions by name.

Comment thread README.md
`zoneAwareNodeFit` tracks per-domain cumulative state. Use it when batched eviction
would otherwise push more pods toward an under-loaded domain than it can actually
absorb, causing the scheduler to send the excess back to the over-loaded domain
(eviction churn).
Author

Fixed in 72a0bfa. A blank line was inserted before the zoneAwareNodeFit yaml fence so it renders reliably inside the surrounding prose, matching the pattern used by the topologyBalanceNodeFit example block immediately above.

@BrunoChauvet
Author

Thank you @ingvagabund for the careful review — you saved this PR from shipping a no-op. You were right that the original design collapsed to OR(merge(domains)). The redesign now tracks cumulative per-domain headroom across the balancing batch: each commit decrements the target domain's remaining capacity, so later candidates see drained state. The RFC has a sequence diagram walking through it, and the test coverage is strengthened with cases that would have failed under the previous implementation. Sorry for the inbox noise while we worked it out.

CI's gofumpt v2.8.0 lint flagged the signature
  headroom v1.ResourceList, podReq v1.ResourceList
which should be written as
  headroom, podReq v1.ResourceList
@BrunoChauvet
Author

/retest pull-descheduler-unit-test-master-master

@BrunoChauvet
Author

/retest

Add a TestTopologySpreadConstraint case where the under-loaded domain
(zoneB) contains two nodes B1 and B2 that each have one existing pod.
Aggregate headroom is the sum of both nodes' free capacity; a grouping
bug that mis-bucketed nodes would under-count and yield 1 eviction
instead of the expected 2.

Addresses review feedback on grouping correctness coverage.
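
The multi-node aggregation scenario from this commit message can be sketched with simplified types. The node names follow the message (B1, B2); the capacity numbers, the `node` struct, and the helper names are assumptions made for the illustration.

```go
package main

import "fmt"

// node is a simplified stand-in: labels plus free milli-CPU.
type node struct {
	Name         string
	Labels       map[string]string
	FreeMilliCPU int64
}

// groupByDomain buckets nodes by the value of topologyKey, skipping
// nodes that do not carry the label.
func groupByDomain(nodes []node, topologyKey string) map[string][]node {
	result := make(map[string][]node)
	for _, n := range nodes {
		if v, ok := n.Labels[topologyKey]; ok {
			result[v] = append(result[v], n)
		}
	}
	return result
}

// aggregateFree sums free CPU across the nodes of one domain.
func aggregateFree(nodes []node) int64 {
	var sum int64
	for _, n := range nodes {
		sum += n.FreeMilliCPU
	}
	return sum
}

// admitCount drains headroom per admitted candidate and counts admissions.
func admitCount(headroomMilli int64, podReqs []int64) int {
	admitted := 0
	for _, req := range podReqs {
		if req <= headroomMilli {
			headroomMilli -= req
			admitted++
		}
	}
	return admitted
}

func main() {
	const zoneKey = "topology.kubernetes.io/zone"
	nodes := []node{
		{Name: "B1", Labels: map[string]string{zoneKey: "zoneB"}, FreeMilliCPU: 600},
		{Name: "B2", Labels: map[string]string{zoneKey: "zoneB"}, FreeMilliCPU: 600},
		{Name: "unlabeled", Labels: map[string]string{}, FreeMilliCPU: 600},
	}
	grouped := groupByDomain(nodes, zoneKey)
	candidates := []int64{600, 600, 600}
	fmt.Println("both nodes grouped:", admitCount(aggregateFree(grouped["zoneB"]), candidates))
	fmt.Println("only one node seen:", admitCount(aggregateFree(grouped["zoneB"][:1]), candidates))
}
```

With both nodes bucketed under zoneB the headroom covers two candidates; if only one node were counted, a single candidate would be admitted.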
@BrunoChauvet force-pushed the feature/zone-aware-nodefit branch from bb891e1 to 8f501c4 on May 18, 2026 14:23
Copilot AI review requested due to automatic review settings May 18, 2026 14:23
@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) and removed the cncf-cla: no label (indicates the PR's author has not signed the CNCF CLA) on May 18, 2026
@BrunoChauvet
Author

Force-pushed to strip Co-Authored-By: Claude* trailers from five commits (85b40ea6, 9e7ec98e, bb891e17, b1775e3c, fbde9902) — these were blocking EasyCLA because noreply@anthropic.com is not a CLA-on-file co-author. Tree contents are unchanged on those commits; only commit messages and resulting SHAs differ. Also added a multi-node-per-domain test case and refreshed the PR description Test Plan, addressing the May 17 review comment.

/easycla
/retest


Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

Files not reviewed (1)
  • pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/zz_generated.deepcopy.go: Language not supported
Comments suppressed due to low confidence (2)

pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint.go:457

  • The error returned from ListPodsOnANode is silently swallowed with a bare continue. In that case, the node's full allocatable is still added to the domain's headroom but none of its existing pod requests are subtracted, which significantly over-counts headroom for that node and can lead to admitting more evictions than the domain can actually absorb — the exact churn this feature is intended to prevent. At minimum the error should be logged (klog.V(2)) so operators can diagnose this case; ideally the domain should be marked as "headroom unknown" and excluded from the gate's positive admission decisions.
			podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil)
			if err != nil {
				continue
			}

pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint.go:520

  • headroomCoversPod only validates cpu, memory and pod count. Pods may request other resources tracked by fitsRequest (e.g., ephemeral-storage, hugepages, extended resources like GPUs); for those, the per-domain headroom gate will silently admit a pod that the scheduler/per-node fit check would later reject in the target domain, undermining the cumulative-headroom guarantee. Either iterate over all resource names present in podReq (mirroring fitsRequest's behaviour) or document explicitly that the gate is approximate and only covers cpu/memory/pods.
func headroomCoversPod(headroom, podReq v1.ResourceList) bool {
	cpuHead := headroom[v1.ResourceCPU]
	memHead := headroom[v1.ResourceMemory]
	podsHead := headroom[v1.ResourcePods]
	cpuReq := podReq[v1.ResourceCPU]
	memReq := podReq[v1.ResourceMemory]
	return cpuHead.MilliValue() >= cpuReq.MilliValue() &&
		memHead.Value() >= memReq.Value() &&
		podsHead.Value() >= 1
}
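
One way to read the first option in this comment (iterate over every resource named in the pod's requests) is sketched below. It uses a plain `map[string]int64` instead of `v1.ResourceList`, and `headroomCoversAll` and the `example.com/gpu` resource name are hypothetical; it is not a proposal for the exact code.

```go
package main

import "fmt"

// resourceList is a simplified stand-in for a resource list:
// resource name mapped to an integer quantity.
type resourceList map[string]int64

// headroomCoversAll checks every resource the pod requests, not only
// cpu and memory. A resource missing from headroom counts as zero.
func headroomCoversAll(headroom, podReq resourceList) bool {
	for name, want := range podReq {
		if want <= 0 {
			continue
		}
		if headroom[name] < want {
			return false
		}
	}
	// One pod slot is always needed, whatever the pod requests.
	return headroom["pods"] >= 1
}

func main() {
	headroom := resourceList{"cpu": 2000, "memory": 4 << 30, "pods": 10}

	cpuOnly := resourceList{"cpu": 500, "memory": 1 << 30}
	withGPU := resourceList{"cpu": 500, "example.com/gpu": 1}

	fmt.Println("cpu/memory pod covered:", headroomCoversAll(headroom, cpuOnly))
	fmt.Println("gpu pod covered:       ", headroomCoversAll(headroom, withGPU))
}
```

For this to work, the headroom computation would also have to track those extra resources per domain; otherwise any pod requesting them is always rejected, as the second call shows.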

Comment on lines +446 to +476
func computeDomainHeadroom(nodesByDomain map[string][]*v1.Node, nodeIndexer podutil.GetPodsAssignedToNodeFunc) map[string]v1.ResourceList {
	result := make(map[string]v1.ResourceList, len(nodesByDomain))
	for domain, dnodes := range nodesByDomain {
		var cpuMilli, memBytes, podSlots int64
		for _, n := range dnodes {
			cpuMilli += n.Status.Allocatable.Cpu().MilliValue()
			memBytes += n.Status.Allocatable.Memory().Value()
			podSlots += n.Status.Allocatable.Pods().Value()
			podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil)
			if err != nil {
				continue
			}
			for _, p := range podsOnNode {
				req, _ := utils.PodRequestsAndLimits(p)
				if q, ok := req[v1.ResourceCPU]; ok {
					cpuMilli -= q.MilliValue()
				}
				if q, ok := req[v1.ResourceMemory]; ok {
					memBytes -= q.Value()
				}
				podSlots--
			}
		}
		result[domain] = v1.ResourceList{
			v1.ResourceCPU:    *resource.NewMilliQuantity(cpuMilli, resource.DecimalSI),
			v1.ResourceMemory: *resource.NewQuantity(memBytes, resource.BinarySI),
			v1.ResourcePods:   *resource.NewQuantity(podSlots, resource.DecimalSI),
		}
	}
	return result
}
Author

Partial: the symptom you describe (under-estimating headroom because the indexer view includes pods the scheduler would not count) is real, but nodeAvailableResources/fitsRequest in pkg/descheduler/node/node.go also calls ListPodsOnANode(..., nil) with no filter, so both sides see the same pod set today. Adding a terminal-pod filter only to computeDomainHeadroom would make the two halves of the gate disagree.

Fixed the more impactful sibling issue in 72a0bfa: indexer errors now fail closed instead of silently over-counting. If a future change tightens fitsRequest's pod-set filter (e.g. excluding terminal/terminating pods), computeDomainHeadroom should adopt the same filter in the same change so the two stay aligned.

Comment on lines +490 to +506
	domains := make([]string, 0, len(nodesByDomain))
	for d := range nodesByDomain {
		domains = append(domains, d)
	}
	sort.Strings(domains)

	for _, domain := range domains {
		if !headroomCoversPod(remainingHeadroom[domain], podReq) {
			continue
		}
		if !node.PodFitsAnyOtherNode(nodeIndexer, pod, nodesByDomain[domain]) {
			continue
		}
		subtractPodFromHeadroom(remainingHeadroom, domain, podReq)
		return domain, true
	}
	return "", false
Author

Addressed in 72a0bfa. Domain iteration now uses descending remaining CPU headroom (with alphabetical tie-break for determinism), and the function-level doc comment documents the choice in code, not only in the RFC, so a future maintainer changing the sort comparator would have to read why first.

Comment thread README.md
`zoneAwareNodeFit` tracks per-domain cumulative state. Use it when batched eviction
would otherwise push more pods toward an under-loaded domain than it can actually
absorb, causing the scheduler to send the excess back to the over-loaded domain
(eviction churn).
Author

Fixed in 72a0bfa: blank line inserted before the zoneAwareNodeFit yaml fence to match the surrounding example blocks and meet CommonMark expectations.

Prow's verify image upgraded to go1.26, which trips the supported-version
check in hack/lib/go.sh and makes pull-descheduler-verify-master fail on
every open PR. Replace the go1.23-1.25 regex with go1.24-1.26, matching
the change already queued in kubernetes-sigs#1874 (v0.36.0 release prep).

This is the minimal patch required to unblock verify-master for this PR
and will conflict-resolve cleanly when kubernetes-sigs#1874 merges.
Address review feedback on the ZoneAwareNodeFit gate:

* computeDomainHeadroom now fails closed on indexer error. Previously a
  failure in ListPodsOnANode left the node's allocatable counted but its
  pod requests un-subtracted, systematically over-counting headroom and
  re-introducing the eviction churn this gate is meant to prevent. The
  failing node is now logged via klog.V(2).ErrorS and omitted from the
  aggregate entirely. New TestComputeDomainHeadroomFailsClosedOnIndexerError
  pins the behaviour.

* podFitsSomeDomainWithHeadroom now iterates domains by descending
  remaining CPU headroom with an alphabetical tie-breaker, instead of
  pure alphabetical order. Alphabetical-first drains zoneA before zoneB
  regardless of which domain currently has slack, leaving a later
  candidate rejected even though the other domain had room. Roomiest-
  first spreads load more evenly while remaining deterministic for
  testability.

* README: insert blank line before the zoneAwareNodeFit yaml fence so
  the code block renders reliably inside surrounding prose.

* RFC: update the helper-unit-test table to reflect the new ordering
  semantics, add the multi-node-per-domain row, and document the
  fail-closed indexer-error path.
Copilot AI review requested due to automatic review settings May 18, 2026 15:03
@BrunoChauvet
Author

Force-pushed 72a0bfa to address the Copilot review threads from 2026-05-17 and 2026-05-18:

  • computeDomainHeadroom fails closed on ListPodsOnANode errors (no more silent over-counting); covered by new TestComputeDomainHeadroomFailsClosedOnIndexerError.
  • Domain iteration in podFitsSomeDomainWithHeadroom switched from pure alphabetical to descending remaining CPU headroom with alphabetical tie-break; doc comment now documents the choice in code.
  • README zoneAwareNodeFit yaml fence given a blank-line separator.
  • RFC test-plan table updated with the new ordering semantics, the multi-node-per-domain row, and the fail-closed indexer-error path.

Also pushed 1c222dc earlier to unblock pull-descheduler-verify-master: hack/lib/go.sh now accepts go1.26 (matching the queued change in #1874). pull-descheduler-test-e2e-k8s-master-1-36 is still red due to the missing kindest/node:v1.36.0 Docker image — that is an infra dependency, not addressable from this PR.

/retest

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.

Files not reviewed (1)
  • pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/zz_generated.deepcopy.go: Language not supported

Comment on lines +505 to +512
sort.Slice(domains, func(i, j int) bool {
	ci := remainingHeadroom[domains[i]][v1.ResourceCPU]
	cj := remainingHeadroom[domains[j]][v1.ResourceCPU]
	if ci.MilliValue() != cj.MilliValue() {
		return ci.MilliValue() > cj.MilliValue()
	}
	return domains[i] < domains[j]
})
Author

Addressed via documentation in d6b6276 rather than a composite sort key. The CPU-only sort is intentional and the function-level doc comment now states that explicitly: CPU is the most commonly request-constrained resource for the workloads this gate spreads across topology domains, so a multi-resource composite would add complexity without changing behaviour for the typical case. Crucially, headroomCoversPod still rejects domains that cannot absorb a memory-heavy pod, so the CPU sort only affects ordering, not correctness: a memory-heavy pod in a tight-memory domain falls through to the next sorted domain rather than being misadmitted.
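
A minimal sketch of the ordering-versus-correctness split described above, assuming a simplified `headroom` struct in place of `v1.ResourceList`. The names `headroom`, `covers`, `orderDomains`, and `pickDomain` are hypothetical and exist only for this illustration; the PR's own helpers are podFitsSomeDomainWithHeadroom and headroomCoversPod, whose full bodies are not shown in this thread.

```go
package main

import "sort"

// headroom is a simplified stand-in for v1.ResourceList (assumption).
// When used to describe a pod, the fields hold its requests.
type headroom struct {
	cpuMilli int64
	memBytes int64
	podSlots int64
}

// covers reports whether h can absorb a pod with the given requests
// on every tracked resource, not just CPU.
func (h headroom) covers(req headroom) bool {
	return h.cpuMilli >= req.cpuMilli &&
		h.memBytes >= req.memBytes &&
		h.podSlots >= req.podSlots
}

// orderDomains returns domain names by descending CPU headroom,
// breaking ties alphabetically so the order is deterministic.
func orderDomains(remaining map[string]headroom) []string {
	domains := make([]string, 0, len(remaining))
	for d := range remaining {
		domains = append(domains, d)
	}
	sort.Slice(domains, func(i, j int) bool {
		ci, cj := remaining[domains[i]].cpuMilli, remaining[domains[j]].cpuMilli
		if ci != cj {
			return ci > cj
		}
		return domains[i] < domains[j]
	})
	return domains
}

// pickDomain walks the CPU-sorted domains and returns the first one
// whose headroom covers the pod on all resources.
func pickDomain(remaining map[string]headroom, pod headroom) (string, bool) {
	for _, d := range orderDomains(remaining) {
		if remaining[d].covers(pod) {
			return d, true
		}
	}
	return "", false
}
```

If zoneA has 4000m CPU but 1 GiB of memory and zoneB has 2000m CPU and 8 GiB, a pod requesting 500m and 4 GiB is tried against zoneA first, fails the memory check there, and lands in zoneB. The sort decides which domain is tried first; the coverage check decides admission.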

Comment on lines +396 to 403
if zoneAwareNodeFit {
	if _, ok := podFitsSomeDomainWithHeadroom(getPodsAssignedToNode, aboveToEvict[k], nodesByDomain, remainingHeadroom); !ok {
		d.logger.V(2).Info("ignoring pod for eviction: no target topology domain has fit + remaining headroom", "pod", klog.KObj(aboveToEvict[k]))
		continue
	}
}

podsForEviction[aboveToEvict[k]] = struct{}{}
Author

Acknowledged in d6b6276 with an inline comment at the call site. Deferring subtraction until a successful eviction is recorded would require restructuring the eviction loop to call back into the gate on success — that crosses a larger surface area than the gate currently owns. The current behaviour favours safety (admit fewer pods than nominally possible if a later filter rejects an admitted candidate) over throughput, which is the right direction for a gate whose stated purpose is to prevent over-admission churn. Happy to revisit if a real workload exhibits material under-admission caused by this.
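
To make the trade-off concrete, here is a small sketch of admit-time subtraction with a single domain. It assumes simplified types, and `resources`, `admitBatch`, and `laterFilter` are hypothetical names for this illustration only; the actual gate tracks a per-domain `v1.ResourceList` inside balanceDomains.

```go
package main

// resources is a simplified stand-in for a domain's headroom or a pod's
// requests (assumption for this sketch).
type resources struct {
	cpuMilli int64
	memBytes int64
}

// admitBatch gates each candidate against the remaining headroom of one
// domain and subtracts the candidate's requests as soon as the gate admits
// it. laterFilter simulates a downstream eviction filter; when it rejects
// an admitted candidate the headroom is NOT refunded, so later candidates
// see a more conservative budget.
func admitBatch(domainHeadroom resources, candidates []resources, laterFilter func(i int) bool) []int {
	remaining := domainHeadroom
	var evicted []int
	for i, c := range candidates {
		if remaining.cpuMilli < c.cpuMilli || remaining.memBytes < c.memBytes {
			continue // gate rejects: cumulative headroom is exhausted
		}
		// Decrement at admit time, before the eviction is recorded.
		remaining.cpuMilli -= c.cpuMilli
		remaining.memBytes -= c.memBytes
		if laterFilter(i) {
			evicted = append(evicted, i)
		}
	}
	return evicted
}
```

With 1000m of headroom and three candidates requesting 600m each, only the first is admitted when every later filter passes. If the later filter rejects that first candidate, nothing is evicted in this batch even though the second candidate would nominally have fit. That is the under-admission case the reply accepts in exchange for never over-admitting.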

Comment on lines +429 to +439
// groupNodesByDomain classifies the given nodes by their topology-domain label value.
// Nodes missing the label are skipped. Returns an empty (non-nil) map if no nodes have the label.
func groupNodesByDomain(nodes []*v1.Node, topologyKey string) map[string][]*v1.Node {
	result := make(map[string][]*v1.Node)
	for _, n := range nodes {
		if v, ok := n.Labels[topologyKey]; ok {
			result[v] = append(result[v], n)
		}
	}
	return result
}
Author

Noted. groupNodesByDomain does rebuild a map similar to the internal one in filterNodesBelowIdealAvg, but the two have different semantics: the internal map keeps all eligible nodes keyed by domain, then projects to a flat slice of only below-ideal nodes; groupNodesByDomain keys those filtered nodes again by domain. Folding them together is a worthwhile refactor, but it changes filterNodesBelowIdealAvg's public contract (currently returns []*v1.Node) and affects unrelated callers. Filing it as a follow-up: leaving the duplication here keeps this PR focused on the gate semantics and lets the refactor land independently with its own test coverage.

Comment on lines +321 to +330
// When zoneAwareNodeFit is enabled, group the same under-loaded nodes by their
// topology-domain value and compute each domain's remaining aggregate headroom.
// The headroom map is mutated as evictions are committed within this call so that
// later candidate pods see headroom already consumed by earlier ones.
var nodesByDomain map[string][]*v1.Node
var remainingHeadroom map[string]v1.ResourceList
if zoneAwareNodeFit {
	nodesByDomain = groupNodesByDomain(nodesBelowIdealAvg, tsc.TopologyKey)
	remainingHeadroom = computeDomainHeadroom(nodesByDomain, getPodsAssignedToNode)
}
Author

Addressed in d6b6276. The comment at the call site now explicitly states that the grouping and headroom snapshot are taken once at entry and are not recomputed as the (i, j) loop progresses, and that the snapshots drift conservatively: a domain saturating mid-batch stops admitting via the tracked headroom drain, and a domain falling further below idealAvg is simply not promoted into the gate's view. Recomputing per iteration would cost an extra ListPodsOnANode sweep across every domain at every (i, j) step.

The pinned v2.8.0 binary is built with go1.25; under the prow runner's
go1.26 toolchain it panics with "file requires newer Go version go1.26
(application built with go1.25)" while loading stdlib imports, which is
why pull-descheduler-verify-master started failing on every PR after
the runner upgrade.

v2.12.2 (released 2026-05-06) is built with go1.26.2 and runs clean
against the current tree (0 issues, no fmt diff). Pairs with the
hack/lib/go.sh regex update in 1c222dc.
The install.sh checksum database doesn't match v2.12.2's tarball (got
fd3a137c..., expected 8df580d2...), so the curl-based install bails.
Switching to go install — same approach kubernetes-sigs#1874 already queues — pulls
the binary through the module proxy with a verifiable go.sum, no
external checksum DB needed.
Copilot AI review requested due to automatic review settings May 18, 2026 15:16
Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.

Files not reviewed (1)
  • pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/zz_generated.deepcopy.go: Language not supported

Comment thread Makefile
LDFLAGS=-ldflags "-X ${LDFLAG_LOCATION}.version=${VERSION} -X ${LDFLAG_LOCATION}.buildDate=${BUILD} -X ${LDFLAG_LOCATION}.gitbranch=${BRANCH} -X ${LDFLAG_LOCATION}.gitsha1=${SHA1}"

-GOLANGCI_VERSION := v2.8.0
+GOLANGCI_VERSION := v2.12.2
Author

Fair point on scope hygiene. The two CI commits (hack/lib/go.sh accepting go1.26 and the golangci-lint v2.8.0→v2.12.2 / install-method bump) were necessary in-PR fixes because the prow runner upgraded to go1.26 mid-day on 2026-05-18 and broke pull-descheduler-verify-master on every open PR — there was no separate landed fix to rebase onto, and PR #1874 (v0.36 release prep) carries the same Makefile change but has not merged. Without these, this PR cannot be evaluated by CI at all. If you would prefer them split out, I can revert these two commits here and open a standalone PR with just those — happy to do whichever you prefer. The PR description should mention these regardless; updating it.

Comment on lines +505 to +512
sort.Slice(domains, func(i, j int) bool {
	ci := remainingHeadroom[domains[i]][v1.ResourceCPU]
	cj := remainingHeadroom[domains[j]][v1.ResourceCPU]
	if ci.MilliValue() != cj.MilliValue() {
		return ci.MilliValue() > cj.MilliValue()
	}
	return domains[i] < domains[j]
})
Author

Duplicate of #3259943720 from this review pass — addressed in d6b6276. The function-level comment now documents the CPU-only sort as the dominant heuristic and explicitly calls out that headroomCoversPod still enforces correctness for memory-heavy pods even when ordering is suboptimal. Not switching to a composite score in this PR: see the longer reply on the other thread.

result := make(map[string]v1.ResourceList, len(nodesByDomain))
for domain, dnodes := range nodesByDomain {
	var cpuMilli, memBytes, podSlots int64
	for _, n := range dnodes {
Author

Addressed in d6b6276. There is now a code-level comment at the ListPodsOnANode(..., nil) call site that explicitly ties the nil filter to nodeAvailableResources / fitsRequest in pkg/descheduler/node/node.go and notes that if the upstream filter ever tightens, this call site must be updated in the same change to keep the two halves of the gate aligned.

Comment on lines +397 to +400
if _, ok := podFitsSomeDomainWithHeadroom(getPodsAssignedToNode, aboveToEvict[k], nodesByDomain, remainingHeadroom); !ok {
	d.logger.V(2).Info("ignoring pod for eviction: no target topology domain has fit + remaining headroom", "pod", klog.KObj(aboveToEvict[k]))
	continue
}
Author

Addressed in d6b6276. The chosen target domain is now logged at V(2) on admit (ZoneAwareNodeFit admitted pod; target domain headroom decremented), so the return value is exercised in production, not just in tests. Operators investigating an imbalanced fill pattern can grep these log lines to see which domain each admitted pod was steered toward.

* Log the chosen domain at V(2) when ZoneAwareNodeFit admits a candidate,
  so operators can see which under-loaded domain is being filled and
  diagnose imbalanced commits. The function's return value is no longer
  exercised only in tests.

* Document, in code, four properties of the gate that previously lived
  only in the RFC / review thread:
  - the headroom snapshot is taken once at balanceDomains entry and
    drifts conservatively as the (i, j) loop progresses;
  - the headroom decrement happens at admit time, before the actual
    eviction is recorded; rejection by a later eviction filter makes
    the gate conservative on subsequent candidates (deferring would
    require restructuring the eviction loop);
  - the nil filter on ListPodsOnANode in computeDomainHeadroom is
    deliberately aligned with nodeAvailableResources / fitsRequest,
    and must be updated in lock-step if that upstream filter ever
    tightens to exclude terminal/terminating pods;
  - the descending-CPU sort key is the dominant heuristic; for
    memory-heavy pods, the headroomCoversPod check still enforces
    correctness even when ordering is suboptimal.
@k8s-ci-robot
Contributor

@BrunoChauvet: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-descheduler-test-e2e-k8s-master-1-36 | d6b6276 | link | true | /test pull-descheduler-test-e2e-k8s-master-1-36 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants