feat(tsc): add ZoneAwareNodeFit to RemovePodsViolatingTopologySpreadConstraint by BrunoChauvet · Pull Request #1858 · kubernetes-sigs/descheduler

BrunoChauvet · 2026-04-16T14:14:19Z

Summary

- Adds zoneAwareNodeFit (opt-in, default false) to RemovePodsViolatingTopologySpreadConstraintArgs
- When enabled, eviction is gated on cumulative per-topology-domain capacity: each commit drains the target domain's remaining headroom so later candidates in the same balancing batch see the drained state
- This is behaviourally distinct from topologyBalanceNodeFit, which is a stateless per-node fit check: multiple candidates can all individually look like they fit on the same target node, leading to overcommit churn
- Both flags compose; either can be disabled independently
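The difference between the two gates can be sketched in a few lines. This is a minimal illustration rather than the plugin's code: `domainHeadroom`, `statelessAdmit`, and `cumulativeAdmit` are hypothetical names, and only CPU millicores are modelled.

```go
package main

import "fmt"

// domainHeadroom is a hypothetical, simplified stand-in for the per-domain
// headroom the PR tracks; only CPU millicores are modelled here.
type domainHeadroom map[string]int64

// statelessAdmit mimics a stateless check: every candidate is compared against
// the same unchanged snapshot, so each one passes individually.
func statelessAdmit(headroom domainHeadroom, domain string, podCPUs []int64) int {
	admitted := 0
	for _, cpu := range podCPUs {
		if headroom[domain] >= cpu {
			admitted++
		}
	}
	return admitted
}

// cumulativeAdmit drains the domain's headroom on every commit, so later
// candidates see the capacity already promised to earlier ones.
func cumulativeAdmit(headroom domainHeadroom, domain string, podCPUs []int64) int {
	remaining := headroom[domain] // local copy; the caller's map is untouched
	admitted := 0
	for _, cpu := range podCPUs {
		if remaining >= cpu {
			remaining -= cpu
			admitted++
		}
	}
	return admitted
}

func main() {
	headroom := domainHeadroom{"zoneB": 1000}
	pods := []int64{400, 400, 400, 400}
	fmt.Println("stateless admits:", statelessAdmit(headroom, "zoneB", pods))
	fmt.Println("cumulative admits:", cumulativeAdmit(headroom, "zoneB", pods))
}
```

With 1000m of headroom and four 400m candidates, the stateless loop admits all four while the cumulative loop stops at two, which is the overcommit the new flag is meant to cap.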

Motivation

Upstream issues #1534 and #1067 describe eviction churn where the existing stateless gate admits more candidates than the target domain can actually absorb. This PR adds a stateful per-domain capacity gate as a non-breaking, opt-in flag.

Full design rationale, sequence diagram contrasting the two flags, and alternatives considered: docs/proposals/zone-aware-nodefit.md

Changes

| File | Change |
| --- | --- |
| `types.go` | Add `ZoneAwareNodeFit *bool` field |
| `defaults.go` | Default `ZoneAwareNodeFit` to `false` |
| `zz_generated.deepcopy.go` | Deepcopy for new field |
| `topologyspreadconstraint.go` | `groupNodesByDomain`, `computeDomainHeadroom`, `podFitsSomeDomainWithHeadroom`, `headroomCoversPod`, `subtractPodFromHeadroom`, and the cumulative-headroom integration in `balanceDomains` |
| `topologyspreadconstraint_test.go` | New `TestTopologySpreadConstraint` cases for cumulative-overflow, baseline contrast, multi-domain redirection, per-node-fit still required, multi-node-per-domain aggregation, and default-off regression; plus `TestGroupNodesByDomain`, `TestComputeDomainHeadroom`, `TestPodFitsSomeDomainWithHeadroom` unit tests |
| `docs/proposals/zone-aware-nodefit.md` | RFC / design doc |
| `README.md` | User-facing documentation |
| `hack/lib/go.sh`, `Makefile` | Tooling, unrelated to the feature: added go1.26 to the supported version regex and bumped `golangci-lint` to v2.12.2 (built with go1.26.2) via `go install`. Required to make `pull-descheduler-verify-master` green after the prow runner upgraded to go1.26 mid-day on 2026-05-18; the same Makefile changes are queued in #1874. Happy to split out if preferred. |

Test Plan

k8s-ci-robot · 2026-04-16T14:14:28Z

Welcome @BrunoChauvet!

It looks like this is your first PR to kubernetes-sigs/descheduler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/descheduler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

linux-foundation-easycla · 2026-04-16T14:14:28Z

The committers listed above are authorized under a signed CLA.

✅ login: BrunoChauvet / name: Bruno Chauvet (276c17f, 406d970, 548f6f5, 7893f00, 8007741, 80768db, 866f8e5, 87372a7, 8cdd120, 991c789, cd06e7e, dbb06c1, f2a0ff8, 1ff3716, 246936c, 8f501c4, a0263c9)

k8s-ci-robot · 2026-04-16T14:14:29Z

Hi @BrunoChauvet. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copilot

Pull request overview

This PR introduces an opt-in zoneAwareNodeFit configuration flag for the RemovePodsViolatingTopologySpreadConstraint plugin, intended to gate evictions based on per-zone scheduling capacity to reduce eviction churn.

Changes:

- Added ZoneAwareNodeFit to plugin args, with defaults and deepcopy support.
- Implemented zone-grouped “below ideal avg” node collection and a new fit-gating helper.
- Added unit/integration tests plus user/design documentation for the new flag.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/types.go | Adds `ZoneAwareNodeFit` argument and inline semantics comment. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/defaults.go | Defaults `ZoneAwareNodeFit` to `false`. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/defaults_test.go | Updates defaulting expectations and adds a preservation test for `ZoneAwareNodeFit=true`. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/zz_generated.deepcopy.go | Regenerates deepcopy logic for the new pointer field. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint.go | Adds zone-aware node grouping and an additional eviction gate. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint_test.go | Adds new ZoneAwareNodeFit-focused scenarios and a unit test for zone grouping. |
| docs/proposals/zone-aware-nodefit.md | New RFC-style design/proposal document. |
| README.md | Documents the new `zoneAwareNodeFit` option for users. |

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

BrunoChauvet · 2026-04-16T15:15:42Z

Code review

Found 2 issues:

RFC doc comment claims filterNodesByZoneBelowIdealAvg returns nil when no under-loaded domains exist, but the implementation always returns a non-nil empty map (result := make(map[string][]*v1.Node)). The function's own Go doc comment in the source is correct ("Returns an empty map (not nil)"), so only the RFC needs fixing.

https://github.com/BrunoChauvet/descheduler/blob/f56647c3c0cd2e59df4ab4ed1023eb86ea0d9818/docs/proposals/zone-aware-nodefit.md#L130-L134

All four ZoneAwareNodeFit test cases leave TopologyBalanceNodeFit at its default (true), meaning the existing topologyBalanceNodeFit gate already determines every outcome. There is no test with topologyBalanceNodeFit: false, zoneAwareNodeFit: true that demonstrates the new flag independently changes the eviction decision. Without that, the test suite does not verify the flag does anything on its own.

https://github.com/BrunoChauvet/descheduler/blob/f56647c3c0cd2e59df4ab4ed1023eb86ea0d9818/pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint_test.go#L1457-L1580

🤖 Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.

BrunoChauvet · 2026-04-16T15:39:11Z

/easycla

googs1025 · 2026-04-17T07:43:06Z

/ok-to-test

BrunoChauvet · 2026-04-18T17:14:22Z

/retest-required

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

Reviewers correctly identified that the previous implementation reduced to OR(OR(domains)) = OR(merge(domains)), equivalent to topologyBalanceNodeFit's existing per-node fit check over the union of under-loaded domain nodes. The per-domain map keys were never used inside the plugin and the gate added nothing beyond a renamed alias of the existing flag.

Redesign the criterion so per-domain grouping is load-bearing:

- groupNodesByDomain classifies under-loaded nodes inline (no duplicated filterNodesBelowIdealAvg traversal).
- computeDomainHeadroom builds a per-domain aggregate (cpu / memory / pod count) of allocatable minus already-requested resources, once per balanceDomains call.
- podFitsSomeDomainWithHeadroom now requires BOTH a fitting node AND remaining aggregate headroom in some specific domain, decrementing that domain's headroom on commit. Pod N's decision depends on commitments 1..N-1, which is not expressible as OR(merge).

This catches the churn case from kubernetes-sigs#1534/kubernetes-sigs#1067: when balanceDomains schedules N evictions toward an under-loaded domain that only has aggregate headroom for K<N pods, the existing flag stateless-passes all N (the indexer is not mutated mid-loop) and the scheduler returns the excess to the over-loaded domain. The cumulative gate caps eviction at K.

Test cases replaced with five that exercise the new design (cumulative-overflow, baseline contrast, multi-domain redirection, per-node fit still required, default-off regression). Unit tests on the new helper cover deterministic ordering, headroom drain, and both rejection paths.

RFC and README updated to drop the obsolete "independent gate when topologyBalanceNodeFit is off" framing.
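The OR(OR(domains)) = OR(merge(domains)) observation can be checked with a small sketch. The types and helpers below (`node`, `fitsAny`, `fitsSomeDomain`, `merge`) are hypothetical simplifications that model only free CPU; they are not the plugin's functions.

```go
package main

import "fmt"

// node is a hypothetical, simplified node: only free CPU millicores are modelled.
type node struct {
	name    string
	freeCPU int64
}

// fitsAny reports whether the pod fits on at least one node in the slice.
func fitsAny(nodes []node, podCPU int64) bool {
	for _, n := range nodes {
		if n.freeCPU >= podCPU {
			return true
		}
	}
	return false
}

// fitsSomeDomain is the OR(OR(domains)) form: a per-node check inside each domain.
func fitsSomeDomain(byDomain map[string][]node, podCPU int64) bool {
	for _, nodes := range byDomain {
		if fitsAny(nodes, podCPU) {
			return true
		}
	}
	return false
}

// merge flattens every domain's nodes into one slice, discarding the domain keys.
func merge(byDomain map[string][]node) []node {
	var all []node
	for _, nodes := range byDomain {
		all = append(all, nodes...)
	}
	return all
}

func main() {
	byDomain := map[string][]node{
		"zoneB": {{name: "b1", freeCPU: 500}},
		"zoneC": {{name: "c1", freeCPU: 1500}, {name: "c2", freeCPU: 200}},
	}
	for _, podCPU := range []int64{100, 1000, 2000} {
		grouped := fitsSomeDomain(byDomain, podCPU)
		flat := fitsAny(merge(byDomain), podCPU)
		fmt.Printf("pod=%dm grouped=%v flat=%v\n", podCPU, grouped, flat)
	}
}
```

Because neither form carries state between candidates, the grouped and flat answers always agree; grouping only starts to matter once a commit changes what the next candidate sees.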

…g tests

The first revision of the test suite only included one case (ZANF=true with TBNF=false) that would fail under the previous OR(merge) implementation; the others either tested baselines or regressions. That left thin coverage of the exact bug the redesign fixes.

- Add a case with both flags enabled (TBNF=true default + ZANF=true) that demonstrates the new gate caps cumulative load on top of the existing flag.
- Tighten the multi-domain redirection test so zoneC also has bounded headroom: the previous impl would admit all 4 evictions; the new gate admits 3 (zoneB drains after 1, zoneC after 2, 4th rejected).
- Add unit tests for groupNodesByDomain (label-based classification) and computeDomainHeadroom (allocatable - requested aggregation).

Update the RFC's test plan to explicitly mark which cases catch the prior implementation's gap vs. which are regression coverage, and add a Mermaid sequence diagram contrasting topologyBalanceNodeFit's stateless per-node check against zoneAwareNodeFit's cumulative per-domain headroom tracking.
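The multi-domain redirection scenario (zoneB drains after 1, zoneC after 2, 4th rejected) can be walked through with a short sketch. `admitBatch` is a hypothetical helper that tracks CPU only and visits domains alphabetically; the pod sizes and headroom values are assumptions chosen to reproduce the 3-of-4 outcome.

```go
package main

import (
	"fmt"
	"sort"
)

// admitBatch walks the candidates in order and commits each one to the first
// domain (alphabetical order, an assumption made here for determinism) whose
// remaining CPU headroom covers it. An empty string marks a rejected candidate.
func admitBatch(headroom map[string]int64, podCPUs []int64) []string {
	remaining := make(map[string]int64, len(headroom))
	domains := make([]string, 0, len(headroom))
	for d, h := range headroom {
		remaining[d] = h
		domains = append(domains, d)
	}
	sort.Strings(domains)

	targets := make([]string, 0, len(podCPUs))
	for _, cpu := range podCPUs {
		target := ""
		for _, d := range domains {
			if remaining[d] >= cpu {
				remaining[d] -= cpu // commit: drain the chosen domain
				target = d
				break
			}
		}
		targets = append(targets, target)
	}
	return targets
}

func main() {
	// zoneB has room for one 1000m pod, zoneC for two.
	headroom := map[string]int64{"zoneB": 1000, "zoneC": 2000}
	fmt.Println(admitBatch(headroom, []int64{1000, 1000, 1000, 1000}))
}
```

The first candidate lands in zoneB, the next two in zoneC, and the fourth is rejected because both domains are drained.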

Copilot

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

Files not reviewed (1)

pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/zz_generated.deepcopy.go: Language not supported

Comments suppressed due to low confidence (1)

pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint.go:467

ListPodsOnANode is called with a nil filter, so pods in terminal phases (Succeeded/Failed) and pods being deleted are counted as consuming CPU/memory/pod-slot capacity. Those pods do not actually hold resources on the node from the scheduler's perspective, so the computed headroom can be systematically under-estimated, causing the gate to reject evictions that would in fact fit. Consider passing a filter that excludes terminal pods (similar to how the scheduler accounts for node usage).

			podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil)
			if err != nil {
				continue
			}
			for _, p := range podsOnNode {
				req, _ := utils.PodRequestsAndLimits(p)
				if q, ok := req[v1.ResourceCPU]; ok {
					cpuMilli -= q.MilliValue()
				}
				if q, ok := req[v1.ResourceMemory]; ok {
					memBytes -= q.Value()
				}
				podSlots--
			}
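A filter of the kind suggested above might look like the following sketch. The `pod` type, `holdsResources`, and `cpuHeadroom` are hypothetical simplifications; the sketch only excludes terminal phases (not pods being deleted) and does not use the descheduler's `podutil` API.

```go
package main

import "fmt"

// pod is a hypothetical, simplified pod; phase mirrors the Kubernetes pod
// phase strings ("Running", "Succeeded", "Failed", ...).
type pod struct {
	name     string
	phase    string
	cpuMilli int64
}

// holdsResources is an illustrative filter: pods in a terminal phase no longer
// consume node capacity, so they are excluded from the usage sum.
func holdsResources(p pod) bool {
	return p.phase != "Succeeded" && p.phase != "Failed"
}

// cpuHeadroom subtracts the requests of every pod accepted by keep from the
// node's allocatable CPU. A nil keep counts every pod, like a nil filter.
func cpuHeadroom(allocatable int64, pods []pod, keep func(pod) bool) int64 {
	free := allocatable
	for _, p := range pods {
		if keep != nil && !keep(p) {
			continue
		}
		free -= p.cpuMilli
	}
	return free
}

func main() {
	pods := []pod{
		{name: "web", phase: "Running", cpuMilli: 1000},
		{name: "job-done", phase: "Succeeded", cpuMilli: 2000},
		{name: "job-crashed", phase: "Failed", cpuMilli: 500},
	}
	fmt.Println("no filter:     ", cpuHeadroom(4000, pods, nil))
	fmt.Println("terminal-aware:", cpuHeadroom(4000, pods, holdsResources))
}
```

On a 4000m node the unfiltered sum leaves 500m of headroom, while skipping the two finished pods leaves 3000m, which is the under-estimate the comment describes.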

BrunoChauvet · 2026-05-18T15:03:42Z

+// contribute only their raw allocatable (best-effort, conservatively over-counting
+// rather than blocking eviction).
+func computeDomainHeadroom(nodesByDomain map[string][]*v1.Node, nodeIndexer podutil.GetPodsAssignedToNodeFunc) map[string]v1.ResourceList {
+	result := make(map[string]v1.ResourceList, len(nodesByDomain))
+	for domain, dnodes := range nodesByDomain {
+		var cpuMilli, memBytes, podSlots int64
+		for _, n := range dnodes {
+			cpuMilli += n.Status.Allocatable.Cpu().MilliValue()
+			memBytes += n.Status.Allocatable.Memory().Value()
+			podSlots += n.Status.Allocatable.Pods().Value()
+			podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil)
+			if err != nil {
+				continue
+			}


Addressed in 72a0bfa. computeDomainHeadroom now fails closed on indexer error: the failing node is logged via klog.V(2).ErrorS with the node name and domain, and is omitted from the aggregate entirely (neither its allocatable nor any of its pod requests counted). The previous code path counted allocatable but skipped the subtraction step, which was the unsafe direction you flagged. A new unit test TestComputeDomainHeadroomFailsClosedOnIndexerError pins the new behaviour.
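The fail-closed behaviour described here can be sketched as follows. This is a simplified illustration, not the code in 72a0bfa: `node`, `listRequests`, `failingFor`, and `domainCPUHeadroom` are hypothetical, only CPU is tracked, and the log line uses `fmt` in place of klog.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical, simplified node with only allocatable CPU.
type node struct {
	name     string
	allocCPU int64
}

// listRequests is a stand-in for the pod indexer: it returns the CPU requests
// of the pods on a node, or an error when the lookup fails.
type listRequests func(nodeName string) ([]int64, error)

// failingFor returns a lister that errors for one node name and reports the
// same fixed requests for every other node. Purely illustrative.
func failingFor(badNode string, reqs []int64) listRequests {
	return func(nodeName string) ([]int64, error) {
		if nodeName == badNode {
			return nil, errors.New("indexer unavailable")
		}
		return reqs, nil
	}
}

// domainCPUHeadroom sums free CPU across a domain's nodes. It fails closed:
// a node whose pod listing errors contributes nothing to the aggregate, so
// its allocatable can never be counted without its requests subtracted.
func domainCPUHeadroom(nodes []node, list listRequests) int64 {
	var total int64
	for _, n := range nodes {
		reqs, err := list(n.name)
		if err != nil {
			fmt.Printf("skipping node %s: %v\n", n.name, err)
			continue
		}
		free := n.allocCPU
		for _, r := range reqs {
			free -= r
		}
		total += free
	}
	return total
}

func main() {
	nodes := []node{{name: "b1", allocCPU: 4000}, {name: "b2", allocCPU: 4000}}
	fmt.Println("headroom:", domainCPUHeadroom(nodes, failingFor("b2", []int64{1000, 500})))
}
```

The key ordering choice is that the listing happens before the node's allocatable is added, so an error cannot leave a half-counted node in the total.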

BrunoChauvet · 2026-05-18T15:03:47Z

+	domains := make([]string, 0, len(nodesByDomain))
+	for d := range nodesByDomain {
+		domains = append(domains, d)
+	}
+	sort.Strings(domains)
+
+	for _, domain := range domains {
+		if !headroomCoversPod(remainingHeadroom[domain], podReq) {
+			continue
+		}
+		if !node.PodFitsAnyOtherNode(nodeIndexer, pod, nodesByDomain[domain]) {
+			continue
+		}
+		subtractPodFromHeadroom(remainingHeadroom, domain, podReq)
+		return domain, true
+	}
+	return "", false


Addressed in 72a0bfa. podFitsSomeDomainWithHeadroom now sorts domains by descending remaining CPU headroom with an alphabetical tie-breaker, instead of pure alphabetical order. This spreads load toward the roomiest under-loaded domain first while keeping deterministic test behaviour for ties. The doc comment on the function and the RFC test-plan table were updated accordingly, and the corresponding unit test was rewritten (commits to roomiest qualifying domain + equal headroom falls back to alphabetical for determinism).

BrunoChauvet · 2026-05-18T15:03:53Z

+// groupNodesByDomain classifies the given nodes by their topology-domain label value.
+// Nodes missing the label are skipped. Returns an empty (non-nil) map if no nodes have the label.
+func groupNodesByDomain(nodes []*v1.Node, topologyKey string) map[string][]*v1.Node {
+	result := make(map[string][]*v1.Node)
+	for _, n := range nodes {
+		if v, ok := n.Labels[topologyKey]; ok {
+			result[v] = append(result[v], n)
+		}
+	}
+	return result
+}
+
+// computeDomainHeadroom returns, for each topology domain, the aggregate remaining
+// resource headroom (allocatable minus already-requested) summed across the domain's
+// nodes. Tracked resources: cpu, memory, pod count. Nodes whose pod listing fails
+// contribute only their raw allocatable (best-effort, conservatively over-counting
+// rather than blocking eviction).
+func computeDomainHeadroom(nodesByDomain map[string][]*v1.Node, nodeIndexer podutil.GetPodsAssignedToNodeFunc) map[string]v1.ResourceList {
+	result := make(map[string]v1.ResourceList, len(nodesByDomain))
+	for domain, dnodes := range nodesByDomain {
+		var cpuMilli, memBytes, podSlots int64
+		for _, n := range dnodes {
+			cpuMilli += n.Status.Allocatable.Cpu().MilliValue()
+			memBytes += n.Status.Allocatable.Memory().Value()
+			podSlots += n.Status.Allocatable.Pods().Value()
+			podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil)
+			if err != nil {
+				continue
+			}
+			for _, p := range podsOnNode {
+				req, _ := utils.PodRequestsAndLimits(p)
+				if q, ok := req[v1.ResourceCPU]; ok {
+					cpuMilli -= q.MilliValue()
+				}
+				if q, ok := req[v1.ResourceMemory]; ok {
+					memBytes -= q.Value()
+				}
+				podSlots--
+			}
+		}
+		result[domain] = v1.ResourceList{
+			v1.ResourceCPU:    *resource.NewMilliQuantity(cpuMilli, resource.DecimalSI),
+			v1.ResourceMemory: *resource.NewQuantity(memBytes, resource.BinarySI),
+			v1.ResourcePods:   *resource.NewQuantity(podSlots, resource.DecimalSI),
+		}
+	}
+	return result
+}


Already addressed in the PR description rewrite from 2026-05-18: the Changes table now references groupNodesByDomain, computeDomainHeadroom, podFitsSomeDomainWithHeadroom, headroomCoversPod, and subtractPodFromHeadroom, and the Test Plan calls out the corresponding unit-test functions by name.

BrunoChauvet · 2026-05-18T15:03:58Z

+`zoneAwareNodeFit` tracks per-domain cumulative state. Use it when batched eviction
+would otherwise push more pods toward an under-loaded domain than it can actually
+absorb, causing the scheduler to send the excess back to the over-loaded domain
+(eviction churn).


Fixed in 72a0bfa. A blank line was inserted before the zoneAwareNodeFit yaml fence so it renders reliably inside the surrounding prose, matching the pattern used by the topologyBalanceNodeFit example block immediately above.

BrunoChauvet · 2026-05-17T12:35:23Z

Thank you @ingvagabund for the careful review — you saved this PR from shipping a no-op. You were right that the original design collapsed to OR(merge(domains)). The redesign now tracks cumulative per-domain headroom across the balancing batch: each commit decrements the target domain's remaining capacity, so later candidates see drained state. The RFC has a sequence diagram walking through it, and the test coverage is strengthened with cases that would have failed under the previous implementation. Sorry for the inbox noise while we worked it out.

CI's gofumpt v2.8.0 lint flagged the signature `headroom v1.ResourceList, podReq v1.ResourceList`, which should be written as `headroom, podReq v1.ResourceList`.

BrunoChauvet · 2026-05-17T12:49:37Z

/retest pull-descheduler-unit-test-master-master

BrunoChauvet · 2026-05-17T13:21:52Z

/retest

Add a TestTopologySpreadConstraint case where the under-loaded domain (zoneB) contains two nodes B1 and B2 that each have one existing pod. Aggregate headroom is the sum of both nodes' free capacity; a grouping bug that mis-bucketed nodes would under-count and yield 1 eviction instead of the expected 2. Addresses review feedback on grouping correctness coverage.

BrunoChauvet · 2026-05-18T14:25:35Z

Force-pushed to strip Co-Authored-By: Claude* trailers from five commits (85b40ea6, 9e7ec98e, bb891e17, b1775e3c, fbde9902) — these were blocking EasyCLA because noreply@anthropic.com is not a CLA-on-file co-author. Tree contents are unchanged on those commits; only commit messages and resulting SHAs differ. Also added a multi-node-per-domain test case and refreshed the PR description Test Plan, addressing the May 17 review comment.

/easycla
/retest

Copilot

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

Files not reviewed (1)

pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/zz_generated.deepcopy.go: Language not supported

Comments suppressed due to low confidence (2)

pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint.go:457

The error returned from ListPodsOnANode is silently swallowed with a bare continue. In that case, the node's full allocatable is still added to the domain's headroom but none of its existing pod requests are subtracted, which significantly over-counts headroom for that node and can lead to admitting more evictions than the domain can actually absorb — the exact churn this feature is intended to prevent. At minimum the error should be logged (klog.V(2)) so operators can diagnose this case; ideally the domain should be marked as "headroom unknown" and excluded from the gate's positive admission decisions.

			podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil)
			if err != nil {
				continue
			}

pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint.go:520

headroomCoversPod only validates cpu, memory and pod count. Pods may request other resources tracked by fitsRequest (e.g., ephemeral-storage, hugepages, extended resources like GPUs); for those, the per-domain headroom gate will silently admit a pod that the scheduler/per-node fit check would later reject in the target domain, undermining the cumulative-headroom guarantee. Either iterate over all resource names present in podReq (mirroring fitsRequest's behaviour) or document explicitly that the gate is approximate and only covers cpu/memory/pods.

func headroomCoversPod(headroom, podReq v1.ResourceList) bool {
	cpuHead := headroom[v1.ResourceCPU]
	memHead := headroom[v1.ResourceMemory]
	podsHead := headroom[v1.ResourcePods]
	cpuReq := podReq[v1.ResourceCPU]
	memReq := podReq[v1.ResourceMemory]
	return cpuHead.MilliValue() >= cpuReq.MilliValue() &&
		memHead.Value() >= memReq.Value() &&
		podsHead.Value() >= 1
}
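The first option mentioned above, iterating over every resource name present in the pod's request, could look roughly like this. It is a sketch under assumptions: `resources` is a plain map standing in for `v1.ResourceList`, `headroomCoversAll` is a hypothetical name, and `example.com/gpu` is an invented extended-resource name.

```go
package main

import "fmt"

// resources maps a resource name to an integer quantity. It is a simplified,
// hypothetical stand-in for v1.ResourceList.
type resources map[string]int64

// headroomCoversAll checks every resource the pod requests, not just cpu and
// memory. A resource absent from headroom reads as zero, so a request for it
// is rejected. One pod slot is always required.
func headroomCoversAll(headroom, podReq resources) bool {
	for name, want := range podReq {
		if headroom[name] < want {
			return false
		}
	}
	return headroom["pods"] >= 1
}

func main() {
	headroom := resources{"cpu": 2000, "memory": 4 << 30, "pods": 10}
	plain := resources{"cpu": 500, "memory": 1 << 30}
	gpu := resources{"cpu": 500, "memory": 1 << 30, "example.com/gpu": 1}
	fmt.Println("plain pod fits:", headroomCoversAll(headroom, plain))
	fmt.Println("gpu pod fits:  ", headroomCoversAll(headroom, gpu))
}
```

For this to be meaningful, the headroom aggregate would also need to track the extra resources; otherwise every pod requesting them is rejected, as the GPU pod is here.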

BrunoChauvet · 2026-05-18T15:04:20Z

+func computeDomainHeadroom(nodesByDomain map[string][]*v1.Node, nodeIndexer podutil.GetPodsAssignedToNodeFunc) map[string]v1.ResourceList {
+	result := make(map[string]v1.ResourceList, len(nodesByDomain))
+	for domain, dnodes := range nodesByDomain {
+		var cpuMilli, memBytes, podSlots int64
+		for _, n := range dnodes {
+			cpuMilli += n.Status.Allocatable.Cpu().MilliValue()
+			memBytes += n.Status.Allocatable.Memory().Value()
+			podSlots += n.Status.Allocatable.Pods().Value()
+			podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil)
+			if err != nil {
+				continue
+			}
+			for _, p := range podsOnNode {
+				req, _ := utils.PodRequestsAndLimits(p)
+				if q, ok := req[v1.ResourceCPU]; ok {
+					cpuMilli -= q.MilliValue()
+				}
+				if q, ok := req[v1.ResourceMemory]; ok {
+					memBytes -= q.Value()
+				}
+				podSlots--
+			}
+		}
+		result[domain] = v1.ResourceList{
+			v1.ResourceCPU:    *resource.NewMilliQuantity(cpuMilli, resource.DecimalSI),
+			v1.ResourceMemory: *resource.NewQuantity(memBytes, resource.BinarySI),
+			v1.ResourcePods:   *resource.NewQuantity(podSlots, resource.DecimalSI),
+		}
+	}
+	return result
+}


Partial: the symptom you describe (under-estimating headroom because the indexer view includes pods the scheduler would not count) is real, but nodeAvailableResources/fitsRequest in pkg/descheduler/node/node.go also calls ListPodsOnANode(..., nil) with no filter, so both sides see the same pod set today. Adding a terminal-pod filter only to computeDomainHeadroom would make the two halves of the gate disagree.

Fixed the more impactful sibling issue in 72a0bfa: indexer errors now fail closed instead of silently over-counting. If a future change tightens fitsRequest's pod-set filter (e.g. excluding terminal/terminating pods), computeDomainHeadroom should adopt the same filter in the same change so the two stay aligned.

BrunoChauvet · 2026-05-18T15:04:25Z

+	domains := make([]string, 0, len(nodesByDomain))
+	for d := range nodesByDomain {
+		domains = append(domains, d)
+	}
+	sort.Strings(domains)
+
+	for _, domain := range domains {
+		if !headroomCoversPod(remainingHeadroom[domain], podReq) {
+			continue
+		}
+		if !node.PodFitsAnyOtherNode(nodeIndexer, pod, nodesByDomain[domain]) {
+			continue
+		}
+		subtractPodFromHeadroom(remainingHeadroom, domain, podReq)
+		return domain, true
+	}
+	return "", false


Addressed in 72a0bfa. Domain iteration now uses descending remaining CPU headroom (with alphabetical tie-break for determinism), and the function-level doc comment documents the choice in code, not only in the RFC, so a future maintainer changing the sort comparator would have to read why first.

BrunoChauvet · 2026-05-18T15:04:30Z

+`zoneAwareNodeFit` tracks per-domain cumulative state. Use it when batched eviction
+would otherwise push more pods toward an under-loaded domain than it can actually
+absorb, causing the scheduler to send the excess back to the over-loaded domain
+(eviction churn).


Fixed in 72a0bfa: blank line inserted before the zoneAwareNodeFit yaml fence to match the surrounding example blocks and meet CommonMark expectations.

Prow's verify image upgraded to go1.26, which trips the supported-version check in hack/lib/go.sh and makes pull-descheduler-verify-master fail on every open PR. Replace the go1.23-1.25 regex with go1.24-1.26, matching the change already queued in kubernetes-sigs#1874 (v0.36.0 release prep). This is the minimal patch required to unblock verify-master for this PR and will conflict-resolve cleanly when kubernetes-sigs#1874 merges.

Address review feedback on the ZoneAwareNodeFit gate:

* computeDomainHeadroom now fails closed on indexer error. Previously a failure in ListPodsOnANode left the node's allocatable counted but its pod requests un-subtracted, systematically over-counting headroom and re-introducing the eviction churn this gate is meant to prevent. The failing node is now logged via klog.V(2).ErrorS and omitted from the aggregate entirely. New TestComputeDomainHeadroomFailsClosedOnIndexerError pins the behaviour.
* podFitsSomeDomainWithHeadroom now iterates domains by descending remaining CPU headroom with an alphabetical tie-breaker, instead of pure alphabetical order. Alphabetical-first drains zoneA before zoneB regardless of which domain currently has slack, leaving a later candidate rejected even though the other domain had room. Roomiest-first spreads load more evenly while remaining deterministic for testability.
* README: insert blank line before the zoneAwareNodeFit yaml fence so the code block renders reliably inside surrounding prose.
* RFC: update the helper-unit-test table to reflect the new ordering semantics, add the multi-node-per-domain row, and document the fail-closed indexer-error path.

BrunoChauvet · 2026-05-18T15:04:45Z

Force-pushed 72a0bfa to address the Copilot review threads from 2026-05-17 and 2026-05-18:

- computeDomainHeadroom fails closed on ListPodsOnANode errors (no more silent over-counting); covered by new TestComputeDomainHeadroomFailsClosedOnIndexerError.
- Domain iteration in podFitsSomeDomainWithHeadroom switched from pure alphabetical to descending remaining CPU headroom with alphabetical tie-break; doc comment now documents the choice in code.
- README zoneAwareNodeFit yaml fence given a blank-line separator.
- RFC test-plan table updated with the new ordering semantics, the multi-node-per-domain row, and the fail-closed indexer-error path.

Also pushed 1c222dc earlier to unblock pull-descheduler-verify-master: hack/lib/go.sh now accepts go1.26 (matching the queued change in #1874). pull-descheduler-test-e2e-k8s-master-1-36 is still red due to the missing kindest/node:v1.36.0 Docker image — that is an infra dependency, not addressable from this PR.

/retest

Copilot

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.

Files not reviewed (1)

pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/zz_generated.deepcopy.go: Language not supported

BrunoChauvet · 2026-05-18T15:29:16Z

+	sort.Slice(domains, func(i, j int) bool {
+		ci := remainingHeadroom[domains[i]][v1.ResourceCPU]
+		cj := remainingHeadroom[domains[j]][v1.ResourceCPU]
+		if ci.MilliValue() != cj.MilliValue() {
+			return ci.MilliValue() > cj.MilliValue()
+		}
+		return domains[i] < domains[j]
+	})


Addressed via documentation in d6b6276 rather than a composite sort key. The CPU-only sort is intentional and the function-level doc comment now states that explicitly: CPU is the most commonly request-constrained resource for the workloads this gate spreads across topology domains, so a multi-resource composite would add complexity without changing behaviour for the typical case. Crucially headroomCoversPod still rejects domains that cannot absorb a memory-heavy pod, so the CPU sort only affects ordering, not correctness — a memory-heavy pod in a tight-memory domain falls through to the next sorted domain rather than being misadmitted.
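The claim that the CPU-only sort affects ordering but not correctness can be illustrated with a sketch. `headroom` and `pickDomain` are hypothetical simplifications using plain integers rather than `v1.ResourceList`, and the per-node fit check is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// headroom is a hypothetical, simplified per-domain aggregate.
type headroom struct {
	cpuMilli int64
	memBytes int64
}

// pickDomain visits domains by descending remaining CPU (alphabetical on ties)
// and commits the pod to the first one whose CPU and memory both cover it.
// The sort only decides the visiting order; the coverage check still rejects
// any domain that cannot absorb the pod.
func pickDomain(remaining map[string]headroom, podCPU, podMem int64) (string, bool) {
	domains := make([]string, 0, len(remaining))
	for d := range remaining {
		domains = append(domains, d)
	}
	sort.Slice(domains, func(i, j int) bool {
		ci, cj := remaining[domains[i]].cpuMilli, remaining[domains[j]].cpuMilli
		if ci != cj {
			return ci > cj
		}
		return domains[i] < domains[j]
	})
	for _, d := range domains {
		h := remaining[d]
		if h.cpuMilli < podCPU || h.memBytes < podMem {
			continue
		}
		remaining[d] = headroom{cpuMilli: h.cpuMilli - podCPU, memBytes: h.memBytes - podMem}
		return d, true
	}
	return "", false
}

func main() {
	remaining := map[string]headroom{
		"zoneA": {cpuMilli: 4000, memBytes: 1 << 30}, // most CPU, tight memory
		"zoneB": {cpuMilli: 2000, memBytes: 8 << 30},
	}
	d, ok := pickDomain(remaining, 500, 4<<30) // memory-heavy pod
	fmt.Println(d, ok)
}
```

zoneA sorts first on CPU but cannot hold the 4 GiB pod, so the pod falls through to zoneB instead of being admitted into the tight-memory domain.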

BrunoChauvet · 2026-05-18T15:29:21Z

+			if zoneAwareNodeFit {
+				if _, ok := podFitsSomeDomainWithHeadroom(getPodsAssignedToNode, aboveToEvict[k], nodesByDomain, remainingHeadroom); !ok {
+					d.logger.V(2).Info("ignoring pod for eviction: no target topology domain has fit + remaining headroom", "pod", klog.KObj(aboveToEvict[k]))
+					continue
+				}
+			}
+
 			podsForEviction[aboveToEvict[k]] = struct{}{}


Acknowledged in d6b6276 with an inline comment at the call site. Deferring subtraction until a successful eviction is recorded would require restructuring the eviction loop to call back into the gate on success — that crosses a larger surface area than the gate currently owns. The current behaviour favours safety (admit fewer pods than nominally possible if a later filter rejects an admitted candidate) over throughput, which is the right direction for a gate whose stated purpose is to prevent over-admission churn. Happy to revisit if a real workload exhibits material under-admission caused by this.

BrunoChauvet · 2026-05-18T15:29:26Z

+// groupNodesByDomain classifies the given nodes by their topology-domain label value.
+// Nodes missing the label are skipped. Returns an empty (non-nil) map if no nodes have the label.
+func groupNodesByDomain(nodes []*v1.Node, topologyKey string) map[string][]*v1.Node {
+	result := make(map[string][]*v1.Node)
+	for _, n := range nodes {
+		if v, ok := n.Labels[topologyKey]; ok {
+			result[v] = append(result[v], n)
+		}
+	}
+	return result
+}


Noted. groupNodesByDomain does rebuild a similar map to the internal one in filterNodesBelowIdealAvg, but the two have different semantics: the internal map keeps all eligible nodes keyed by domain, then projects to a flat slice of only below-ideal nodes; groupNodesByDomain keys those filtered nodes again by domain. Folding them together is a worthwhile refactor, but it changes filterNodesBelowIdealAvg's public contract (currently returns []*v1.Node) and affects unrelated callers. Filing as follow-up: leaving the duplication here keeps this PR focused on the gate semantics and lets the refactor land independently with its own test coverage.

BrunoChauvet · 2026-05-18T15:29:32Z

+	// When zoneAwareNodeFit is enabled, group the same under-loaded nodes by their
+	// topology-domain value and compute each domain's remaining aggregate headroom.
+	// The headroom map is mutated as evictions are committed within this call so that
+	// later candidate pods see headroom already consumed by earlier ones.
+	var nodesByDomain map[string][]*v1.Node
+	var remainingHeadroom map[string]v1.ResourceList
+	if zoneAwareNodeFit {
+		nodesByDomain = groupNodesByDomain(nodesBelowIdealAvg, tsc.TopologyKey)
+		remainingHeadroom = computeDomainHeadroom(nodesByDomain, getPodsAssignedToNode)
+	}


Addressed in d6b6276. The comment at the call site now explicitly states that the grouping and headroom snapshot are taken once at entry and are not recomputed as the (i, j) loop progresses, and that the snapshots drift conservatively: a domain saturating mid-batch stops admitting via the tracked headroom drain, and a domain falling further below idealAvg is simply not promoted into the gate's view. Recomputing per iteration would cost an extra ListPodsOnANode sweep across every domain at every (i, j) step.

The pinned v2.8.0 binary is built with go1.25; under the prow runner's go1.26 toolchain it panics with "file requires newer Go version go1.26 (application built with go1.25)" while loading stdlib imports, which is why pull-descheduler-verify-master started failing on every PR after the runner upgrade. v2.12.2 (released 2026-05-06) is built with go1.26.2 and runs clean against the current tree (0 issues, no fmt diff). Pairs with the hack/lib/go.sh regex update in 1c222dc.

The install.sh checksum database doesn't match v2.12.2's tarball (got fd3a137c..., expected 8df580d2...), so the curl-based install bails. Switching to go install — same approach kubernetes-sigs#1874 already queues — pulls the binary through the module proxy with a verifiable go.sum, no external checksum DB needed.

Copilot

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.

Files not reviewed (1)

pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/zz_generated.deepcopy.go: Language not supported

BrunoChauvet · 2026-05-18T15:29:37Z

 LDFLAGS=-ldflags "-X ${LDFLAG_LOCATION}.version=${VERSION} -X ${LDFLAG_LOCATION}.buildDate=${BUILD} -X ${LDFLAG_LOCATION}.gitbranch=${BRANCH} -X ${LDFLAG_LOCATION}.gitsha1=${SHA1}"

-GOLANGCI_VERSION := v2.8.0
+GOLANGCI_VERSION := v2.12.2


Fair point on scope hygiene. The two CI commits (hack/lib/go.sh accepting go1.26 and the golangci-lint v2.8.0→v2.12.2 / install-method bump) were necessary in-PR fixes because the prow runner upgraded to go1.26 mid-day on 2026-05-18 and broke pull-descheduler-verify-master on every open PR — there was no separate landed fix to rebase onto, and PR #1874 (v0.36 release prep) carries the same Makefile change but has not merged. Without these, this PR cannot be evaluated by CI at all. If you would prefer them split out, I can revert these two commits here and open a standalone PR with just those — happy to do whichever you prefer. The PR description should mention these regardless; updating it.

BrunoChauvet · 2026-05-18T15:29:42Z

+	sort.Slice(domains, func(i, j int) bool {
+		ci := remainingHeadroom[domains[i]][v1.ResourceCPU]
+		cj := remainingHeadroom[domains[j]][v1.ResourceCPU]
+		if ci.MilliValue() != cj.MilliValue() {
+			return ci.MilliValue() > cj.MilliValue()
+		}
+		return domains[i] < domains[j]
+	})


Duplicate of #3259943720 from this review pass — addressed in d6b6276. The function-level comment now documents the CPU-only sort as the dominant heuristic and explicitly calls out that headroomCoversPod still enforces correctness for memory-heavy pods even when ordering is suboptimal. Not switching to a composite score in this PR: see the longer reply on the other thread.
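
The interaction between the CPU-only ordering and the per-resource check can be illustrated as follows. This is a sketch with plain integers instead of `resource.Quantity`; `orderDomains` and `pickDomain` are hypothetical names, and the fit test inside `pickDomain` only approximates what headroomCoversPod is described as doing.

```go
package main

import (
	"fmt"
	"sort"
)

// headroom is a hypothetical simplification of one domain's remaining capacity.
type headroom struct {
	cpuMilli int64
	memBytes int64
}

// orderDomains sorts domain names by descending CPU headroom,
// breaking ties by name so the order is deterministic.
func orderDomains(remaining map[string]headroom) []string {
	domains := make([]string, 0, len(remaining))
	for d := range remaining {
		domains = append(domains, d)
	}
	sort.Slice(domains, func(i, j int) bool {
		ci, cj := remaining[domains[i]].cpuMilli, remaining[domains[j]].cpuMilli
		if ci != cj {
			return ci > cj
		}
		return domains[i] < domains[j]
	})
	return domains
}

// pickDomain walks domains in sorted order and returns the first one whose
// headroom covers both CPU and memory. The ordering is only a heuristic;
// the per-resource check is what decides admission.
func pickDomain(remaining map[string]headroom, cpuMilli, memBytes int64) (string, bool) {
	for _, d := range orderDomains(remaining) {
		h := remaining[d]
		if h.cpuMilli >= cpuMilli && h.memBytes >= memBytes {
			return d, true
		}
	}
	return "", false
}

func main() {
	const gib = int64(1) << 30
	remaining := map[string]headroom{
		"zone-a": {cpuMilli: 4000, memBytes: 1 * gib}, // most CPU, little memory
		"zone-b": {cpuMilli: 2000, memBytes: 8 * gib},
		"zone-c": {cpuMilli: 2000, memBytes: 8 * gib},
	}
	// A memory-heavy pod: zone-a sorts first but fails the memory check.
	d, ok := pickDomain(remaining, 500, 4*gib)
	fmt.Println(d, ok) // zone-b true
}
```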

BrunoChauvet · 2026-05-18T15:29:47Z

+	result := make(map[string]v1.ResourceList, len(nodesByDomain))
+	for domain, dnodes := range nodesByDomain {
+		var cpuMilli, memBytes, podSlots int64
+		for _, n := range dnodes {


Addressed in d6b6276. There is now a code-level comment at the ListPodsOnANode(..., nil) call site that explicitly ties the nil filter to nodeAvailableResources / fitsRequest in pkg/descheduler/node/node.go and notes that if the upstream filter ever tightens, this call site must be updated in the same change to keep the two halves of the gate aligned.
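
As a rough picture of the aggregation this loop performs, the sketch below sums allocatable minus requested resources over every node in a domain, counting all pods on the node (mirroring the nil filter). The types and the `domainHeadroom` name are hypothetical, and clamping each node's contribution at zero is an assumption rather than something the thread confirms.

```go
package main

import "fmt"

// podReq and nodeInfo are hypothetical simplifications of pod requests
// and node allocatable capacity.
type podReq struct{ cpuMilli, memBytes int64 }

type nodeInfo struct {
	allocCPUMilli, allocMemBytes, allocPods int64
	pods                                    []podReq // every pod on the node, unfiltered
}

type headroom struct{ cpuMilli, memBytes, podSlots int64 }

// clampZero keeps an overcommitted node from contributing negative headroom
// (an assumption made for this sketch).
func clampZero(v int64) int64 {
	if v < 0 {
		return 0
	}
	return v
}

// domainHeadroom aggregates per-node free capacity into one value per domain.
func domainHeadroom(nodesByDomain map[string][]nodeInfo) map[string]headroom {
	result := make(map[string]headroom, len(nodesByDomain))
	for domain, nodes := range nodesByDomain {
		var h headroom
		for _, n := range nodes {
			cpu, mem := n.allocCPUMilli, n.allocMemBytes
			for _, p := range n.pods {
				cpu -= p.cpuMilli
				mem -= p.memBytes
			}
			h.cpuMilli += clampZero(cpu)
			h.memBytes += clampZero(mem)
			h.podSlots += clampZero(n.allocPods - int64(len(n.pods)))
		}
		result[domain] = h
	}
	return result
}

func main() {
	const gib = int64(1) << 30
	byDomain := map[string][]nodeInfo{
		"zone-a": {
			{allocCPUMilli: 4000, allocMemBytes: 8 * gib, allocPods: 110,
				pods: []podReq{{1000, gib}, {1000, gib}}},
			{allocCPUMilli: 2000, allocMemBytes: 4 * gib, allocPods: 110,
				pods: []podReq{{500, gib}}},
		},
	}
	fmt.Printf("%+v\n", domainHeadroom(byDomain)["zone-a"])
}
```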

BrunoChauvet · 2026-05-18T15:29:51Z

+				if _, ok := podFitsSomeDomainWithHeadroom(getPodsAssignedToNode, aboveToEvict[k], nodesByDomain, remainingHeadroom); !ok {
+					d.logger.V(2).Info("ignoring pod for eviction: no target topology domain has fit + remaining headroom", "pod", klog.KObj(aboveToEvict[k]))
+					continue
+				}


Addressed in d6b6276. The chosen target domain is now logged at V(2) on admit (ZoneAwareNodeFit admitted pod; target domain headroom decremented), so the return value is exercised in production, not just in tests. Operators investigating an imbalanced fill pattern can grep these log lines to see which domain each admitted pod was steered toward.

* Log the chosen domain at V(2) when ZoneAwareNodeFit admits a candidate, so operators can see which under-loaded domain is being filled and diagnose imbalanced commits. The function's return value is no longer exercised only in tests.
* Document, in code, four properties of the gate that previously lived only in the RFC / review thread:
  - the headroom snapshot is taken once at balanceDomains entry and drifts conservatively as the (i, j) loop progresses;
  - the headroom decrement happens at admit time, before the actual eviction is recorded; rejection by a later eviction filter makes the gate conservative on subsequent candidates (deferring would require restructuring the eviction loop);
  - the nil filter on ListPodsOnANode in computeDomainHeadroom is deliberately aligned with nodeAvailableResources / fitsRequest, and must be updated in lock-step if that upstream filter ever tightens to exclude terminal/terminating pods;
  - the descending-CPU sort key is the dominant heuristic; for memory-heavy pods, the headroomCoversPod check still enforces correctness even when ordering is suboptimal.
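
The admit-time decrement described in this commit message, and why it errs on the conservative side, can be sketched as below. Everything here is illustrative: `candidate`, `passesLaterFilter`, and `run` are hypothetical names, and only CPU is tracked to keep the example short.

```go
package main

import "fmt"

// candidate is a hypothetical pod considered for eviction.
type candidate struct {
	name              string
	cpuMilli          int64
	passesLaterFilter bool // stand-in for any eviction filter that runs after the gate
}

// run applies the headroom gate, decrementing at admit time, and then the
// later filter. A pod rejected by the later filter has already consumed
// headroom, so subsequent candidates see less capacity than really exists.
func run(headroomCPU int64, cands []candidate) (evicted []string, remaining int64) {
	remaining = headroomCPU
	for _, c := range cands {
		if remaining < c.cpuMilli {
			continue // gate rejects: not enough tracked headroom
		}
		remaining -= c.cpuMilli // decrement happens here, before eviction is recorded
		if !c.passesLaterFilter {
			continue // headroom stays consumed even though nothing was evicted
		}
		evicted = append(evicted, c.name)
	}
	return evicted, remaining
}

func main() {
	evicted, remaining := run(1000, []candidate{
		{name: "p1", cpuMilli: 600, passesLaterFilter: false},
		{name: "p2", cpuMilli: 600, passesLaterFilter: true},
	})
	// p1 drains 600m and is then filtered out; p2 no longer fits in 400m.
	fmt.Println(evicted, remaining) // [] 400
}
```

The gate under-admits in this case but never over-admits, which is the trade-off the commit message accepts in exchange for not restructuring the eviction loop.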

k8s-ci-robot · 2026-05-18T15:32:43Z

@BrunoChauvet: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-descheduler-test-e2e-k8s-master-1-36	`d6b6276`	link	true	`/test pull-descheduler-test-e2e-k8s-master-1-36`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copilot AI review requested due to automatic review settings April 16, 2026 14:14

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 16, 2026

k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 16, 2026

k8s-ci-robot requested review from googs1025 and ingvagabund April 16, 2026 14:14

Copilot started reviewing on behalf of BrunoChauvet April 16, 2026 14:15 View session

Copilot AI reviewed Apr 16, 2026

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 16, 2026

BrunoChauvet force-pushed the feature/zone-aware-nodefit branch from 9ce74c6 to 1af0dcb Compare April 16, 2026 15:39

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 16, 2026

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 17, 2026

BrunoChauvet requested a review from Copilot April 20, 2026 16:51

Copilot started reviewing on behalf of BrunoChauvet April 20, 2026 16:51 View session

Copilot AI reviewed Apr 20, 2026

Comment thread ...amework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint_test.go Outdated

k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 21, 2026

BrunoChauvet requested a review from Copilot April 21, 2026 17:14

Copilot started reviewing on behalf of BrunoChauvet April 21, 2026 17:14 View session

Copilot AI reviewed Apr 21, 2026

BrunoChauvet added 2 commits May 17, 2026 08:15

Copilot AI review requested due to automatic review settings May 17, 2026 12:24

Copilot started reviewing on behalf of BrunoChauvet May 17, 2026 12:25 View session

Copilot AI reviewed May 17, 2026

style: group consecutive same-type params in headroomCoversPod (gofumpt)

a0263c9

CI's gofumpt v2.8.0 lint flagged the signature headroom v1.ResourceList, podReq v1.ResourceList which should be written as headroom, podReq v1.ResourceList

BrunoChauvet force-pushed the feature/zone-aware-nodefit branch from bb891e1 to 8f501c4 Compare May 18, 2026 14:23

Copilot AI review requested due to automatic review settings May 18, 2026 14:23

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 18, 2026

Copilot started reviewing on behalf of BrunoChauvet May 18, 2026 14:24 View session

Copilot AI reviewed May 18, 2026

BrunoChauvet added 2 commits May 18, 2026 10:55

Copilot AI review requested due to automatic review settings May 18, 2026 15:03

Copilot started reviewing on behalf of BrunoChauvet May 18, 2026 15:04 View session

Copilot AI reviewed May 18, 2026

BrunoChauvet added 2 commits May 18, 2026 11:11

Copilot AI review requested due to automatic review settings May 18, 2026 15:16

Copilot started reviewing on behalf of BrunoChauvet May 18, 2026 15:17 View session

Copilot AI reviewed May 18, 2026
