feat(tsc): add ZoneAwareNodeFit to RemovePodsViolatingTopologySpreadConstraint#1858
feat(tsc): add ZoneAwareNodeFit to RemovePodsViolatingTopologySpreadConstraint#1858BrunoChauvet wants to merge 22 commits into
Conversation
|
Welcome @BrunoChauvet! |
|
Hi @BrunoChauvet. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Pull request overview
This PR introduces an opt-in zoneAwareNodeFit configuration flag for the RemovePodsViolatingTopologySpreadConstraint plugin, intended to gate evictions based on per-zone scheduling capacity to reduce eviction churn.
Changes:
- Added
ZoneAwareNodeFitto plugin args, with defaults and deepcopy support. - Implemented zone-grouped “below ideal avg” node collection and a new fit-gating helper.
- Added unit/integration tests plus user/design documentation for the new flag.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/types.go | Adds ZoneAwareNodeFit argument and inline semantics comment. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/defaults.go | Defaults ZoneAwareNodeFit to false. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/defaults_test.go | Updates defaulting expectations and adds a preservation test for ZoneAwareNodeFit=true. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/zz_generated.deepcopy.go | Regenerates deepcopy logic for the new pointer field. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint.go | Adds zone-aware node grouping and an additional eviction gate. |
| pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint_test.go | Adds new ZoneAwareNodeFit-focused scenarios and a unit test for zone grouping. |
| docs/proposals/zone-aware-nodefit.md | New RFC-style design/proposal document. |
| README.md | Documents the new zoneAwareNodeFit option for users. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Code reviewFound 2 issues:
🤖 Generated with Claude Code If this code review was useful, please react with 👍. Otherwise, react with 👎. |
9ce74c6 to
1af0dcb
Compare
|
/easycla |
|
/ok-to-test |
|
/retest-required |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Reviewers correctly identified that the previous implementation reduced to OR(OR(domains)) = OR(merge(domains)), equivalent to topologyBalanceNodeFit's existing per-node fit check over the union of under-loaded domain nodes. The per-domain map keys were never used inside the plugin and the gate added nothing beyond a renamed alias of the existing flag. Redesign the criterion so per-domain grouping is load-bearing: - groupNodesByDomain classifies under-loaded nodes inline (no duplicated filterNodesBelowIdealAvg traversal). - computeDomainHeadroom builds a per-domain aggregate (cpu / memory / pod count) of allocatable minus already-requested resources, once per balanceDomains call. - podFitsSomeDomainWithHeadroom now requires BOTH a fitting node AND remaining aggregate headroom in some specific domain, decrementing that domain's headroom on commit. Pod N's decision depends on commitments 1..N-1, which is not expressible as OR(merge). This catches the churn case from kubernetes-sigs#1534/kubernetes-sigs#1067: when balanceDomains schedules N evictions toward an under-loaded domain that only has aggregate headroom for K<N pods, the existing flag stateless-passes all N (the indexer is not mutated mid-loop) and the scheduler returns the excess to the over-loaded domain. The cumulative gate caps eviction at K. Test cases replaced with five that exercise the new design (cumulative- overflow, baseline contrast, multi-domain redirection, per-node fit still required, default-off regression). Unit tests on the new helper cover deterministic ordering, headroom drain, and both rejection paths. RFC and README updated to drop the obsolete "independent gate when topologyBalanceNodeFit is off" framing.
…g tests The first revision of the test suite only included one case (ZANF=true with TBNF=false) that would fail under the previous OR(merge) implementation; the others either tested baselines or regressions. That left thin coverage of the exact bug the redesign fixes. - Add a case with both flags enabled (TBNF=true default + ZANF=true) that demonstrates the new gate caps cumulative load on top of the existing flag. - Tighten the multi-domain redirection test so zoneC also has bounded headroom: the previous impl would admit all 4 evictions; the new gate admits 3 (zoneB drains after 1, zoneC after 2, 4th rejected). - Add unit tests for groupNodesByDomain (label-based classification) and computeDomainHeadroom (allocatable - requested aggregation). Update the RFC's test plan to explicitly mark which cases catch the prior implementation's gap vs. which are regression coverage, and add a Mermaid sequence diagram contrasting topologyBalanceNodeFit's stateless per-node check against zoneAwareNodeFit's cumulative per-domain headroom tracking.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
Files not reviewed (1)
- pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/zz_generated.deepcopy.go: Language not supported
Comments suppressed due to low confidence (1)
pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint.go:467
ListPodsOnANodeis called with anilfilter, so pods in terminal phases (Succeeded/Failed) and pods being deleted are counted as consuming CPU/memory/pod-slot capacity. Those pods do not actually hold resources on the node from the scheduler's perspective, so the computed headroom can be systematically under-estimated, causing the gate to reject evictions that would in fact fit. Consider passing a filter that excludes terminal pods (similar to how the scheduler accounts for node usage).
podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil)
if err != nil {
continue
}
for _, p := range podsOnNode {
req, _ := utils.PodRequestsAndLimits(p)
if q, ok := req[v1.ResourceCPU]; ok {
cpuMilli -= q.MilliValue()
}
if q, ok := req[v1.ResourceMemory]; ok {
memBytes -= q.Value()
}
podSlots--
}
| // contribute only their raw allocatable (best-effort, conservatively over-counting | ||
| // rather than blocking eviction). | ||
| func computeDomainHeadroom(nodesByDomain map[string][]*v1.Node, nodeIndexer podutil.GetPodsAssignedToNodeFunc) map[string]v1.ResourceList { | ||
| result := make(map[string]v1.ResourceList, len(nodesByDomain)) | ||
| for domain, dnodes := range nodesByDomain { | ||
| var cpuMilli, memBytes, podSlots int64 | ||
| for _, n := range dnodes { | ||
| cpuMilli += n.Status.Allocatable.Cpu().MilliValue() | ||
| memBytes += n.Status.Allocatable.Memory().Value() | ||
| podSlots += n.Status.Allocatable.Pods().Value() | ||
| podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil) | ||
| if err != nil { | ||
| continue | ||
| } |
There was a problem hiding this comment.
Addressed in 72a0bfa. computeDomainHeadroom now fails closed on indexer error: the failing node is logged via klog.V(2).ErrorS with the node name and domain, and is omitted from the aggregate entirely (neither its allocatable nor any of its pod requests counted). The previous code path counted allocatable but skipped the subtraction step, which was the unsafe direction you flagged. A new unit test TestComputeDomainHeadroomFailsClosedOnIndexerError pins the new behaviour.
| domains := make([]string, 0, len(nodesByDomain)) | ||
| for d := range nodesByDomain { | ||
| domains = append(domains, d) | ||
| } | ||
| sort.Strings(domains) | ||
|
|
||
| for _, domain := range domains { | ||
| if !headroomCoversPod(remainingHeadroom[domain], podReq) { | ||
| continue | ||
| } | ||
| if !node.PodFitsAnyOtherNode(nodeIndexer, pod, nodesByDomain[domain]) { | ||
| continue | ||
| } | ||
| subtractPodFromHeadroom(remainingHeadroom, domain, podReq) | ||
| return domain, true | ||
| } | ||
| return "", false |
There was a problem hiding this comment.
Addressed in 72a0bfa. podFitsSomeDomainWithHeadroom now sorts domains by descending remaining CPU headroom with an alphabetical tie-breaker, instead of pure alphabetical order. This spreads load toward the roomiest under-loaded domain first while keeping deterministic test behaviour for ties. The doc comment on the function and the RFC test-plan table were updated accordingly, and the corresponding unit test was rewritten (commits to roomiest qualifying domain + equal headroom falls back to alphabetical for determinism).
| // groupNodesByDomain classifies the given nodes by their topology-domain label value. | ||
| // Nodes missing the label are skipped. Returns an empty (non-nil) map if no nodes have the label. | ||
| func groupNodesByDomain(nodes []*v1.Node, topologyKey string) map[string][]*v1.Node { | ||
| result := make(map[string][]*v1.Node) | ||
| for _, n := range nodes { | ||
| if v, ok := n.Labels[topologyKey]; ok { | ||
| result[v] = append(result[v], n) | ||
| } | ||
| } | ||
| return result | ||
| } | ||
|
|
||
| // computeDomainHeadroom returns, for each topology domain, the aggregate remaining | ||
| // resource headroom (allocatable minus already-requested) summed across the domain's | ||
| // nodes. Tracked resources: cpu, memory, pod count. Nodes whose pod listing fails | ||
| // contribute only their raw allocatable (best-effort, conservatively over-counting | ||
| // rather than blocking eviction). | ||
| func computeDomainHeadroom(nodesByDomain map[string][]*v1.Node, nodeIndexer podutil.GetPodsAssignedToNodeFunc) map[string]v1.ResourceList { | ||
| result := make(map[string]v1.ResourceList, len(nodesByDomain)) | ||
| for domain, dnodes := range nodesByDomain { | ||
| var cpuMilli, memBytes, podSlots int64 | ||
| for _, n := range dnodes { | ||
| cpuMilli += n.Status.Allocatable.Cpu().MilliValue() | ||
| memBytes += n.Status.Allocatable.Memory().Value() | ||
| podSlots += n.Status.Allocatable.Pods().Value() | ||
| podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil) | ||
| if err != nil { | ||
| continue | ||
| } | ||
| for _, p := range podsOnNode { | ||
| req, _ := utils.PodRequestsAndLimits(p) | ||
| if q, ok := req[v1.ResourceCPU]; ok { | ||
| cpuMilli -= q.MilliValue() | ||
| } | ||
| if q, ok := req[v1.ResourceMemory]; ok { | ||
| memBytes -= q.Value() | ||
| } | ||
| podSlots-- | ||
| } | ||
| } | ||
| result[domain] = v1.ResourceList{ | ||
| v1.ResourceCPU: *resource.NewMilliQuantity(cpuMilli, resource.DecimalSI), | ||
| v1.ResourceMemory: *resource.NewQuantity(memBytes, resource.BinarySI), | ||
| v1.ResourcePods: *resource.NewQuantity(podSlots, resource.DecimalSI), | ||
| } | ||
| } | ||
| return result | ||
| } |
There was a problem hiding this comment.
Already addressed in the PR description rewrite from 2026-05-18: the Changes table now references groupNodesByDomain, computeDomainHeadroom, podFitsSomeDomainWithHeadroom, headroomCoversPod, and subtractPodFromHeadroom, and the Test Plan calls out the corresponding unit-test functions by name.
| `zoneAwareNodeFit` tracks per-domain cumulative state. Use it when batched eviction | ||
| would otherwise push more pods toward an under-loaded domain than it can actually | ||
| absorb, causing the scheduler to send the excess back to the over-loaded domain | ||
| (eviction churn). |
There was a problem hiding this comment.
Fixed in 72a0bfa. A blank line was inserted before the zoneAwareNodeFit yaml fence so it renders reliably inside the surrounding prose, matching the pattern used by the topologyBalanceNodeFit example block immediately above.
|
Thank you @ingvagabund for the careful review — you saved this PR from shipping a no-op. You were right that the original design collapsed to |
CI's gofumpt v2.8.0 lint flagged the signature headroom v1.ResourceList, podReq v1.ResourceList which should be written as headroom, podReq v1.ResourceList
|
/retest pull-descheduler-unit-test-master-master |
|
/retest |
Add a TestTopologySpreadConstraint case where the under-loaded domain (zoneB) contains two nodes B1 and B2 that each have one existing pod. Aggregate headroom is the sum of both nodes' free capacity; a grouping bug that mis-bucketed nodes would under-count and yield 1 eviction instead of the expected 2. Addresses review feedback on grouping correctness coverage.
bb891e1 to
8f501c4
Compare
|
Force-pushed to strip /easycla |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.
Files not reviewed (1)
- pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/zz_generated.deepcopy.go: Language not supported
Comments suppressed due to low confidence (2)
pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint.go:457
- The error returned from ListPodsOnANode is silently swallowed with a bare
continue. In that case, the node's full allocatable is still added to the domain's headroom but none of its existing pod requests are subtracted, which significantly over-counts headroom for that node and can lead to admitting more evictions than the domain can actually absorb — the exact churn this feature is intended to prevent. At minimum the error should be logged (klog.V(2)) so operators can diagnose this case; ideally the domain should be marked as "headroom unknown" and excluded from the gate's positive admission decisions.
podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil)
if err != nil {
continue
}
pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint/topologyspreadconstraint.go:520
- headroomCoversPod only validates cpu, memory and pod count. Pods may request other resources tracked by fitsRequest (e.g., ephemeral-storage, hugepages, extended resources like GPUs); for those, the per-domain headroom gate will silently admit a pod that the scheduler/per-node fit check would later reject in the target domain, undermining the cumulative-headroom guarantee. Either iterate over all resource names present in podReq (mirroring fitsRequest's behaviour) or document explicitly that the gate is approximate and only covers cpu/memory/pods.
func headroomCoversPod(headroom, podReq v1.ResourceList) bool {
cpuHead := headroom[v1.ResourceCPU]
memHead := headroom[v1.ResourceMemory]
podsHead := headroom[v1.ResourcePods]
cpuReq := podReq[v1.ResourceCPU]
memReq := podReq[v1.ResourceMemory]
return cpuHead.MilliValue() >= cpuReq.MilliValue() &&
memHead.Value() >= memReq.Value() &&
podsHead.Value() >= 1
}
| func computeDomainHeadroom(nodesByDomain map[string][]*v1.Node, nodeIndexer podutil.GetPodsAssignedToNodeFunc) map[string]v1.ResourceList { | ||
| result := make(map[string]v1.ResourceList, len(nodesByDomain)) | ||
| for domain, dnodes := range nodesByDomain { | ||
| var cpuMilli, memBytes, podSlots int64 | ||
| for _, n := range dnodes { | ||
| cpuMilli += n.Status.Allocatable.Cpu().MilliValue() | ||
| memBytes += n.Status.Allocatable.Memory().Value() | ||
| podSlots += n.Status.Allocatable.Pods().Value() | ||
| podsOnNode, err := podutil.ListPodsOnANode(n.Name, nodeIndexer, nil) | ||
| if err != nil { | ||
| continue | ||
| } | ||
| for _, p := range podsOnNode { | ||
| req, _ := utils.PodRequestsAndLimits(p) | ||
| if q, ok := req[v1.ResourceCPU]; ok { | ||
| cpuMilli -= q.MilliValue() | ||
| } | ||
| if q, ok := req[v1.ResourceMemory]; ok { | ||
| memBytes -= q.Value() | ||
| } | ||
| podSlots-- | ||
| } | ||
| } | ||
| result[domain] = v1.ResourceList{ | ||
| v1.ResourceCPU: *resource.NewMilliQuantity(cpuMilli, resource.DecimalSI), | ||
| v1.ResourceMemory: *resource.NewQuantity(memBytes, resource.BinarySI), | ||
| v1.ResourcePods: *resource.NewQuantity(podSlots, resource.DecimalSI), | ||
| } | ||
| } | ||
| return result | ||
| } |
There was a problem hiding this comment.
Partial — the symptom you describe (over-counting headroom because the indexer view includes pods the scheduler would not count) is real, but nodeAvailableResources/fitsRequest in pkg/descheduler/node/node.go also calls ListPodsOnANode(..., nil) with no filter, so both sides see the same pod set today. Adding a terminal-pod filter only to computeDomainHeadroom would make the two halves of the gate disagree.
Fixed the more impactful sibling issue in 72a0bfa: indexer errors now fail closed instead of silently over-counting. If a future change tightens fitsRequests pod-set filter (e.g. excluding terminal/terminating pods), computeDomainHeadroom should adopt the same filter in the same change so the two stay aligned.
| domains := make([]string, 0, len(nodesByDomain)) | ||
| for d := range nodesByDomain { | ||
| domains = append(domains, d) | ||
| } | ||
| sort.Strings(domains) | ||
|
|
||
| for _, domain := range domains { | ||
| if !headroomCoversPod(remainingHeadroom[domain], podReq) { | ||
| continue | ||
| } | ||
| if !node.PodFitsAnyOtherNode(nodeIndexer, pod, nodesByDomain[domain]) { | ||
| continue | ||
| } | ||
| subtractPodFromHeadroom(remainingHeadroom, domain, podReq) | ||
| return domain, true | ||
| } | ||
| return "", false |
There was a problem hiding this comment.
Addressed in 72a0bfa. Domain iteration now uses descending remaining CPU headroom (with alphabetical tie-break for determinism), and the function-level doc comment documents the choice in code, not only in the RFC, so a future maintainer changing the sort comparator would have to read why first.
| `zoneAwareNodeFit` tracks per-domain cumulative state. Use it when batched eviction | ||
| would otherwise push more pods toward an under-loaded domain than it can actually | ||
| absorb, causing the scheduler to send the excess back to the over-loaded domain | ||
| (eviction churn). |
There was a problem hiding this comment.
Fixed in 72a0bfa: blank line inserted before the zoneAwareNodeFit yaml fence to match the surrounding example blocks and meet CommonMark expectations.
Prow's verify image upgraded to go1.26, which trips the supported-version check in hack/lib/go.sh and makes pull-descheduler-verify-master fail on every open PR. Replace the go1.23-1.25 regex with go1.24-1.26, matching the change already queued in kubernetes-sigs#1874 (v0.36.0 release prep). This is the minimal patch required to unblock verify-master for this PR and will conflict-resolve cleanly when kubernetes-sigs#1874 merges.
Address review feedback on the ZoneAwareNodeFit gate: * computeDomainHeadroom now fails closed on indexer error. Previously a failure in ListPodsOnANode left the node's allocatable counted but its pod requests un-subtracted, systematically over-counting headroom and re-introducing the eviction churn this gate is meant to prevent. The failing node is now logged via klog.V(2).ErrorS and omitted from the aggregate entirely. New TestComputeDomainHeadroomFailsClosedOnIndexerError pins the behaviour. * podFitsSomeDomainWithHeadroom now iterates domains by descending remaining CPU headroom with an alphabetical tie-breaker, instead of pure alphabetical order. Alphabetical-first drains zoneA before zoneB regardless of which domain currently has slack, leaving a later candidate rejected even though the other domain had room. Roomiest- first spreads load more evenly while remaining deterministic for testability. * README: insert blank line before the zoneAwareNodeFit yaml fence so the code block renders reliably inside surrounding prose. * RFC: update the helper-unit-test table to reflect the new ordering semantics, add the multi-node-per-domain row, and document the fail-closed indexer-error path.
|
Force-pushed 72a0bfa to address the Copilot review threads from 2026-05-17 and 2026-05-18:
Also pushed 1c222dc earlier to unblock /retest |
| sort.Slice(domains, func(i, j int) bool { | ||
| ci := remainingHeadroom[domains[i]][v1.ResourceCPU] | ||
| cj := remainingHeadroom[domains[j]][v1.ResourceCPU] | ||
| if ci.MilliValue() != cj.MilliValue() { | ||
| return ci.MilliValue() > cj.MilliValue() | ||
| } | ||
| return domains[i] < domains[j] | ||
| }) |
There was a problem hiding this comment.
Addressed via documentation in d6b6276 rather than a composite sort key. The CPU-only sort is intentional and the function-level doc comment now states that explicitly: CPU is the most commonly request-constrained resource for the workloads this gate spreads across topology domains, so a multi-resource composite would add complexity without changing behaviour for the typical case. Crucially headroomCoversPod still rejects domains that cannot absorb a memory-heavy pod, so the CPU sort only affects ordering, not correctness — a memory-heavy pod in a tight-memory domain falls through to the next sorted domain rather than being misadmitted.
| if zoneAwareNodeFit { | ||
| if _, ok := podFitsSomeDomainWithHeadroom(getPodsAssignedToNode, aboveToEvict[k], nodesByDomain, remainingHeadroom); !ok { | ||
| d.logger.V(2).Info("ignoring pod for eviction: no target topology domain has fit + remaining headroom", "pod", klog.KObj(aboveToEvict[k])) | ||
| continue | ||
| } | ||
| } | ||
|
|
||
| podsForEviction[aboveToEvict[k]] = struct{}{} |
There was a problem hiding this comment.
Acknowledged in d6b6276 with an inline comment at the call site. Deferring subtraction until a successful eviction is recorded would require restructuring the eviction loop to call back into the gate on success — that crosses a larger surface area than the gate currently owns. The current behaviour favours safety (admit fewer pods than nominally possible if a later filter rejects an admitted candidate) over throughput, which is the right direction for a gate whose stated purpose is to prevent over-admission churn. Happy to revisit if a real workload exhibits material under-admission caused by this.
| // groupNodesByDomain classifies the given nodes by their topology-domain label value. | ||
| // Nodes missing the label are skipped. Returns an empty (non-nil) map if no nodes have the label. | ||
| func groupNodesByDomain(nodes []*v1.Node, topologyKey string) map[string][]*v1.Node { | ||
| result := make(map[string][]*v1.Node) | ||
| for _, n := range nodes { | ||
| if v, ok := n.Labels[topologyKey]; ok { | ||
| result[v] = append(result[v], n) | ||
| } | ||
| } | ||
| return result | ||
| } |
There was a problem hiding this comment.
Noted. groupNodesByDomain does rebuild a similar map to the internal one in filterNodesBelowIdealAvg, but the two have different semantics: the internal map keeps all eligible nodes keyed by domain, then projects to a flat slice of only below-ideal nodes; groupNodesByDomain keys those filtered nodes again by domain. Folding them together is a worthwhile refactor but it changes filterNodesBelowIdealAvgs public contract (currently returns []*v1.Node) and affects unrelated callers. Filing as follow-up — leaving the duplication here keeps this PR focused on the gate semantics and lets the refactor land independently with its own test coverage.
| // When zoneAwareNodeFit is enabled, group the same under-loaded nodes by their | ||
| // topology-domain value and compute each domain's remaining aggregate headroom. | ||
| // The headroom map is mutated as evictions are committed within this call so that | ||
| // later candidate pods see headroom already consumed by earlier ones. | ||
| var nodesByDomain map[string][]*v1.Node | ||
| var remainingHeadroom map[string]v1.ResourceList | ||
| if zoneAwareNodeFit { | ||
| nodesByDomain = groupNodesByDomain(nodesBelowIdealAvg, tsc.TopologyKey) | ||
| remainingHeadroom = computeDomainHeadroom(nodesByDomain, getPodsAssignedToNode) | ||
| } |
There was a problem hiding this comment.
Addressed in d6b6276. The comment at the call site now explicitly states that the grouping and headroom snapshot are taken once at entry and are not recomputed as the (i, j) loop progresses, and that the snapshots drift conservatively: a domain saturating mid-batch stops admitting via the tracked headroom drain, and a domain falling further below idealAvg is simply not promoted into the gates view. Recomputing per iteration would cost an extra ListPodsOnANode sweep across every domain at every (i, j) step.
The pinned v2.8.0 binary is built with go1.25; under the prow runner's go1.26 toolchain it panics with "file requires newer Go version go1.26 (application built with go1.25)" while loading stdlib imports, which is why pull-descheduler-verify-master started failing on every PR after the runner upgrade. v2.12.2 (released 2026-05-06) is built with go1.26.2 and runs clean against the current tree (0 issues, no fmt diff). Pairs with the hack/lib/go.sh regex update in 1c222dc.
The install.sh checksum database doesn't match v2.12.2's tarball (got fd3a137c..., expected 8df580d2...), so the curl-based install bails. Switching to go install — same approach kubernetes-sigs#1874 already queues — pulls the binary through the module proxy with a verifiable go.sum, no external checksum DB needed.
| LDFLAGS=-ldflags "-X ${LDFLAG_LOCATION}.version=${VERSION} -X ${LDFLAG_LOCATION}.buildDate=${BUILD} -X ${LDFLAG_LOCATION}.gitbranch=${BRANCH} -X ${LDFLAG_LOCATION}.gitsha1=${SHA1}" | ||
|
|
||
| GOLANGCI_VERSION := v2.8.0 | ||
| GOLANGCI_VERSION := v2.12.2 |
There was a problem hiding this comment.
Fair point on scope hygiene. The two CI commits (hack/lib/go.sh accepting go1.26 and the golangci-lint v2.8.0→v2.12.2 / install-method bump) were necessary in-PR fixes because the prow runner upgraded to go1.26 mid-day on 2026-05-18 and broke pull-descheduler-verify-master on every open PR — there was no separate landed fix to rebase onto, and PR #1874 (v0.36 release prep) carries the same Makefile change but has not merged. Without these, this PR cannot be evaluated by CI at all. If you would prefer them split out, I can revert these two commits here and open a standalone PR with just those — happy to do whichever you prefer. The PR description should mention these regardless; updating it.
| sort.Slice(domains, func(i, j int) bool { | ||
| ci := remainingHeadroom[domains[i]][v1.ResourceCPU] | ||
| cj := remainingHeadroom[domains[j]][v1.ResourceCPU] | ||
| if ci.MilliValue() != cj.MilliValue() { | ||
| return ci.MilliValue() > cj.MilliValue() | ||
| } | ||
| return domains[i] < domains[j] | ||
| }) |
There was a problem hiding this comment.
Duplicate of #3259943720 from this review pass — addressed in d6b6276. The function-level comment now documents the CPU-only sort as the dominant heuristic and explicitly calls out that headroomCoversPod still enforces correctness for memory-heavy pods even when ordering is suboptimal. Not switching to a composite score in this PR: see the longer reply on the other thread.
| result := make(map[string]v1.ResourceList, len(nodesByDomain)) | ||
| for domain, dnodes := range nodesByDomain { | ||
| var cpuMilli, memBytes, podSlots int64 | ||
| for _, n := range dnodes { |
There was a problem hiding this comment.
Addressed in d6b6276. There is now a code-level comment at the ListPodsOnANode(..., nil) call site that explicitly ties the nil filter to nodeAvailableResources / fitsRequest in pkg/descheduler/node/node.go and notes that if the upstream filter ever tightens, this call site must be updated in the same change to keep the two halves of the gate aligned.
| if _, ok := podFitsSomeDomainWithHeadroom(getPodsAssignedToNode, aboveToEvict[k], nodesByDomain, remainingHeadroom); !ok { | ||
| d.logger.V(2).Info("ignoring pod for eviction: no target topology domain has fit + remaining headroom", "pod", klog.KObj(aboveToEvict[k])) | ||
| continue | ||
| } |
There was a problem hiding this comment.
Addressed in d6b6276. The chosen target domain is now logged at V(2) on admit (ZoneAwareNodeFit admitted pod; target domain headroom decremented), so the return value is exercised in production, not just in tests. Operators investigating an imbalanced fill pattern can grep these log lines to see which domain each admitted pod was steered toward.
* Log the chosen domain at V(2) when ZoneAwareNodeFit admits a candidate,
so operators can see which under-loaded domain is being filled and
diagnose imbalanced commits. The function's return value is no longer
exercised only in tests.
* Document, in code, three properties of the gate that previously lived
only in the RFC / review thread:
- the headroom snapshot is taken once at balanceDomains entry and
drifts conservatively as the (i, j) loop progresses;
- the headroom decrement happens at admit time, before the actual
eviction is recorded; rejection by a later eviction filter makes
the gate conservative on subsequent candidates (deferring would
require restructuring the eviction loop);
- the nil filter on ListPodsOnANode in computeDomainHeadroom is
deliberately aligned with nodeAvailableResources / fitsRequest,
and must be updated in lock-step if that upstream filter ever
tightens to exclude terminal/terminating pods;
- the descending-CPU sort key is the dominant heuristic; for
memory-heavy pods, the headroomCoversPod check still enforces
correctness even when ordering is suboptimal.
|
@BrunoChauvet: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
zoneAwareNodeFit(opt-in, defaultfalse) toRemovePodsViolatingTopologySpreadConstraintArgstopologyBalanceNodeFit, which is a stateless per-node fit check — multiple candidates can all individually look like they fit on the same target node, leading to overcommit churnMotivation
Upstream issues #1534 and #1067 describe eviction churn where the existing stateless gate admits more candidates than the target domain can actually absorb. This PR adds a stateful per-domain capacity gate as a non-breaking, opt-in flag.
Full design rationale, sequence diagram contrasting the two flags, and alternatives considered:
docs/proposals/zone-aware-nodefit.mdChanges
types.goZoneAwareNodeFit *boolfielddefaults.goZoneAwareNodeFittofalsezz_generated.deepcopy.gotopologyspreadconstraint.gogroupNodesByDomain,computeDomainHeadroom,podFitsSomeDomainWithHeadroom,headroomCoversPod,subtractPodFromHeadroom, and the cumulative-headroom integration inbalanceDomainstopologyspreadconstraint_test.goTestTopologySpreadConstraintcases for cumulative-overflow, baseline contrast, multi-domain redirection, per-node-fit still required, multi-node-per-domain aggregation, and default-off regression; plusTestGroupNodesByDomain,TestComputeDomainHeadroom,TestPodFitsSomeDomainWithHeadroomunit testsdocs/proposals/zone-aware-nodefit.mdREADME.mdhack/lib/go.sh,Makefilegolangci-lintto v2.12.2 (built with go1.26.2) viago install. Required to makepull-descheduler-verify-mastergreen after the prow runner upgraded to go1.26 mid-2026-05-18; the same Makefile changes are queued in #1874. Happy to split out if preferred.Test Plan
ZoneAwareNodeFit=true+TopologyBalanceNodeFit=true: cumulative gate caps evictions even when stateless check would admit more (gap-catching: would have evicted 3 under priorOR(merge)design; now evicts 1)ZoneAwareNodeFit=truealone: aggregate-headroom drain admits exactly one eviction per fitting domain (gap-catching)TopologyBalanceNodeFit=truealone: baseline that does NOT cap cumulative load (contrast case — evicts 3)ZoneAwareNodeFit=false(default): existingTopologyBalanceNodeFitbehaviour unchangedTestGroupNodesByDomainunit test: label-based classification, multi-node grouping, missing-label skipTestComputeDomainHeadroomunit test: allocatable-minus-requested aggregation across nodesTestPodFitsSomeDomainWithHeadroomunit test: deterministic domain ordering, headroom drain, per-node-fit rejection, aggregate-headroom rejection