TAS: Stop implicitly assuming ResourceFlavor:Node = 1:N #11210

Open
tenzen-y wants to merge 6 commits into kubernetes-sigs:main from tenzen-y:allow-multiple-flavors-for-node

Member

@tenzen-y tenzen-y commented May 15, 2026

What type of PR is this?

/kind bug
/area tas

What this PR does / why we need it:

This PR fixes a TAS over-subscription bug: when multiple ResourceFlavors reference the same Topology and overlap on physical nodes, each flavor tracks usage independently, so sibling flavors believe a shared node is only partially used.

This PR refines the TAS cache so that sibling flavors aggregate their usage.
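
To make the failure mode concrete, here is a minimal sketch in Go. It is not the Kueue implementation: the type `perFlavorUsage`, the functions `fitsPerFlavor` and `fitsAggregated`, and the capacity value are hypothetical names chosen for illustration. It assumes a single node that two sibling flavors both select.

```go
package main

import "fmt"

// nodeCapacity is the CPU capacity of every node (hypothetical value).
const nodeCapacity = 4

// perFlavorUsage maps flavor -> node -> used CPUs (hypothetical type).
type perFlavorUsage map[string]map[string]int

// fitsPerFlavor models the buggy check: it only looks at the usage
// recorded under the flavor being evaluated.
func (u perFlavorUsage) fitsPerFlavor(flavor, node string, req int) bool {
	return u[flavor][node]+req <= nodeCapacity
}

// fitsAggregated sums the usage of all flavors on the node, which is
// what sibling flavors sharing one Topology need.
func (u perFlavorUsage) fitsAggregated(node string, req int) bool {
	total := 0
	for _, nodes := range u {
		total += nodes[node]
	}
	return total+req <= nodeCapacity
}

func main() {
	usage := perFlavorUsage{
		"flavor-a": {"node-1": 3}, // 3 of 4 CPUs already used via flavor-a
		"flavor-b": {},            // flavor-b has recorded nothing on node-1
	}
	fmt.Println(usage.fitsPerFlavor("flavor-b", "node-1", 3)) // true: over-subscription
	fmt.Println(usage.fitsAggregated("node-1", 3))            // false: request rejected
}
```

With per-flavor accounting the second request is admitted and the node ends up with 6 CPUs requested against a capacity of 4; the aggregated check rejects it.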

Which issue(s) this PR fixes:

Fixes #10659

Special notes for your reviewer:

#10657 is the PR that reproduces the problem.

Does this PR introduce a user-facing change?

TAS: fix over-subscription of nodes that belong to multiple ResourceFlavors sharing the same hostname-leaf Topology.

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. area/tas Topology-Aware Scheduling labels May 15, 2026
@netlify

netlify Bot commented May 15, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 8bfa984
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a0b47b591dd030008c0210d

@k8s-ci-robot k8s-ci-robot requested a review from olekzabl May 15, 2026 05:53
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from pajakd May 15, 2026 05:53
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 15, 2026
@tenzen-y
Member Author

/test all

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 15, 2026
@tenzen-y tenzen-y force-pushed the allow-multiple-flavors-for-node branch from bcba589 to f5595ef Compare May 15, 2026 07:05
@tenzen-y
Member Author

/test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 15, 2026
@tenzen-y tenzen-y marked this pull request as ready for review May 15, 2026 07:35
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2026
@tenzen-y tenzen-y force-pushed the allow-multiple-flavors-for-node branch from f5595ef to f1c6d1d Compare May 15, 2026 07:35
@tenzen-y
Member Author

/hold
for reviewers

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 15, 2026
@mimowo
Contributor

mimowo commented May 15, 2026

cc @pajakd @PBundyra ptal

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y tenzen-y force-pushed the allow-multiple-flavors-for-node branch from f1c6d1d to 4ab4a93 Compare May 15, 2026 15:24
Comment thread pkg/cache/scheduler/snapshot.go Outdated
if features.Enabled(features.TopologyAwareScheduling) {
for flavor, cache := range c.tasCache.Clone() {
tasSnapshots[flavor] = cache.snapshot(log, c.tasCache.nodesCache.find(cache.flavor.NodeLabels, cache.topology.Levels))
flavorClone := c.tasCache.Clone()
Contributor

Can we move it closer to its first usage?

Member Author

Yes, sure.

Member Author

I addressed it in 5290147

Comment on lines +201 to +205
// One shared assumedUsage map per TopologyName for this workload. PodSets
// landing on different sibling flavors (same Topology, hostname leaf)
// reserve against the same map, preventing intra-workload self-overlap on
// a shared physical node. Cache-write-time aggregation does not cover
// in-flight reservations because pending workloads have not hit addUsage.
Contributor

This is tested in the test "multi-PodSet workload across sibling flavors must not self-overlap on shared node", right?

Member Author

Yes, exactly, right.
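
The intra-workload case discussed in this thread can be sketched as follows. This is a simplified model under stated assumptions, not the code in the PR: `podSet`, `assumedUsage`, `reserve`, and `placeOnNode` are hypothetical names, and every PodSet is forced onto the same node to keep the example short. The only difference between the two runs is the key used to look up the assumed-usage map.

```go
package main

import "fmt"

// capacityPerNode is a hypothetical CPU capacity for the shared node.
const capacityPerNode = 4

// assumedUsage holds in-flight reservations (node -> CPUs) of one workload.
type assumedUsage map[string]int

// podSet is a hypothetical, stripped-down PodSet request.
type podSet struct {
	flavor   string
	topology string
	request  int
}

// reserve records req on node if it fits on top of the cached usage and
// the reservations this workload has already made in the given map.
func reserve(cached map[string]int, assumed assumedUsage, node string, req int) bool {
	if cached[node]+assumed[node]+req > capacityPerNode {
		return false
	}
	assumed[node] += req
	return true
}

// placeOnNode tries to reserve every PodSet on node. keyOf decides which
// assumedUsage map a PodSet reserves against.
func placeOnNode(podSets []podSet, node string, keyOf func(podSet) string) []bool {
	cached := map[string]int{} // nothing admitted yet, so the cache is empty
	assumedByKey := map[string]assumedUsage{}
	results := make([]bool, 0, len(podSets))
	for _, ps := range podSets {
		key := keyOf(ps)
		if assumedByKey[key] == nil {
			assumedByKey[key] = assumedUsage{}
		}
		results = append(results, reserve(cached, assumedByKey[key], node, ps.request))
	}
	return results
}

func main() {
	podSets := []podSet{
		{flavor: "tas-flavor-a", topology: "default", request: 3},
		{flavor: "tas-flavor-b", topology: "default", request: 3},
	}
	// One map per flavor: both PodSets fit, 6 CPUs land on a 4-CPU node.
	fmt.Println(placeOnNode(podSets, "node-1", func(ps podSet) string { return ps.flavor })) // [true true]
	// One map per Topology: the second PodSet sees the first reservation.
	fmt.Println(placeOnNode(podSets, "node-1", func(ps podSet) string { return ps.topology })) // [true false]
}
```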

}
})

ginkgo.It("should admit the pending workload to tas-flavor-b on the free node when tas-flavor-a is full", func() {
Contributor

How would this test fail without the fix?

Member Author

wantCalls: 3,
wantCollected: map[string]int{"a": 1, "b": 2, "c": 3},
},
"early termination: stops on first false": {
Contributor

Shouldn't we check the value of wantCollected in this test? This is the only test where it should differ from the seed, no?

Member Author

Oh, good point. You're right.

Member Author

I fixed this in 70b8434

The seed is a map, so the iteration order is not stable and the exact collected entry cannot be predicted. Hence, I added a check that every collected key-value pair matches the corresponding entry in the seed.
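
A minimal sketch of that kind of check, assuming a plain map in place of SyncMap: `rangeMap` and `isSubset` are hypothetical helpers, not functions from the PR. Because Go randomizes map iteration order, the assertion is on the number of calls and on subset membership rather than on a specific key.

```go
package main

import "fmt"

// rangeMap calls f for each entry until f returns false. It stands in
// for a Range method with early termination.
func rangeMap(m map[string]int, f func(key string, value int) bool) {
	for k, v := range m {
		if !f(k, v) {
			return
		}
	}
}

// isSubset reports whether every key-value pair in got also appears in seed.
func isSubset(got, seed map[string]int) bool {
	for k, v := range got {
		if sv, ok := seed[k]; !ok || sv != v {
			return false
		}
	}
	return true
}

func main() {
	seed := map[string]int{"a": 1, "b": 2, "c": 3}
	collected := map[string]int{}
	calls := 0
	rangeMap(seed, func(k string, v int) bool {
		calls++
		collected[k] = v
		return false // stop on the first entry, whichever it is
	})
	fmt.Println(calls, len(collected), isSubset(collected, seed)) // 1 1 true
}
```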

Comment thread pkg/util/maps/maps.go
return old, existed
}

func (dwc *SyncMap[K, V]) Range(f func(key K, value V) bool) {
Contributor

Since this function holds only the read lock (RLock), calling it with an f(k, v) that modifies the underlying map could be dangerous (deadlock?). Shouldn't we add a comment warning about this?

Member Author

Yes, good point.
I'm wondering if we should take the RW lock instead of RLock here.

Any preference between adding a comment and taking the RW lock?

Member Author

I just added a comment in 8bfa984, because a writable Range would cause performance regressions for SyncMap.
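
For readers unfamiliar with the hazard, here is a small sketch. The `syncMap` type below is a hypothetical stand-in, not the `SyncMap` in pkg/util/maps; it only assumes a map guarded by a `sync.RWMutex`. Go's `sync.RWMutex` is not reentrant, so a callback that takes the write lock while `Range` holds the read lock on the same goroutine would block forever. The example shows one safe alternative: collect keys inside `Range` and mutate after it returns.

```go
package main

import (
	"fmt"
	"sync"
)

// syncMap is a hypothetical RWMutex-guarded map used only for illustration.
type syncMap[K comparable, V any] struct {
	mu sync.RWMutex
	m  map[K]V
}

func newSyncMap[K comparable, V any]() *syncMap[K, V] {
	return &syncMap[K, V]{m: make(map[K]V)}
}

func (s *syncMap[K, V]) Add(k K, v V) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] = v
}

func (s *syncMap[K, V]) Delete(k K) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.m, k)
}

func (s *syncMap[K, V]) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.m)
}

// Range calls f for each entry while holding the read lock.
// f must not call Add or Delete on the same map: those take the write
// lock, which cannot be acquired while this goroutine holds the read lock.
func (s *syncMap[K, V]) Range(f func(key K, value V) bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for k, v := range s.m {
		if !f(k, v) {
			return
		}
	}
}

func main() {
	s := newSyncMap[string, int]()
	s.Add("a", 1)
	s.Add("b", 0)

	// Safe pattern: only read inside Range, mutate once it has returned.
	var zeroKeys []string
	s.Range(func(k string, v int) bool {
		if v == 0 {
			zeroKeys = append(zeroKeys, k)
		}
		return true
	})
	for _, k := range zeroKeys {
		s.Delete(k)
	}
	fmt.Println(s.Len()) // 1
}
```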

for flavor, cache := range flavorClone {
nodes := c.tasCache.nodesCache.find(cache.flavor.NodeLabels, cache.topology.Levels)
var sharedUsage map[utiltas.TopologyDomainID]resources.Requests
if cache.topology.Usage != nil {
Contributor

In the loop above we don't initialize sharedSnapshotUsage[topologyName] if usageLength == 0, but here we don't check for that. If a topology has zero usage, it might still need the shared map (for example, if the first workload has multiple PodSets)?

Member Author

Good catch. We should initialize sharedSnapshotUsage[topologyName] here when the previous loop skipped it because usageLength was zero.

Member Author

@tenzen-y tenzen-y May 18, 2026

I addressed this in 1be5bd2

sharedSnapshotUsage[topologyName] is now always initialized, even when usageLength is zero.
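
The point of this thread can be illustrated with a short sketch. It is an assumption-laden model, not the code from 1be5bd2: `topologyInfo` and `buildSharedUsage` are hypothetical, plain maps replace the real usage types, and a nil usage map is treated the same as an empty one. What it shows is why an entry must exist for a topology with no usage: the first workload still needs a map to write its reservations into.

```go
package main

import "fmt"

// topologyInfo is a hypothetical stand-in; usage may be nil or empty.
type topologyInfo struct {
	usage map[string]int // domain -> used CPUs
}

// buildSharedUsage returns one cloned usage map per topology. An entry is
// created even when the topology has no recorded usage.
func buildSharedUsage(topologies map[string]topologyInfo) map[string]map[string]int {
	shared := make(map[string]map[string]int, len(topologies))
	for name, info := range topologies {
		cloned := make(map[string]int, len(info.usage))
		for domain, used := range info.usage {
			cloned[domain] = used
		}
		shared[name] = cloned
	}
	return shared
}

func main() {
	shared := buildSharedUsage(map[string]topologyInfo{
		"busy":  {usage: map[string]int{"node-1": 2}},
		"empty": {},
	})
	// Writing into shared["empty"] would panic if the entry were a nil map.
	shared["empty"]["node-1"] += 3
	fmt.Println(shared["busy"]["node-1"], shared["empty"]["node-1"]) // 2 3
}
```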

tasFlavorCache := c.TASFlavors[tasFlavor]
flvResult := tasFlavorCache.FindTopologyAssignmentsForFlavor(flavorTASRequests, options...)
flvOpts := options
if tasFlavorCache != nil && tasFlavorCache.isLowestLevelNode {
Contributor

@pajakd pajakd May 18, 2026

If tasFlavorCache were nil, the line below this if would panic. The comment above says that "tasFlavor is already in the snapshot", so I think tasFlavorCache != nil should either be removed entirely or changed to a guard clause.

Member Author

That totally makes sense. Let's remove tasFlavorCache != nil because we can expect that it has already been checked as described in the above code comment.

Member Author

I addressed it in 3afdebb
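
The PR resolved this by removing the nil check. For comparison, the guard-clause alternative the reviewer mentioned could look like the sketch below. Everything in it is hypothetical (`flavorSnapshot`, `optionsFor`, `errFlavorNotFound`, and the option string); it only illustrates checking once, up front, instead of mixing a nil check with unguarded uses.

```go
package main

import (
	"errors"
	"fmt"
)

// flavorSnapshot is a hypothetical stand-in for a per-flavor TAS snapshot.
type flavorSnapshot struct {
	isLowestLevelNode bool
}

// options dereferences the receiver, so calling it on nil would panic.
func (f *flavorSnapshot) options() []string {
	if f.isLowestLevelNode {
		return []string{"shared-assumed-usage"} // hypothetical option name
	}
	return nil
}

var errFlavorNotFound = errors.New("TAS flavor not found in snapshot")

// optionsFor uses a guard clause: a missing or nil snapshot is rejected
// before any field or method of it is touched.
func optionsFor(flavors map[string]*flavorSnapshot, name string) ([]string, error) {
	snap, ok := flavors[name]
	if !ok || snap == nil {
		return nil, errFlavorNotFound
	}
	return snap.options(), nil
}

func main() {
	flavors := map[string]*flavorSnapshot{
		"tas-flavor-a": {isLowestLevelNode: true},
	}
	opts, err := optionsFor(flavors, "tas-flavor-a")
	fmt.Println(opts, err) // [shared-assumed-usage] <nil>
	_, err = optionsFor(flavors, "missing")
	fmt.Println(err) // TAS flavor not found in snapshot
}
```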

tenzen-y added 4 commits May 19, 2026 01:19
…cMapRange UT

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y tenzen-y force-pushed the allow-multiple-flavors-for-node branch from bb9dd0a to 3afdebb Compare May 18, 2026 17:05
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y
Member Author

@pajakd Thank you for your review, I addressed all your comments, PTAL, thank you 🙏

@mimowo
Contributor

mimowo commented May 18, 2026

@tenzen-y please rebase. I also think it would be good to squash the commits in case we cherry-pick.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 18, 2026
@k8s-ci-robot
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Comment on lines +195 to +208
c.tasCache.RLock()
for topologyName, info := range c.tasCache.topologies {
	if info.Usage == nil {
		continue
	}
	clonedUsage := make(map[utiltas.TopologyDomainID]resources.Requests, info.Usage.Len())

	info.Usage.Range(func(domainID utiltas.TopologyDomainID, req resources.Requests) bool {
		clonedUsage[domainID] = req.Clone()
		return true
	})
	sharedSnapshotUsage[topologyName] = clonedUsage
}
c.tasCache.RUnlock()
Contributor

Extract this code into a helper function so that we can use the Lock / defer Unlock pattern.
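
A sketch of what such a helper could look like, under simplifying assumptions: `tasCache` and `cloneUsage` are hypothetical names, and plain `map[string]int` values replace the real usage types. Moving the loop into its own function lets `defer` release the read lock on every return path, including a panic inside the loop.

```go
package main

import (
	"fmt"
	"sync"
)

// tasCache is a hypothetical, reduced cache: topology -> domain -> used CPUs.
type tasCache struct {
	sync.RWMutex
	topologies map[string]map[string]int
}

// cloneUsage copies the usage of every topology while holding the read
// lock. The deferred RUnlock runs however the function exits.
func (c *tasCache) cloneUsage() map[string]map[string]int {
	c.RLock()
	defer c.RUnlock()

	out := make(map[string]map[string]int, len(c.topologies))
	for name, usage := range c.topologies {
		cloned := make(map[string]int, len(usage))
		for domain, used := range usage {
			cloned[domain] = used
		}
		out[name] = cloned
	}
	return out
}

func main() {
	c := &tasCache{topologies: map[string]map[string]int{
		"default": {"node-1": 2},
	}}
	snapshot := c.cloneUsage()
	snapshot["default"]["node-1"] = 99 // mutating the copy leaves the cache untouched
	fmt.Println(c.topologies["default"]["node-1"]) // 2
}
```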

Contributor

@mimowo mimowo left a comment

Given the number of non-trivial changes, I would consider a feature gate for the fix, like TASHandleOverlappingFlavors.

@mimowo
Contributor

mimowo commented May 18, 2026

cc @Ladicle, who already has a pretty solid understanding of TAS: maybe you could also give it a pass.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/tas Topology-Aware Scheduling cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

TAS: Kueue over-subscribes Node capacities when a Node belongs to multiple ResourceFlavors

4 participants