feat: increase Workload PodSet limit to 10 #11388

Open
yuluo-yx wants to merge 7 commits into kubernetes-sigs:main from yuluo-yx:0521-yuluo/feat

@yuluo-yx
Contributor

What type of PR is this?

/kind feature

Which issue(s) this PR fixes:

Fixes #11379

Does this PR introduce a user-facing change?

Increased the maximum number of PodSets per Workload from 8 to 16. As a result, RayCluster, RayJob, and RayService integrations now allow up to 15 worker groups, since one PodSet is reserved for the Ray head group.

Signed-off-by: yuluo-yx <yuluo08290126@gmail.com>
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels May 21, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yuluo-yx
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Code Review Process.

@k8s-ci-robot k8s-ci-robot requested review from mimowo and windsonsea May 21, 2026 15:22
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 21, 2026
@k8s-ci-robot
Contributor

Hi @yuluo-yx. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 21, 2026
@netlify

netlify Bot commented May 21, 2026

Deploy Preview for kubernetes-sigs-kueue ready!

🔨 Latest commit: 39fb219
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a1112d82b30a50008f5486c
😎 Deploy Preview: https://deploy-preview-11388--kubernetes-sigs-kueue.netlify.app

@tenzen-y
Member

/hold

I'd like to figure out whether increasing the number of PodSets degrades TAS performance, because we cannot revert this change afterward.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 21, 2026
@yuluo-yx
Contributor Author

> /hold
>
> I'd like to figure out whether increasing the number of PodSets degrades TAS performance, because we cannot revert this change afterward.

Agree. 👀

@mimowo
Contributor

mimowo commented May 21, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 21, 2026
@mimowo
Contributor

mimowo commented May 21, 2026

> I'd like to figure out whether increasing the number of PodSets degrades TAS performance, because we cannot revert this change afterward.

The performance of TAS (or the general scheduler) does not degrade unless a user actually requests more PodSets. Also, users can scale up Kueue (add more CPU/memory to the kueue deployment), but they cannot work around validation short of forking Kueue.

To mitigate the concern, maybe we can start with a smaller number like 10 or 12. That would also offset the additional "technical" PodSets for Ray: in #11260 we want to introduce a new PodSet for the RedisCleanup job, which would mean one fewer PodSet for "Workers".

wdyt?
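
For illustration, a minimal sketch of the worker-group arithmetic discussed here. It assumes the worker budget is simply the PodSet limit minus the "technical" PodSets (one for the Ray head today, one more if a RedisCleanup PodSet is added). The function name and parameters are hypothetical and are not taken from the Kueue code base.

```python
def max_worker_groups(podset_limit: int, reserved_podsets: int = 1) -> int:
    """Return how many PodSets remain for Ray worker groups.

    reserved_podsets counts the "technical" PodSets (assumed: 1 for the
    Ray head; 2 if a RedisCleanup PodSet is also reserved).
    """
    if reserved_podsets > podset_limit:
        raise ValueError("reserved PodSets exceed the PodSet limit")
    return podset_limit - reserved_podsets


# Compare the current limit (8) with the candidate limits from this thread.
for limit in (8, 10, 12, 16):
    head_only = max_worker_groups(limit)
    with_cleanup = max_worker_groups(limit, reserved_podsets=2)
    print(f"limit={limit}: workers={head_only}, with RedisCleanup={with_cleanup}")
```

Under these assumptions a limit of 10 leaves 9 worker groups today and 8 once a second technical PodSet is reserved.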

@tenzen-y
Member

> > I'd like to figure out whether increasing the number of PodSets degrades TAS performance, because we cannot revert this change afterward.
>
> The performance of TAS (or the general scheduler) does not degrade unless a user actually requests more PodSets. Also, users can scale up Kueue (add more CPU/memory to the kueue deployment), but they cannot work around validation short of forking Kueue.
>
> To mitigate the concern, maybe we can start with a smaller number like 10 or 12. That would also offset the additional "technical" PodSets for Ray: in #11260 we want to introduce a new PodSet for the RedisCleanup job, which would mean one fewer PodSet for "Workers".
>
> wdyt?

The computation cost of flavor assignment is N × M at worst, where N is the number of flavors and M is the number of PodSets. The TAS calculation costs more on top of that because it also considers the tree topology structure.

But I am assuming that 10 has less impact.
It will surely degrade TAS performance somewhat by increasing the number of pattern matches, but 10 sounds like a reasonable balance between performance and user capabilities.
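
To make the N × M bound concrete, here is a rough sketch that only counts worst-case (flavor, PodSet) pairs and compares candidate limits against the current limit of 8. It does not model the actual Kueue scheduler or the extra TAS topology work; the function names and the flavor count of 5 are hypothetical.

```python
def worst_case_flavor_checks(num_flavors: int, num_podsets: int) -> int:
    """Worst-case number of (flavor, PodSet) pairs to evaluate: N x M."""
    return num_flavors * num_podsets


def relative_growth(old_limit: int, new_limit: int) -> float:
    """Factor by which the N x M bound grows when only the PodSet limit changes."""
    return new_limit / old_limit


num_flavors = 5  # hypothetical flavor count, for illustration only
for limit in (8, 10, 16):
    checks = worst_case_flavor_checks(num_flavors, limit)
    growth = relative_growth(8, limit)
    print(f"PodSets={limit}: worst-case checks={checks}, growth vs 8 = {growth:.2f}x")
```

With the flavor count held fixed, moving the limit from 8 to 10 raises this bound by 1.25x, while moving to 16 doubles it.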

@mimowo
Contributor

mimowo commented May 21, 2026

@yuluo-yx could you adjust the PR to 10?

@yuluo-yx
Contributor Author

> @yuluo-yx could you adjust the PR to 10?

Got it, I'll update tonight.

@mimowo
Contributor

mimowo commented May 22, 2026

/retitle feat: increase Workload PodSet limit to 10
As per discussion

@k8s-ci-robot k8s-ci-robot changed the title from "feat: increase Workload PodSet limit to 16" to "feat: increase Workload PodSet limit to 10" May 22, 2026
@mimowo
Contributor

mimowo commented May 22, 2026

/unhold
as per #11388 (comment)

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 22, 2026
@k8s-ci-robot
Contributor

k8s-ci-robot commented May 22, 2026

@yuluo-yx: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kueue-test-e2e-main-1-36 | 4c65448 | link | true | /test pull-kueue-test-e2e-main-1-36 |

Labels

cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/feature (Categorizes issue or PR as related to a new feature.)
ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)

Development

Successfully merging this pull request may close these issues.

Increase the number of PodSets per Job

4 participants