Skip to content

POC: gate RayService zero-downtime upgrade with workload slicing#11264

Draft
kevin85421 wants to merge 2 commits into
kubernetes-sigs:mainfrom
kevin85421:poc/rayservice-upgrade-quota-gate
Draft

POC: gate RayService zero-downtime upgrade with workload slicing#11264
kevin85421 wants to merge 2 commits into
kubernetes-sigs:mainfrom
kevin85421:poc/rayservice-upgrade-quota-gate

Conversation

@kevin85421
Copy link
Copy Markdown
Contributor

Summary

POC for #11102. Prevents the pending RayCluster created during a zero-downtime RayService upgrade from running before Kueue admits a workload slice that covers its quota demand.

Depends on ray-project/kuberay#4841 (top-level RayService.Spec.Suspend + nested template-suspend semantics). The vendored rayservice_types.go is synced from a local KubeRay checkout with that PR applied.

Design

Action Owner
Child RayCluster born suspended KubeRay reads RayService.Spec.RayClusterSpec.Suspend=true template gate at creation
Persistent template gate Kueue's Suspend() sets nested Suspend=true; RunWithPodSetsInfo() deliberately leaves it true
Quota for active + pending during upgrade PodSets() lists live children via the ray.io/originated-from-cr-{name,crd} labels (mirrors KubeRay's RayServiceRayClustersAssociationOptions), unions PodSets by name and sums counts. Same keys across the 1↔2 child transitions so EnsureWorkloadSlices handles upgrade as scale-up and post-upgrade as scale-down
Release the gate Reconcile post-step unsuspendAdmittedChildren patches Spec.Suspend=false on each child once the latest workload slice is admitted; a race-guard checks the slice's PodSet counts cover the current children's required counts before patching, so the pending child stays gated while the new slice is still pending
Preemption / total stop Suspend() also sets top-level Spec.Suspend=true; KubeRay deletes all owned resources

Prerequisites for testing

See POC-TESTING.md for full build/deploy and the happy-path + limitation reproductions.

Known limitations

  • Heterogeneous PodSpec on the same group name: PodSets() keeps the first child's template, so resource-changing upgrades under-/over-account quota. Same-resource upgrades (rayVersion / image / env) are exact.
  • No webhook validation: a user can manually clear Spec.RayClusterSpec.Suspend, breaking the gate.
  • MultiKueue adapter not updated: rayservice_multikueue_adapter.go hasn't been taught the new suspend semantics.

Test plan

  • Build/deploy per POC-TESTING.md.
  • Happy path: 1 CPU/2 GiB ClusterQueue gates the pending RayCluster during a rayVersion bump; releasing quota lets the upgrade complete.
  • Limitation 1: resource-changing upgrade demonstrates quota miscount.
  • Limitation 2: manually clearing the template gate lets the pending pod run un-admitted.
  • (Out of scope) MultiKueue path.

🤖 Generated with Claude Code

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 17, 2026
@netlify
Copy link
Copy Markdown

netlify Bot commented May 17, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 990a514
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a0a3ff92da0190008a7e3e1

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kevin85421
Once this PR has been reviewed and has the lgtm label, please assign gabesaba for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 17, 2026
@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented May 17, 2026

CLA Missing ID

One or more co-authors of this pull request were not found. You must specify co-authors in commit message trailer via:

Co-authored-by: name <email>

Supported Co-authored-by: formats include:

  1. Anything <id+login@users.noreply.github.com> - it will locate your GitHub user by id part.
  2. Anything <login@users.noreply.github.com> - it will locate your GitHub user by login part.
  3. Anything <public-email> - it will locate your GitHub user by public-email part. Note that this email must be made public on Github.
  4. Anything <other-email> - it will locate your GitHub user by other-email part but only if that email was used before for any other CLA as a main commit author.
  5. login <any-valid-email> - it will locate your GitHub user by login part, note that login part must be at least 3 characters long.

Alternatively, if the co-author should not be included, remove the Co-authored-by: line from the commit message.

Please update your commit message(s) by doing git commit --amend and then git push [--force] and then request re-running CLA check via commenting on this pull request:

/easycla

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @kevin85421. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 17, 2026
@@ -67,6 +67,7 @@ type ClusterUpgradeOptions struct {
// +kubebuilder:default:=100
Copy link
Copy Markdown
Contributor Author

@kevin85421 kevin85421 May 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ray-project/kuberay#4841 hasn't been merged right now.

kevin85421 and others added 2 commits May 17, 2026 15:23
Switch RayService suspend semantics to KubeRay PR kubernetes-sigs#4841's top-level
Spec.Suspend (Kueue's stop switch) while keeping the nested
RayClusterSpec.Suspend=true as a persistent template gate, so any child
RayCluster KubeRay creates -- including the pending one during a
zero-downtime upgrade -- is born suspended.

Build PodSets from the live child RayClusters (union by group name, sum
counts) so the workload's quota reservation reflects active+pending
during the upgrade and routes through EnsureWorkloadSlices as a
scale-up/scale-down. Keys stay stable across the 1<->2 transition so the
slice chain is preserved.

Add a Reconcile post-step that unsuspends child RayClusters once the
matching workload slice is admitted, with a race-guard that verifies the
admitted slice's PodSet counts already cover the union of children's
required counts -- prevents prematurely unsuspending the pending child
before the upgrade slice is created.

Known POC limitations:
- Same group name with different PodSpecs across active/pending uses
  the first child's template, so quota is computed against that template.
- MultiKueue adapter does not yet propagate the new suspend semantics.
- No webhook validation guarding the persistent RayClusterSpec.Suspend
  template gate.

Vendored rayservice_types.go is synced from a local kuberay checkout with
PR kubernetes-sigs#4841 applied; deepcopy is unchanged since Spec.Suspend is a bool
value type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the happy path (1 CPU/2 GiB ClusterQueue gating the upgrade's
pending RayCluster) and reproduction recipes for the three known
limitations: heterogeneous PodSpec on the same group name, manual
tampering with the persistent RayClusterSpec.Suspend gate, and the
unfinished MultiKueue adapter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kevin85421 kevin85421 force-pushed the poc/rayservice-upgrade-quota-gate branch from 0bdc7b6 to 990a514 Compare May 17, 2026 22:23
// and the post-upgrade tear-down as a scale-down, without falling back to the
// non-slice path.
//
// POC limitation: when two children share a group name with different PodSpecs
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants