POC: gate RayService zero-downtime upgrade with workload slicing #11264
kevin85421 wants to merge 2 commits
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it.
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: kevin85421.
One or more co-authors of this pull request were not found.
Hi @kevin85421. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test.
@@ -67,6 +67,7 @@ type ClusterUpgradeOptions struct {
// +kubebuilder:default:=100
ray-project/kuberay#4841 hasn't been merged yet.
Switch RayService suspend semantics to KubeRay PR ray-project/kuberay#4841's top-level Spec.Suspend (Kueue's stop switch) while keeping the nested RayClusterSpec.Suspend=true as a persistent template gate, so any child RayCluster KubeRay creates, including the pending one during a zero-downtime upgrade, is born suspended.
Build PodSets from the live child RayClusters (union by group name, sum counts) so the workload's quota reservation reflects active+pending during the upgrade and routes through EnsureWorkloadSlices as a scale-up/scale-down. Keys stay stable across the 1<->2 transition so the slice chain is preserved.
Add a Reconcile post-step that unsuspends child RayClusters once the matching workload slice is admitted, with a race-guard that verifies the admitted slice's PodSet counts already cover the union of children's required counts. This prevents prematurely unsuspending the pending child before the upgrade slice is created.
Known POC limitations:
- Same group name with different PodSpecs across active/pending uses the first child's template, so quota is computed against that template.
- MultiKueue adapter does not yet propagate the new suspend semantics.
- No webhook validation guarding the persistent RayClusterSpec.Suspend template gate.
Vendored rayservice_types.go is synced from a local kuberay checkout with PR ray-project/kuberay#4841 applied; deepcopy is unchanged since Spec.Suspend is a bool value type.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
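As an illustration of the "union by group name, sum counts" step in the commit message above, here is a minimal sketch. The `PodSet` struct and `unionPodSets` function are hypothetical, simplified stand-ins and are not the types or helpers from this PR; sorting the result by name is an assumption made only to keep the output deterministic.

```go
package main

import "sort"

// PodSet is a hypothetical, simplified stand-in for a workload PodSet:
// only the group name and the replica count are modelled.
type PodSet struct {
	Name  string
	Count int32
}

// unionPodSets merges the PodSets of every live child RayCluster.
// Groups are keyed by name and counts for the same name are summed,
// so one child yields its own counts and two children yield the
// active+pending total under the same keys.
func unionPodSets(children ...[]PodSet) []PodSet {
	totals := map[string]int32{}
	for _, child := range children {
		for _, ps := range child {
			totals[ps.Name] += ps.Count
		}
	}
	merged := make([]PodSet, 0, len(totals))
	for name, count := range totals {
		merged = append(merged, PodSet{Name: name, Count: count})
	}
	// Sort by name (an assumption of this sketch) for stable output.
	sort.Slice(merged, func(i, j int) bool { return merged[i].Name < merged[j].Name })
	return merged
}
```

With one child the result is that child's counts; with an active and a pending child sharing the same group names, the keys are unchanged and only the counts grow, which is what lets the upgrade look like a scale-up followed by a scale-down.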
Covers the happy path (1 CPU/2 GiB ClusterQueue gating the upgrade's pending RayCluster) and reproduction recipes for the three known limitations: heterogeneous PodSpec on the same group name, manual tampering with the persistent RayClusterSpec.Suspend gate, and the unfinished MultiKueue adapter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
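To make the heterogeneous-PodSpec limitation concrete, the following sketch works through the arithmetic. The `group` type and both functions are hypothetical and only model the behaviour described in the commit messages (counts summed, first child's template used); they are not code from this PR.

```go
package main

// group is a hypothetical view of one same-named worker group in a
// child RayCluster: replica count plus per-pod CPU request (millicores).
type group struct {
	Count    int64
	MilliCPU int64
}

// accountedMilliCPU models the POC behaviour: counts are summed across
// children, but only the first child's per-pod request is used.
func accountedMilliCPU(children []group) int64 {
	if len(children) == 0 {
		return 0
	}
	var total int64
	for _, c := range children {
		total += c.Count
	}
	return total * children[0].MilliCPU
}

// actualMilliCPU is what the pods of all children would really request.
func actualMilliCPU(children []group) int64 {
	var sum int64
	for _, c := range children {
		sum += c.Count * c.MilliCPU
	}
	return sum
}
```

For an active child with one 1000m pod and a pending child with one 2000m pod, the accounted demand is 2000m while the real demand is 3000m (under-accounting). If the larger template comes first, the accounted demand is 4000m (over-accounting). When both children request the same resources, the two values match.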
0bdc7b6 to 990a514
// and the post-upgrade tear-down as a scale-down, without falling back to the
// non-slice path.
//
// POC limitation: when two children share a group name with different PodSpecs
Summary
POC for #11102. Prevents the pending RayCluster created during a zero-downtime RayService upgrade from running before Kueue admits a workload slice that covers its quota demand.
Depends on ray-project/kuberay#4841 (top-level RayService.Spec.Suspend + nested template-suspend semantics). The vendored rayservice_types.go is synced from a local KubeRay checkout with that PR applied.
Design
- RayService.Spec.RayClusterSpec.Suspend=true template gate at creation
- Suspend() sets nested Suspend=true; RunWithPodSetsInfo() deliberately leaves it true
- PodSets() lists live children via the ray.io/originated-from-cr-{name,crd} labels (mirrors KubeRay's RayServiceRayClustersAssociationOptions), unions PodSets by name and sums counts. Same keys across the 1↔2 child transitions so EnsureWorkloadSlices handles upgrade as scale-up and post-upgrade as scale-down
- Reconcile post-step unsuspendAdmittedChildren patches Spec.Suspend=false on each child once the latest workload slice is admitted; a race-guard checks the slice's PodSet counts cover the current children's required counts before patching, so the pending child stays gated while the new slice is still pending
- Suspend() also sets top-level Spec.Suspend=true; KubeRay deletes all owned resources
Prerequisites for testing
- ElasticJobsViaWorkloadSlices=true on the Kueue manager.
- kueue.x-k8s.io/elastic-job: "true" annotation on the RayService.
See POC-TESTING.md for full build/deploy and the happy-path + limitation reproductions.
Known limitations
- PodSets() keeps the first child's template, so resource-changing upgrades under-/over-account quota. Same-resource upgrades (rayVersion / image / env) are exact.
- ... Spec.RayClusterSpec.Suspend, breaking the gate.
- rayservice_multikueue_adapter.go hasn't been taught the new suspend semantics.
Test plan
- ... POC-TESTING.md.
- ... rayVersion bump; releasing quota lets the upgrade complete.
🤖 Generated with Claude Code
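The race-guard described under Design could be sketched roughly as follows. This is an assumption-laden illustration, not the PR's unsuspendAdmittedChildren: the `child` type and `childrenToUnsuspend` function are hypothetical, and PodSet counts are reduced to plain maps keyed by group name.

```go
package main

// child is a hypothetical, simplified view of a child RayCluster.
type child struct {
	Name      string
	Suspended bool
	Groups    map[string]int32 // group name -> required pod count
}

// childrenToUnsuspend returns the names of suspended children that may
// be unsuspended. It returns nil unless the latest slice is admitted
// AND its per-group counts cover the summed requirement of all current
// children. The second check is the race-guard: an older admitted slice
// sized for one child does not cover active+pending, so the pending
// child stays gated until the larger slice is admitted.
func childrenToUnsuspend(sliceAdmitted bool, admitted map[string]int32, children []child) []string {
	if !sliceAdmitted {
		return nil
	}
	required := map[string]int32{}
	for _, c := range children {
		for name, count := range c.Groups {
			required[name] += count
		}
	}
	for name, need := range required {
		if admitted[name] < need {
			return nil
		}
	}
	var names []string
	for _, c := range children {
		if c.Suspended {
			names = append(names, c.Name)
		}
	}
	return names
}
```

In this model, an admitted slice of head=1, workers=2 with an active and a pending child (each head=1, workers=2) returns nothing, while an admitted slice of head=2, workers=4 returns only the pending child.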