Workload Aware Scheduling proof of concept#4723
Conversation
|
@mm4tt as well |
|
FYI we have some E2E tests scaffolded out here in the CAPZ project: kubernetes-sigs/cluster-api-provider-azure#6227 |
|
Thanks for the PR Mark, will find some time to review it soon. Were there any notable friction points or instances where Workload API integration was incompatible or awkward to use in KubeRay? |
Future-Outlier
left a comment
There was a problem hiding this comment.
Hi, @marosset,
Would you mind taking a look at @troychiu’s demo POC for the workload API here?
master...troychiu:kuberay:kubecon-demo-gang-scheduling-dra
I’m wondering why Troy only wrote around 500 lines of changes, while yours has more than 5,000 lines.
~3000 lines are just new tests (unit and e2e) Most of the actual logic is in ray-operator/controllers/ray/native_workload_scheduling.go which is ~500 lines. @Future-Outlier do you have any concerns with specifics of the changes or shock from the PR size? |
I think for POC we just need
|
The immediate high-priority goal of this PR is to prototype and prove that the new Workload API is compatible with KubeRay and identify any changes requried in Workload API. However, I do think we have the intention of merging this PR eventually. I think having the other files is ok with that in mind, I will probably not review all the tests in the near term though if I'm being honest :) |
|
With that said, I'd like to really understand what (if any) feedback we have for upstream Kubernetes for Workload API after implementing the prototype |
|
I can move just the changes needed to get the POC working (plus instructions, kind config, etc) into a new branch and open a new PR for folks just wanting to test it out in a branch. I'm putting together a list of feedback I have for the v1alpha2 api that I'll share here and with the K8s workload-aware-scheduling group. |
|
will put this in my todo list next week |
|
Thanks for putting this together! I want to make sure I’m understanding the design and Troy’s PoC correctly, and also share a few thoughts. My current understanding is that, aside from whether this should go through the batch scheduler interface for lifecycle management, the main difference between this PR and Troy’s PoC is the PodGroup granularity. In this PR, it looks like each worker group gets its own PodGroup (plus one for the head), while in Troy’s implementation the whole RayCluster is treated as a single PodGroup. That feels closer to the existing co-scheduling / scheduler-plugins model, where the cluster is effectively scheduled as a single unit. One thing I’m wondering about is whether the current Workload API semantics make one approach a better fit than the other. Based on my reading, gang scheduling seems to be applied per PodGroup, rather than across multiple PodGroups as a single all-or-nothing unit. If that understanding is correct, then if the goal is to schedule the whole RayCluster together and make this behavior available to users today, the single-PodGroup approach seems to map more directly to the current API behavior. On the other hand, if the Workload API eventually supports gang scheduling across multiple PodGroups, then having one PodGroup per worker group would also make sense, especially if there are use cases where users may want gang scheduling semantics at the worker-group level rather than the whole cluster level or supporting Topology-Aware scheduling for each worker group. Does that sound right? Please feel free to correct me if I’m missing anything. And if this tradeoff has already been discussed somewhere, I’d really appreciate a pointer. Thanks! |
Right now the current workload API does not support gang scheduling across multiple PodGroups but that is planned for the very near future (I believe the next K8s release). A big motivation for this work (vs Troy's) was to help validate/shape the workload APIs and after some discussion (in a doc linked to in the issue) the Workload API authors and a few others decided to assign each workgroup to it's own PodGroup and then evolve the Workload APIs. There was also a fair bit of discussion on how autoscaling of work groups would be handled and improvements needed for this are planned for the next K8s release. |
|
@Future-Outlier I'm going to be out of office for a few days. I'll clean up the PR when I get back early next week. thanks |
sure no problem, thank you! |
|
Hey, thanks for putting this together!
Really looking forward to this! Regarding PodGroup per RayCluster vs PodGroup per worker group, we definitely recommend the PodGroup per worker group approach. While it currently doesn't have a way to express "gang-of gangs", it's only a temporary limitation and this approach better aligns with future evolution of Workload/PodGroup APIs. Gang scheduling across multiple PodGroups will be available in 1.37, together with multi-level TAS and other features. We started working on that in kubernetes/enhancements#6017. Having PodGroup per worker group that are (in 1.37) grouped together by CompositePodGroup will allow you to fully utilize WAS capabilities. One example that may be interesting to you is TAS and ability to place different worker groups in different topology domains (e.g. different racks). Regarding PodGroup per RayCluster vs. PodGroup per worker group, we recommend the PodGroup per worker group approach. While it currently lacks a way to express "gang-of-gangs", it's only a temporary limitation, and this approach better aligns with the future evolution of the Workload and PodGroup APIs. Gang scheduling across multiple PodGroups will be available in 1.37, together with multi-level TAS and other features. We started working on that in kubernetes/enhancements#6017. Having a PodGroup per worker group, which will be grouped together by a CompositePodGroup in 1.37, will allow you to fully utilize WAS capabilities. One example that may be interesting to you is TAS and the ability to place different worker groups in different topology domains (e.g., different racks). |
|
|
||
| ### Spec drift detection | ||
|
|
||
| If you change the RayCluster spec (add/remove worker groups, change replica counts), the operator detects the mismatch, deletes the stale Workload and PodGroups, and recreates them from the updated spec. |
There was a problem hiding this comment.
That also means the pods are recreated, right? This is likely fine for PoC/alpha, but it's not something we want to end up with.
We'll definitely support MinCount mutability in 1.37. The support for adding/removing PodGroupTemplates will likely come later but this shows it's an important one. @macsko who looks into that
|
|
||
| ### Suspend and resume | ||
|
|
||
| When a RayCluster is suspended, the operator deletes the Workload and PodGroups alongside the pods. On resume, fresh scheduling resources are created with the current spec. |
There was a problem hiding this comment.
I think deleting PodGroups is the right thing to do. However, Workload is intended to represent a long living user intent (scheduling configuration) and it definitely doesn't have to be deleted on suspension. Right now this might be a workaround for lack of mutability, but I think once we have that solves it would be completely OK to couple Workload object lifecycle with RayCluster object
| > **Note**: This feature is in early alpha. Both the Kubernetes `scheduling.k8s.io/v1alpha2` API and the KubeRay integration are under active development. Notably, autoscaling is not supported — only fixed-size worker groups are compatible. | ||
|
|
||
| - **No autoscaling support**: RayClusters with autoscaling enabled (`enableInTreeAutoscaling: true`) will skip native scheduling with a warning event. Fixed-size worker groups only. | ||
| - **Max 7 worker groups**: The `scheduling.k8s.io/v1alpha2` API allows at most 8 PodGroupTemplates per Workload (1 reserved for the head group). |
There was a problem hiding this comment.
This limit is currently arbitrary and overall we're ok with increasing it.
What would be a reasonable limit for the RayCluster?
|
|
||
| - **No autoscaling support**: RayClusters with autoscaling enabled (`enableInTreeAutoscaling: true`) will skip native scheduling with a warning event. Fixed-size worker groups only. | ||
| - **Max 7 worker groups**: The `scheduling.k8s.io/v1alpha2` API allows at most 8 PodGroupTemplates per Workload (1 reserved for the head group). | ||
| - **Per-worker-group atomicity only**: Each worker group is scheduled as an independent gang. There is no cross-worker-group atomicity (e.g., "schedule all GPU workers AND all CPU workers or none"). |
There was a problem hiding this comment.
We're working on adding multi-level PodGroup hierarchy in 1.37 that solves exactly this problem in kubernetes/enhancements#6017
| - **Max 7 worker groups**: The `scheduling.k8s.io/v1alpha2` API allows at most 8 PodGroupTemplates per Workload (1 reserved for the head group). | ||
| - **Per-worker-group atomicity only**: Each worker group is scheduled as an independent gang. There is no cross-worker-group atomicity (e.g., "schedule all GPU workers AND all CPU workers or none"). | ||
| - **Mutually exclusive with batch schedulers**: Cannot be used together with `batchScheduler` configuration (Volcano, YuniKorn, etc.). The operator will refuse to start if both are enabled. | ||
| - **Immutable `schedulingGroup` on pods**: The `spec.schedulingGroup` field on pods is immutable. If you enable native scheduling on an already-running cluster, existing pods will not get `schedulingGroup` set. New pods (from scale-up, recreation, or suspend/resume) will be correctly configured. |
There was a problem hiding this comment.
Correct, and this is not something we plan to change. If we want to support enabling "workload scheduling" on already-running cluster I think the only option we have is to recreate all the pods. Alternatively, we can disallow that and only allow setting this for new clusters.
|
|
||
| ### Pods stuck in PreEnqueue | ||
|
|
||
| If the kube-apiserver's `GenericWorkload` feature gate is enabled but the kube-scheduler's `GangScheduling` feature gate is **not**, the operator will successfully create Workload and PodGroup resources, but pods will remain stuck in the `PreEnqueue` scheduling gate. |
There was a problem hiding this comment.
This problem will eventually disappear once we promote the GenericWorkload feature gate to be enabled by default.
There was a problem hiding this comment.
yup, i just wanted to call that out because I got stuck :P
|
|
||
| const ( | ||
| // Annotation used to opt-in a RayCluster to native workload scheduling. | ||
| NativeWorkloadSchedulingAnnotation = "ray.io/native-workload-scheduling" |
There was a problem hiding this comment.
Annotation for opt-in is fine in PoC/alpha but eventually we should converge on RayCluster side API for enabling/disabling and consuming more advanced WAS features (like topology, preemption, etc.). I started brainstorming that in https://docs.google.com/document/d/1VG7Zto9JYuPG4Anb01WMRryJlfV6met0jgob3T2NjZ4/edit?tab=t.0#heading=h.3s8c0yl3azl9. The doc is not really up2date with where we are, but I plan to open a KEP for that in the coming weeks.
There was a problem hiding this comment.
agree, the plan is to have an API field on RayCluster to enable/disable this advanced scheduling
| // podGroupProtectionFinalizer is the finalizer added by the Kubernetes scheduler to PodGroups | ||
| // when the GangScheduling feature gate is enabled on the scheduler (alpha in K8s 1.35). | ||
| // We remove it explicitly before deleting PodGroups so that deletion is immediate | ||
| // rather than waiting for the scheduler to process the finalizer removal. |
There was a problem hiding this comment.
We have this finalizer so PodGroup cannot be removed until all pods pointing to it are removed. Why cannot you just first delete pods and then PodGroup?
|
Given the current lack of gang scheduling across different PodGroups, I question the utility of pursuing a "PodGroup per worker group" approach, as the primary gang scheduling functionality still wouldn't be operational. Alternatively, we could adopt a "PodGroup per RayCluster" strategy and make necessary updates after the 1.37 enhancement is complete. What are your thoughts on this? |
|
/cc @tosi3k |
@seanlaii @troychiu Here is the KEP for gang of gang scheduling PodGroup per RayCluster would probably make more sense for ray usage but might not be as useful in validating the workload APIs in Kubernetes. |
I just saw above that @mm4tt is also hoping to keep PodGroup per WorkerGroup
|
I feel like both approaches can be done concurrently, since the "PodGroup per RayCluster" approach can be implemented behind the batch scheduler interface and won't conflict with the work here. |
|
I'll try to find time take a look before the meeting in 12 hours. |
UPDATE: we can let user configure the scheduling behavior UPDATE: this can cover most usecases, since the most common usecase is 1 raycluster = 1 head pod + 1 worker group |
| }) | ||
|
|
||
| for _, wg := range instance.Spec.WorkerGroupSpecs { | ||
| minCount := utils.GetWorkerGroupDesiredReplicas(wg) |
There was a problem hiding this comment.
@andrewsykim - the current PoC uses GetWorkerGroupDesiredReplias() which should account for numHosts
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
…eduling Signed-off-by: Mark Rossetti <marosset@microsoft.com>
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 7cc892c. Configure here.
| "RayCluster has %d worker groups, but native workload scheduling supports at most %d (%d PodGroupTemplates total, 1 reserved for head)", | ||
| len(instance.Spec.WorkerGroupSpecs), maxWorkerGroups, schedulingv1alpha2.WorkloadMaxPodGroupTemplates) | ||
| return fmt.Errorf("RayCluster %s/%s has %d worker groups, exceeding the maximum of %d for native workload scheduling", | ||
| instance.Namespace, instance.Name, len(instance.Spec.WorkerGroupSpecs), maxWorkerGroups) |
There was a problem hiding this comment.
Too many worker groups error blocks all pod reconciliation
Medium Severity
When skipReasonTooManyWorkerGroups is triggered, reconcileNativeWorkloadScheduling returns an error rather than returning nil with a warning event like the other skip reasons (autoscaling, batch scheduler). Since this error propagates from reconcilePods, it blocks ALL pod reconciliation — including scaling, deletion, and health management — creating an infinite error-requeue loop that makes the cluster unmanageable. The cluster becomes stuck if a user enables the annotation but has more than 7 worker groups.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 7cc892c. Configure here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7cc892c5b8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if skipReason == skipReasonDisabled { | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Remove scheduling artifacts when native scheduling is disabled
When nativeSchedulingSkipReason becomes skipReasonDisabled (for example, the user removes ray.io/native-workload-scheduling: "true" after previously enabling it), this path returns immediately and leaves existing Workload/PodGroup objects behind. Those stale PodGroups can keep the scheduler’s podgroup-protection finalizer and are no longer explicitly cleaned up, because other cleanup paths are gated on native scheduling being currently enabled. This creates orphaned scheduling resources and can leave deletion flows stuck in terminating states instead of fully converging.
Useful? React with 👍 / 👎.


Why are these changes needed?
This PR adds support for the Kubernetes native workload aware scheduling.
This is part of
#4344
And a design proposal is also available at
https://docs.google.com/document/d/1I9MtPkBMIj-67ee8abFK3_jS-JcosW1vnbzeDQ6MaKg
important note
Many of the changes in here are related to the golang / k8s client-go version bumps needed to use the new scheduling APIs introduced in K8s v1.36.
I have a seperate set of changes for this at #4703 and will drop the first several commits from this PR and rebase once that change merges.
Checks