Workload Aware Scheduling proof of concept by marosset · Pull Request #4723 · ray-project/kuberay

marosset · 2026-04-16T18:03:24Z

Why are these changes needed?

This PR adds support for the Kubernetes native workload aware scheduling.

This is part of
#4344

And a design proposal is also available at

https://docs.google.com/document/d/1I9MtPkBMIj-67ee8abFK3_jS-JcosW1vnbzeDQ6MaKg

important note

Many of the changes in here are related to the golang / k8s client-go version bumps needed to use the new scheduling APIs introduced in K8s v1.36.

I have a separate set of changes for this at #4703 and will drop the first several commits from this PR and rebase once that change merges.

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(

marosset · 2026-04-16T18:05:10Z

cc @Future-Outlier @jackfrancis @troychiu @seanlaii @helayoty @andrewsykim

marosset · 2026-04-16T18:06:39Z

@mm4tt as well

jackfrancis · 2026-04-16T18:20:20Z

FYI we have some E2E tests scaffolded out here in the CAPZ project: kubernetes-sigs/cluster-api-provider-azure#6227

andrewsykim · 2026-04-16T19:36:47Z

Thanks for the PR Mark, will find some time to review it soon.

Were there any notable friction points or instances where Workload API integration was incompatible or awkward to use in KubeRay?

Future-Outlier

Hi, @marosset,

Would you mind taking a look at @troychiu’s demo POC for the workload API here?
master...troychiu:kuberay:kubecon-demo-gang-scheduling-dra

I’m wondering why Troy only wrote around 500 lines of changes, while yours has more than 5,000 lines.

marosset · 2026-04-16T22:10:48Z

~3000 lines are just new tests (unit and e2e):
- 2000+ in ray-operator/controllers/ray/native_workload_scheduling_test.go
- another 900 in ray-operator/test/e2enativescheduling/raycluster_nativescheduling_test.go

The PR for the Go updates is also ~1200 lines (due to all of the ptr.To(...) -> new(...) rewrites applied by golangci-lint --fix as part of the Go v1.25 -> v1.26 bump).

Most of the actual logic is in ray-operator/controllers/ray/native_workload_scheduling.go which is ~500 lines.

@Future-Outlier do you have any concerns with specifics of the changes or shock from the PR size?

Future-Outlier · 2026-04-16T22:15:55Z

I think for the POC we just need:

- ray-operator/controllers/ray/native_workload_scheduling.go
- an example that reviewers can run on kind (a k8s cluster)

Since this PR is not going to be merged (just a POC), the other files seem unnecessary.

andrewsykim · 2026-04-17T01:45:07Z

> Since this PR is not going to be merged (just a POC), the other files seem unnecessary.

The immediate high-priority goal of this PR is to prototype and prove that the new Workload API is compatible with KubeRay and to identify any changes required in the Workload API. However, I do think we intend to merge this PR eventually. With that in mind, I think having the other files is OK, though I will probably not review all the tests in the near term, if I'm being honest :)

andrewsykim · 2026-04-17T01:46:19Z

With that said, I'd like to really understand what (if any) feedback we have for upstream Kubernetes for Workload API after implementing the prototype

marosset · 2026-04-17T17:50:47Z

I can move just the changes needed to get the POC working (plus instructions, kind config, etc) into a new branch and open a new PR for folks just wanting to test it out in a branch.

I'm putting together a list of feedback I have for the v1alpha2 API that I'll share here and with the K8s workload-aware-scheduling group.
I do think that introducing mutability (as planned for K8s v1.37) will help immensely.
I did run into some issues where some of the resources weren't being cleaned up as expected. I was able to work around them, and I'll provide more details later (probably early next week).

Future-Outlier · 2026-04-17T22:43:42Z

will put this in my todo list next week

seanlaii · 2026-04-18T05:34:51Z

Thanks for putting this together! I want to make sure I’m understanding the design and Troy’s PoC correctly, and also share a few thoughts.

My current understanding is that, aside from whether this should go through the batch scheduler interface for lifecycle management, the main difference between this PR and Troy’s PoC is the PodGroup granularity.

In this PR, it looks like each worker group gets its own PodGroup (plus one for the head), while in Troy’s implementation the whole RayCluster is treated as a single PodGroup. That feels closer to the existing co-scheduling / scheduler-plugins model, where the cluster is effectively scheduled as a single unit.

One thing I’m wondering about is whether the current Workload API semantics make one approach a better fit than the other. Based on my reading, gang scheduling seems to be applied per PodGroup, rather than across multiple PodGroups as a single all-or-nothing unit. If that understanding is correct, then if the goal is to schedule the whole RayCluster together and make this behavior available to users today, the single-PodGroup approach seems to map more directly to the current API behavior.

On the other hand, if the Workload API eventually supports gang scheduling across multiple PodGroups, then having one PodGroup per worker group would also make sense, especially if there are use cases where users may want gang scheduling semantics at the worker-group level rather than the whole cluster level or supporting Topology-Aware scheduling for each worker group.

Does that sound right? Please feel free to correct me if I’m missing anything. And if this tradeoff has already been discussed somewhere, I’d really appreciate a pointer. Thanks!
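
The two granularities being compared can be sketched as follows. This is only an illustration of the trade-off, not code from either PoC: `WorkerGroup`, `PodGroupPlan`, `perWorkerGroup`, and `perCluster` are hypothetical names, and the naming scheme for the groups is an assumption.

```go
package main

import "fmt"

// WorkerGroup is a hypothetical, simplified stand-in for a RayCluster worker group.
type WorkerGroup struct {
	Name     string
	Replicas int32
}

// PodGroupPlan is a hypothetical description of one gang: a name and the
// minimum number of pods that must be schedulable together.
type PodGroupPlan struct {
	Name     string
	MinCount int32
}

// perWorkerGroup returns one gang for the head plus one gang per worker group.
// Each gang is all-or-nothing on its own; nothing ties the gangs together.
func perWorkerGroup(cluster string, groups []WorkerGroup) []PodGroupPlan {
	plans := []PodGroupPlan{{Name: cluster + "-head", MinCount: 1}}
	for _, g := range groups {
		plans = append(plans, PodGroupPlan{Name: cluster + "-" + g.Name, MinCount: g.Replicas})
	}
	return plans
}

// perCluster returns a single gang covering the head and every worker pod,
// so the whole cluster is scheduled as one all-or-nothing unit.
func perCluster(cluster string, groups []WorkerGroup) []PodGroupPlan {
	total := int32(1) // the head pod
	for _, g := range groups {
		total += g.Replicas
	}
	return []PodGroupPlan{{Name: cluster, MinCount: total}}
}

func main() {
	groups := []WorkerGroup{{Name: "gpu", Replicas: 4}, {Name: "cpu", Replicas: 2}}
	fmt.Println(perWorkerGroup("demo", groups))
	fmt.Println(perCluster("demo", groups))
}
```

With the per-worker-group layout, the "gpu" gang could be admitted while the "cpu" gang is still pending, which is the gap described above.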

marosset · 2026-04-21T23:13:51Z

Right now the Workload API does not support gang scheduling across multiple PodGroups, but that is planned for the very near future (I believe the next K8s release). A big motivation for this work (vs. Troy's) was to help validate and shape the Workload APIs. After some discussion (in a doc linked in the issue), the Workload API authors and a few others decided to assign each worker group to its own PodGroup and then evolve the Workload APIs. There was also a fair bit of discussion on how autoscaling of worker groups would be handled, and the improvements needed for this are planned for the next K8s release.

marosset · 2026-04-21T23:17:29Z

@Future-Outlier I'm going to be out of office for a few days. I'll clean up the PR when I get back early next week. thanks

Future-Outlier · 2026-04-22T03:29:33Z

> @Future-Outlier I'm going to be out of office for a few days. I'll clean up the PR when I get back early next week. thanks

sure no problem, thank you!
we can schedule a meeting to talk about this with kubernetes maintainers when you are back, tks!

mm4tt · 2026-04-23T17:07:53Z

Hey, thanks for putting this together!

> I'm putting together a list of feedback I have for the v1alpha2 api that I'll share here and with the K8s workload-aware-scheduling group.

Really looking forward to this!

Regarding PodGroup per RayCluster vs. PodGroup per worker group, we recommend the PodGroup per worker group approach. While it currently lacks a way to express "gang-of-gangs", it's only a temporary limitation, and this approach better aligns with the future evolution of the Workload and PodGroup APIs. Gang scheduling across multiple PodGroups will be available in 1.37, together with multi-level TAS and other features. We started working on that in kubernetes/enhancements#6017. Having a PodGroup per worker group, which will be grouped together by a CompositePodGroup in 1.37, will allow you to fully utilize WAS capabilities. One example that may be interesting to you is TAS and the ability to place different worker groups in different topology domains (e.g., different racks).

seanlaii · 2026-04-23T17:23:27Z

@marosset @mm4tt Thank you both for the detailed explanation and the context on the roadmap!
Given the upcoming support for CompositePodGroup and multi-level TAS in the near future, adopting the "PodGroup per worker group" approach definitely makes the most sense.

mm4tt · 2026-04-23T17:12:47Z

+
+### Spec drift detection
+
+If you change the RayCluster spec (add/remove worker groups, change replica counts), the operator detects the mismatch, deletes the stale Workload and PodGroups, and recreates them from the updated spec.


That also means the pods are recreated, right? This is likely fine for PoC/alpha, but it's not something we want to end up with.

We'll definitely support MinCount mutability in 1.37. The support for adding/removing PodGroupTemplates will likely come later but this shows it's an important one. @macsko who looks into that
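
A minimal sketch of the drift check described in the quoted docs, assuming the desired and existing state can each be reduced to a map from PodGroup template name to `minCount`. The `specDrifted` helper and the map representation are hypothetical simplifications, not the PR's actual comparison logic.

```go
package main

import "fmt"

// specDrifted reports whether the desired gangs (template name -> minCount)
// differ from the ones recorded on the existing Workload. Both maps are
// hypothetical simplifications of the real objects.
func specDrifted(desired, existing map[string]int32) bool {
	if len(desired) != len(existing) {
		return true // a worker group was added or removed
	}
	for name, want := range desired {
		got, ok := existing[name]
		if !ok || got != want {
			return true // a renamed group or a changed replica count
		}
	}
	return false
}

func main() {
	existing := map[string]int32{"head": 1, "gpu": 4}
	fmt.Println(specDrifted(map[string]int32{"head": 1, "gpu": 4}, existing))
	fmt.Println(specDrifted(map[string]int32{"head": 1, "gpu": 8}, existing))
}
```

Once `minCount` becomes mutable, a replica-count change could presumably be handled with an in-place update instead of delete-and-recreate, leaving only added or removed groups on the recreate path.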

mm4tt · 2026-04-23T17:14:52Z

+
+### Suspend and resume
+
+When a RayCluster is suspended, the operator deletes the Workload and PodGroups alongside the pods. On resume, fresh scheduling resources are created with the current spec.


I think deleting PodGroups is the right thing to do. However, a Workload is intended to represent long-lived user intent (the scheduling configuration), and it definitely doesn't have to be deleted on suspension. Right now this might be a workaround for the lack of mutability, but I think once we have that solved, it would be completely OK to couple the Workload object's lifecycle with the RayCluster object.

mm4tt · 2026-04-23T17:15:49Z

+> **Note**: This feature is in early alpha. Both the Kubernetes `scheduling.k8s.io/v1alpha2` API and the KubeRay integration are under active development. Notably, autoscaling is not supported — only fixed-size worker groups are compatible.
+
+- **No autoscaling support**: RayClusters with autoscaling enabled (`enableInTreeAutoscaling: true`) will skip native scheduling with a warning event. Fixed-size worker groups only.
+- **Max 7 worker groups**: The `scheduling.k8s.io/v1alpha2` API allows at most 8 PodGroupTemplates per Workload (1 reserved for the head group).


This limit is currently arbitrary and overall we're ok with increasing it.

What would be a reasonable limit for the RayCluster?

mm4tt · 2026-04-23T17:16:40Z

+
+- **No autoscaling support**: RayClusters with autoscaling enabled (`enableInTreeAutoscaling: true`) will skip native scheduling with a warning event. Fixed-size worker groups only.
+- **Max 7 worker groups**: The `scheduling.k8s.io/v1alpha2` API allows at most 8 PodGroupTemplates per Workload (1 reserved for the head group).
+- **Per-worker-group atomicity only**: Each worker group is scheduled as an independent gang. There is no cross-worker-group atomicity (e.g., "schedule all GPU workers AND all CPU workers or none").


We're working on adding multi-level PodGroup hierarchy in 1.37 that solves exactly this problem in kubernetes/enhancements#6017

mm4tt · 2026-04-23T17:18:04Z

+- **Max 7 worker groups**: The `scheduling.k8s.io/v1alpha2` API allows at most 8 PodGroupTemplates per Workload (1 reserved for the head group).
+- **Per-worker-group atomicity only**: Each worker group is scheduled as an independent gang. There is no cross-worker-group atomicity (e.g., "schedule all GPU workers AND all CPU workers or none").
+- **Mutually exclusive with batch schedulers**: Cannot be used together with `batchScheduler` configuration (Volcano, YuniKorn, etc.). The operator will refuse to start if both are enabled.
+- **Immutable `schedulingGroup` on pods**: The `spec.schedulingGroup` field on pods is immutable. If you enable native scheduling on an already-running cluster, existing pods will not get `schedulingGroup` set. New pods (from scale-up, recreation, or suspend/resume) will be correctly configured.


Correct, and this is not something we plan to change. If we want to support enabling "workload scheduling" on already-running cluster I think the only option we have is to recreate all the pods. Alternatively, we can disallow that and only allow setting this for new clusters.

mm4tt · 2026-04-23T17:23:52Z

+
+### Pods stuck in PreEnqueue
+
+If the kube-apiserver's `GenericWorkload` feature gate is enabled but the kube-scheduler's `GangScheduling` feature gate is **not**, the operator will successfully create Workload and PodGroup resources, but pods will remain stuck in the `PreEnqueue` scheduling gate.


This problem will eventually disappear once we promote the GenericWorkload feature gate to be enabled by default.

yup, i just wanted to call that out because I got stuck :P

mm4tt · 2026-04-23T17:26:22Z

+
+const (
+	// Annotation used to opt-in a RayCluster to native workload scheduling.
+	NativeWorkloadSchedulingAnnotation = "ray.io/native-workload-scheduling"


Annotation for opt-in is fine in PoC/alpha but eventually we should converge on RayCluster side API for enabling/disabling and consuming more advanced WAS features (like topology, preemption, etc.). I started brainstorming that in https://docs.google.com/document/d/1VG7Zto9JYuPG4Anb01WMRryJlfV6met0jgob3T2NjZ4/edit?tab=t.0#heading=h.3s8c0yl3azl9. The doc is not really up2date with where we are, but I plan to open a KEP for that in the coming weeks.

agree, the plan is to have an API field on RayCluster to enable/disable this advanced scheduling
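
For reference, the annotation-based opt-in amounts to a check like the one below. The annotation key comes from the diff above and the `"true"` value from the automated review later in this thread; the helper name `nativeSchedulingRequested` and the exact-match rule are assumptions.

```go
package main

import "fmt"

// NativeWorkloadSchedulingAnnotation is the opt-in annotation key shown in the diff.
const NativeWorkloadSchedulingAnnotation = "ray.io/native-workload-scheduling"

// nativeSchedulingRequested is a hypothetical helper. It treats the cluster as
// opted in only when the annotation is set to exactly "true"; the real
// controller may parse the value differently.
func nativeSchedulingRequested(annotations map[string]string) bool {
	// Reading from a nil map is safe in Go and yields the empty string.
	return annotations[NativeWorkloadSchedulingAnnotation] == "true"
}

func main() {
	fmt.Println(nativeSchedulingRequested(map[string]string{NativeWorkloadSchedulingAnnotation: "true"}))
	fmt.Println(nativeSchedulingRequested(nil))
}
```

A dedicated API field on RayCluster would replace this string comparison with a typed, validated setting.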

mm4tt · 2026-04-23T17:27:42Z

+	// podGroupProtectionFinalizer is the finalizer added by the Kubernetes scheduler to PodGroups
+	// when the GangScheduling feature gate is enabled on the scheduler (alpha in K8s 1.35).
+	// We remove it explicitly before deleting PodGroups so that deletion is immediate
+	// rather than waiting for the scheduler to process the finalizer removal.


We have this finalizer so PodGroup cannot be removed until all pods pointing to it are removed. Why cannot you just first delete pods and then PodGroup?

troychiu · 2026-04-23T18:25:23Z

Given the current lack of gang scheduling across different PodGroups, I question the utility of pursuing a "PodGroup per worker group" approach, as the primary gang scheduling functionality still wouldn't be operational. Alternatively, we could adopt a "PodGroup per RayCluster" strategy and make necessary updates after the 1.37 enhancement is complete. What are your thoughts on this?

mm4tt · 2026-04-28T10:51:01Z

/cc @tosi3k

marosset · 2026-04-28T16:47:11Z

> Given the current lack of gang scheduling across different PodGroups, I question the utility of pursuing a "PodGroup per worker group" approach, as the primary gang scheduling functionality still wouldn't be operational. Alternatively, we could adopt a "PodGroup per RayCluster" strategy and make necessary updates after the 1.37 enhancement is complete. What are your thoughts on this?

@seanlaii @troychiu Here is the KEP for gang of gang scheduling
kubernetes/enhancements#6017

PodGroup per RayCluster would probably make more sense for ray usage but might not be as useful in validating the workload APIs in Kubernetes.
I'm OK either way and it really depends which we want to prioritize (i was leaning more towards K8s API validation but can switch).
cc @mm4tt @andrewsykim

marosset · 2026-04-28T16:50:00Z

I just saw above that @mm4tt is also hoping to keep PodGroup per WorkerGroup

> Regarding PodGroup per RayCluster vs. PodGroup per worker group, we recommend the PodGroup per worker group approach. While it currently lacks a way to express "gang-of-gangs", it's only a temporary limitation, and this approach better aligns with the future evolution of the Workload and PodGroup APIs. Gang scheduling across multiple PodGroups will be available in 1.37, together with multi-level TAS and other features. We started working on that in kubernetes/enhancements#6017. Having a PodGroup per worker group, which will be grouped together by a CompositePodGroup in 1.37, will allow you to fully utilize WAS capabilities. One example that may be interesting to you is TAS and the ability to place different worker groups in different topology domains (e.g., different racks).

rueian · 2026-04-29T23:56:07Z

> Given the current lack of gang scheduling across different PodGroups, I question the utility of pursuing a "PodGroup per worker group" approach, as the primary gang scheduling functionality still wouldn't be operational. Alternatively, we could adopt a "PodGroup per RayCluster" strategy and make necessary updates after the 1.37 enhancement is complete. What are your thoughts on this?

I feel like both approaches can be done concurrently, since the "PodGroup per RayCluster" approach can be implemented behind the batch scheduler interface and won't conflict with the work here.

Future-Outlier · 2026-05-07T03:34:24Z

I'll try to find time to take a look before the meeting in 12 hours.

Future-Outlier · 2026-05-07T15:35:08Z

Quick questions:

1. Will the Workload API be enabled by default in Kubernetes in the future? If so, this would break backward compatibility with the current KubeRay scheduler behavior.

UPDATE: we can let users configure the scheduling behavior.

2. If we go with the "one PodGroup per worker group" integration, is this considered a best practice? Is any open source project actually using it this way?

UPDATE: this can cover most use cases, since the most common use case is 1 RayCluster = 1 head pod + 1 worker group.

marosset · 2026-05-07T17:45:42Z

+	})
+
+	for _, wg := range instance.Spec.WorkerGroupSpecs {
+		minCount := utils.GetWorkerGroupDesiredReplicas(wg)


@andrewsykim - the current PoC uses GetWorkerGroupDesiredReplicas(), which should account for numHosts
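
To illustrate why `numHosts` matters for `minCount`, here is a rough sketch of how a per-group pod count could be derived. It is not KubeRay's implementation: the `workerGroup` type and `desiredPods` function are hypothetical, and the clamping and suspend rules are assumptions about what such a helper would do.

```go
package main

import "fmt"

// workerGroup is a hypothetical, trimmed-down worker group spec.
type workerGroup struct {
	Replicas    int32
	MinReplicas int32
	MaxReplicas int32
	NumOfHosts  int32 // pods per replica, e.g. for multi-host accelerators
	Suspended   bool
}

// desiredPods clamps Replicas into [MinReplicas, MaxReplicas] and multiplies
// the result by the number of hosts per replica. A gang's minCount has to
// count pods, not replicas, or multi-host groups would be admitted too early.
func desiredPods(wg workerGroup) int32 {
	if wg.Suspended {
		return 0
	}
	replicas := wg.Replicas
	if replicas < wg.MinReplicas {
		replicas = wg.MinReplicas
	}
	if replicas > wg.MaxReplicas {
		replicas = wg.MaxReplicas
	}
	hosts := wg.NumOfHosts
	if hosts < 1 {
		hosts = 1 // assume an unset value means single-host replicas
	}
	return replicas * hosts
}

func main() {
	wg := workerGroup{Replicas: 2, MinReplicas: 1, MaxReplicas: 4, NumOfHosts: 4}
	fmt.Println(desiredPods(wg))
}
```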

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 7cc892c.

cursor · 2026-05-11T22:09:38Z

+				"RayCluster has %d worker groups, but native workload scheduling supports at most %d (%d PodGroupTemplates total, 1 reserved for head)",
+				len(instance.Spec.WorkerGroupSpecs), maxWorkerGroups, schedulingv1alpha2.WorkloadMaxPodGroupTemplates)
+			return fmt.Errorf("RayCluster %s/%s has %d worker groups, exceeding the maximum of %d for native workload scheduling",
+				instance.Namespace, instance.Name, len(instance.Spec.WorkerGroupSpecs), maxWorkerGroups)


Too many worker groups error blocks all pod reconciliation

Medium Severity

When skipReasonTooManyWorkerGroups is triggered, reconcileNativeWorkloadScheduling returns an error rather than returning nil with a warning event like the other skip reasons (autoscaling, batch scheduler). Since this error propagates from reconcilePods, it blocks ALL pod reconciliation — including scaling, deletion, and health management — creating an infinite error-requeue loop that makes the cluster unmanageable. The cluster becomes stuck if a user enables the annotation but has more than 7 worker groups.

Additional Locations (1)

ray-operator/controllers/ray/raycluster_controller.go#L708-L711
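
One way to address this finding, sketched under assumptions: treat "too many worker groups" like the other skip reasons, record a warning, and return nil so pod reconciliation continues. The names below (`skipReasonFor`, `reconcileScheduling`, `eventRecorder`) are hypothetical stand-ins and do not reproduce the PR's functions or the Kubernetes event recorder API.

```go
package main

import "fmt"

type skipReason string

const (
	skipReasonNone                skipReason = ""
	skipReasonTooManyWorkerGroups skipReason = "TooManyWorkerGroups"
)

// maxWorkerGroups follows the limit quoted in the docs: 8 templates, 1 for the head.
const maxWorkerGroups = 8 - 1

// eventRecorder is a hypothetical minimal stand-in for a Kubernetes event recorder.
type eventRecorder struct{ warnings []string }

func (r *eventRecorder) Warn(msg string) { r.warnings = append(r.warnings, msg) }

// skipReasonFor decides whether native scheduling must be skipped.
func skipReasonFor(workerGroups int) skipReason {
	if workerGroups > maxWorkerGroups {
		return skipReasonTooManyWorkerGroups
	}
	return skipReasonNone
}

// reconcileScheduling returns nil on a skip so the caller keeps reconciling
// pods; the problem is surfaced through a warning event instead of an error.
func reconcileScheduling(workerGroups int, rec *eventRecorder) error {
	if reason := skipReasonFor(workerGroups); reason != skipReasonNone {
		rec.Warn(fmt.Sprintf("skipping native workload scheduling: %s (%d worker groups, max %d)",
			reason, workerGroups, maxWorkerGroups))
		return nil
	}
	// Workload and PodGroup creation would happen here.
	return nil
}

func main() {
	rec := &eventRecorder{}
	fmt.Println(reconcileScheduling(9, rec), rec.warnings)
}
```

The trade-off is that the cluster then runs without gang scheduling, so the warning has to be visible enough for users to notice.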

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7cc892c5b8

chatgpt-codex-connector · 2026-05-11T22:12:53Z

+		if skipReason == skipReasonDisabled {
+			return nil
+		}


Remove scheduling artifacts when native scheduling is disabled

When nativeSchedulingSkipReason becomes skipReasonDisabled (for example, the user removes ray.io/native-workload-scheduling: "true" after previously enabling it), this path returns immediately and leaves existing Workload/PodGroup objects behind. Those stale PodGroups can keep the scheduler’s podgroup-protection finalizer and are no longer explicitly cleaned up, because other cleanup paths are gated on native scheduling being currently enabled. This creates orphaned scheduling resources and can leave deletion flows stuck in terminating states instead of fully converging.
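
The suggested behavior can be expressed as a small decision function: when scheduling is disabled but resources from an earlier enablement still exist, clean them up instead of returning early. This is a sketch only; `action`, `nextAction`, and the `ownedResources` count are hypothetical and not part of the PR.

```go
package main

import "fmt"

// action is a hypothetical label for what the reconciler should do next.
type action string

const (
	actionNone    action = "none"
	actionEnsure  action = "ensure"  // create or update the Workload and PodGroups
	actionCleanup action = "cleanup" // delete leftovers from a previous enablement
)

// nextAction picks the reconciler's next step. ownedResources is the number of
// Workload and PodGroup objects still owned by the RayCluster.
func nextAction(enabled bool, ownedResources int) action {
	switch {
	case enabled:
		return actionEnsure
	case ownedResources > 0:
		return actionCleanup
	default:
		return actionNone
	}
}

func main() {
	fmt.Println(nextAction(true, 0))
	fmt.Println(nextAction(false, 3))
	fmt.Println(nextAction(false, 0))
}
```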

Future-Outlier requested changes Apr 16, 2026

Future-Outlier moved this to this week in @Future-Outlier's kuberay project Apr 19, 2026

Future-Outlier added this to @Future-Outlier's kuberay project Apr 19, 2026

mm4tt reviewed Apr 23, 2026

marosset force-pushed the workload-poc branch from 2e8fef5 to a04b0a8 Compare May 4, 2026 20:35

rueian requested a review from Future-Outlier May 6, 2026 22:22

marosset commented May 7, 2026

marosset added 7 commits May 11, 2026 21:24

Adding NativeWorkloadScheduling feature gate

1ba1be8

Signed-off-by: Mark Rossetti <marosset@microsoft.com>

Register schedulingv1alpha2 scheme in ray-operator

71b3d52

Signed-off-by: Mark Rossetti <marosset@microsoft.com>

Adding RBAC rules for scheudling.k8s.io resources + helm integration

c8b31df

Signed-off-by: Mark Rossetti <marosset@microsoft.com>

Adding functions to build Workload and PodGroup for NativeWorkloadSch…

ffbbd8e

…eduling Signed-off-by: Mark Rossetti <marosset@microsoft.com>

Wiring native_workload_scheduling.go into raycluster_controller.go

3fe5df2

Signed-off-by: Mark Rossetti <marosset@microsoft.com>

NativeWorkloadScheduling startup validation

55d19d2

Signed-off-by: Mark Rossetti <marosset@microsoft.com>

Adding watches for Workload/PodGroup objects

77f89a4

Signed-off-by: Mark Rossetti <marosset@microsoft.com>

marosset force-pushed the workload-poc branch from a04b0a8 to dc7d557 Compare May 11, 2026 21:25

marosset added 4 commits May 11, 2026 21:59

Adding e2e tests for native gang scheudling + some changes

ea19bf3

Signed-off-by: Mark Rossetti <marosset@microsoft.com>

Workergroup lifecycle management

38f2898

Signed-off-by: Mark Rossetti <marosset@microsoft.com>

Scheduling conditions

f44a3ec

Signed-off-by: Mark Rossetti <marosset@microsoft.com>

Some docs

7cc892c

Signed-off-by: Mark Rossetti <marosset@microsoft.com>

marosset force-pushed the workload-poc branch from dc7d557 to 7cc892c Compare May 11, 2026 22:00

marosset marked this pull request as ready for review May 11, 2026 22:06

marosset requested review from MortalHappiness, andrewsykim, kevin85421 and rueian as code owners May 11, 2026 22:06

cursor Bot reviewed May 11, 2026

chatgpt-codex-connector Bot reviewed May 11, 2026

mm4tt mentioned this pull request May 14, 2026

[WIP] feat(api): KEP-3015: Workload Aware Scheduling for TrainJob kubeflow/trainer#3219

Open

