From c4db296ba64fdb5c923751a4d70d7d2db5bcfb65 Mon Sep 17 00:00:00 2001 From: Andrey Velichkevich Date: Tue, 17 Feb 2026 19:04:20 +0000 Subject: [PATCH 1/9] feat(api): KEP-3015: Workload Aware Scheduling for TrainJob Signed-off-by: Andrey Velichkevich --- .../3015-workload-aware-scheduling/README.md | 688 ++++++++++++++++++ 1 file changed, 688 insertions(+) create mode 100644 docs/proposals/3015-workload-aware-scheduling/README.md diff --git a/docs/proposals/3015-workload-aware-scheduling/README.md b/docs/proposals/3015-workload-aware-scheduling/README.md new file mode 100644 index 0000000000..a07992611d --- /dev/null +++ b/docs/proposals/3015-workload-aware-scheduling/README.md @@ -0,0 +1,688 @@ +# KEP-3015: Workload Aware Scheduling for TrainJob + +## Summary + +This document proposes integrating the Kubernetes Workload API into Kubeflow Trainer to enable +native workload aware scheduling for TrainJobs. The Workload API, introduced in Kubernetes v1.35 (alpha) +and targeting v1.37 (beta), provides multiple features to enhance AI workload scheduling orchestration +including gang-scheduling: [KEP-4671](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4671-gang-scheduling), +topology-aware scheduling: [KEP-5732](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/5732-topology-aware-workload-scheduling), +DRA: [KEP-5729](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/5729-resourceclaim-support-for-workloads), +and other features through the Workload and PodGroup resources. + +This integration will be implemented as a new `PodGroupPolicy` plugin following the existing Trainer +Pipeline Framework pattern. + +## Motivation + +In the initial implementation, we plan to support gang-scheduling to ensure that all Pods of a +TrainJob are scheduled together, preventing resource deadlock and idle GPU time. Currently, +Kubeflow Trainer supports gang scheduling through external solutions: + +- **Coscheduling plugin**: Requires installing the Kubernetes scheduler-plugins project +- **Volcano scheduler**: Requires deploying the Volcano scheduling system + +The Kubernetes community has recognized gang scheduling as a fundamental requirement and introduced +native support through the Workload API. This API is becoming the standard way to express gang +scheduling in Kubernetes, with integration planned for: + +- **Job controller**: [KEP-5547](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4671-gang-scheduling): Automatic Workload/PodGroup creation for parallel jobs +- **JobSet**: [KEP-969](https://github.com/kubernetes-sigs/jobset/pull/1068): Gang scheduling support for job groups + +### Goals + +1. Implement a Workload plugin following the Trainer Pipeline Framework pattern to support gang scheduling via the native Kubernetes Workload API +1. Support automatic creation of Workload and PodGroup objects with proper lifecycle management tied to TrainJob +1. Ensure gang-scheduling works with MPI plugin +1. Support for TrainJob initializers +1. Maintain backward compatibility with existing Coscheduling and Volcano plugins + +### Non-Goals + +1. Replace existing Coscheduling or Volcano plugins - this is an additional scheduling option +1. Support all Workload API features immediately - focus on core gang scheduling first +1. Support Kubernetes versions < 1.35 - Workload API requires v1.35+ +1. Delegate Workload creation to JobSet - the TrainJob controller (as the highest-level controller) + must create the Workload object + +## Proposal + +### User Stories + +#### Story 1 + +As a platform engineer, I want to configure ClusterTrainingRuntime to use native Kubernetes +gang-scheduling without installing external scheduler plugins. + +The ClusterTrainingRuntime and TrainJob may look as follows: + +```yaml +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: ClusterTrainingRuntime +metadata: + name: torch-distributed +spec: + mlPolicy: + numNodes: 1 + torch: {} + podGroupPolicy: + workload: {} + template: + spec: + replicatedJobs: + - name: node + template: + spec: + template: + metadata: + labels: + trainer.kubeflow.org/trainjob-ancestor-step: trainer + spec: + containers: + - name: node + image: pytorch/pytorch:2.9.1-cuda12.8-cudnn9-runtime +--- +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: TrainJob +metadata: + name: my-job +spec: + runtimeRef: + name: torch-distributed + trainer: + image: docker.io/torch-run + numNodes: 100 + resourcesPerNode: + requests: + nvidia.com/gpu: 4 +``` + +When this plugin is enabled, the TrainJob controller will create the following resources: + +```yaml +apiVersion: scheduling.k8s.io/v1alpha2 +kind: Workload +metadata: + name: my-job + ownerReferences: + - apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job +spec: + controllerRef: + apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job + podGroupTemplates: + - name: trainer + schedulingPolicy: + gang: + minCount: 100 # Equal to trainJob.spec.trainer.numNodes or mlPolicy.numNodes +``` + +The PodGroup will be created automatically by the TrainJob controller + +```yaml +apiVersion: scheduling.k8s.io/v1alpha1 +kind: PodGroup +metadata: + name: my-job-trainer- + ownerReferences: + - apiVersion: scheduling.k8s.io/v1alpha2 + kind: Workload + name: + - apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job +spec: + podGroupTemplateRef: + workloadName: my-job + podGroupTemplateName: trainer + schedulingPolicy: + gang: + minCount: 100 +``` + +And the Pod specs will be updated with the scheduling group: + +```yaml +spec: + schedulingGroup: + podGroupName: my-job-trainer- +``` + +#### Story 2 + +As a platform engineer, I want to configure MPI-based distributed training with gang scheduling using +the Workload API to ensure all MPI nodes are scheduled together. + +The ClusterTrainingRuntime and TrainJob may look as follows: + +```yaml +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: ClusterTrainingRuntime +metadata: + name: deepspeed-distributed + labels: + trainer.kubeflow.org/framework: deepspeed +spec: + mlPolicy: + numNodes: 1 + mpi: + numProcPerNode: 4 + mpiImplementation: OpenMPI + template: + spec: + network: + publishNotReadyAddresses: true + successPolicy: + operator: All + targetReplicatedJobs: + - launcher + replicatedJobs: + - name: launcher + template: + metadata: + labels: + trainer.kubeflow.org/trainjob-ancestor-step: trainer + spec: + template: + spec: + containers: + - name: node + image: ghcr.io/kubeflow/trainer/deepspeed-runtime + securityContext: + runAsUser: 1000 + - name: node + template: + spec: + template: + spec: + containers: + - name: node + image: ghcr.io/kubeflow/trainer/deepspeed-runtime + securityContext: + runAsUser: 1000 + command: + - /usr/sbin/sshd + args: + - -De + - -f + - /home/mpiuser/.sshd_config + readinessProbe: + tcpSocket: + port: 2222 + initialDelaySeconds: 5 +--- +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: TrainJob +metadata: + name: my-job +spec: + runtimeRef: + name: deepspeed-distributed + trainer: + numNodes: 50 + resourcesPerNode: + requests: + nvidia.com/gpu: 4 +``` + +When this plugin is enabled, the TrainJob controller will create the Workload with one PodGroup +for launcher+node. + +```yaml +apiVersion: scheduling.k8s.io/v1alpha2 +kind: Workload +metadata: + name: my-job + ownerReferences: + - apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job +spec: + controllerRef: + apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job + podGroupTemplates: + - name: trainer + schedulingPolicy: + gang: + minCount: 50 # Equal to trainJob.spec.trainer.numNodes +``` + +The corresponding PodGroups will be created: + +```yaml +apiVersion: scheduling.k8s.io/v1alpha1 +kind: PodGroup +metadata: + name: my-job-trainer- + ownerReferences: + - apiVersion: scheduling.k8s.io/v1alpha2 + kind: Workload + name: my-job + - apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job +spec: + podGroupTemplateRef: + workloadName: my-job + podGroupTemplateName: trainer + schedulingPolicy: + gang: + minCount: 50 +``` + +And the Pod specs will be updated with the scheduling group: + +```yaml +spec: + schedulingGroup: + podGroupName: my-job-trainer- +``` + +#### Story 3 + +As a platform engineer, I want to configure LLM fine-tuning with dataset/model initializers and +gang scheduling. The initializers and trainer should have separate PodGroups to ensure proper +scheduling coordination. + +The ClusterTrainingRuntime with Workload API enabled and TrainJob may look as follows: + +```yaml +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: ClusterTrainingRuntime +metadata: + name: torchtune-qwen2.5-1.5b + labels: + trainer.kubeflow.org/framework: torchtune +spec: + mlPolicy: + numNodes: 1 + torch: {} + podGroupPolicy: + workload: {} + template: + spec: + volumeClaimPolicies: + - templates: + - metadata: + name: initializer + spec: + accessModes: ["ReadWriteOnce"] + resources: + requests: + storage: 20Gi + replicatedJobs: + - name: dataset-initializer + template: + metadata: + labels: + trainer.kubeflow.org/trainjob-ancestor-step: dataset-initializer + spec: + template: + spec: + containers: + - name: dataset-initializer + image: ghcr.io/kubeflow/trainer/dataset-initializer + env: + - name: STORAGE_URI + value: hf://tatsu-lab/alpaca + volumeMounts: + - mountPath: /workspace + name: initializer + - name: model-initializer + template: + metadata: + labels: + trainer.kubeflow.org/trainjob-ancestor-step: model-initializer + spec: + template: + spec: + containers: + - name: model-initializer + image: ghcr.io/kubeflow/trainer/model-initializer + env: + - name: STORAGE_URI + value: hf://Qwen/Qwen2.5-1.5B-Instruct + volumeMounts: + - name: initializer + mountPath: /workspace + - name: node + dependsOn: + - name: dataset-initializer + status: Complete + - name: model-initializer + status: Complete + template: + metadata: + labels: + trainer.kubeflow.org/trainjob-ancestor-step: trainer + spec: + template: + spec: + containers: + - name: node + image: ghcr.io/kubeflow/trainer/torchtune-trainer + command: + - tune + - run + - full_finetune_distributed + - --config + - qwen2_5/1.5B_full + - dataset.source=parquet + - dataset.data_dir=/workspace/dataset/data + - output_dir=/workspace/output + - tokenizer.path=/workspace/model/vocab.json + - tokenizer.merges_file=/workspace/model/merges.txt + - checkpointer.checkpoint_dir=/workspace/model + resources: + limits: + nvidia.com/gpu: 2 + volumeMounts: + - mountPath: /workspace + name: initializer +--- +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: TrainJob +metadata: + name: my-job +spec: + runtimeRef: + name: torchtune-qwen2.5-1.5b +``` + +When this plugin is enabled, the TrainJob controller will create the Workload with two PodGroup +templates - one for initializer and one for the trainer. The initializer PodGroup does not require +gang-scheduling to enable lazy-loading: + +```yaml +apiVersion: scheduling.k8s.io/v1alpha2 +kind: Workload +metadata: + name: my-job + ownerReferences: + - apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job +spec: + controllerRef: + apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job + podGroupTemplates: + - name: initializer + schedulingPolicy: + basic: {} + - name: trainer + schedulingPolicy: + gang: + minCount: 8 # Equal to trainJob.spec.trainer.numNodes +``` + +The corresponding PodGroups will be created: + +```yaml +apiVersion: scheduling.k8s.io/v1alpha1 +kind: PodGroup +metadata: + name: my-job-initializer- + ownerReferences: + - apiVersion: scheduling.k8s.io/v1alpha2 + kind: Workload + name: my-job + - apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job +spec: + podGroupTemplateRef: + workloadName: my-job + podGroupTemplateName: initializer + schedulingPolicy: + basic: {} +--- +apiVersion: scheduling.k8s.io/v1alpha1 +kind: PodGroup +metadata: + name: my-job-trainer- + ownerReferences: + - apiVersion: scheduling.k8s.io/v1alpha2 + kind: Workload + name: my-job + - apiVersion: trainer.kubeflow.org/v1alpha1 + kind: TrainJob + name: my-job +spec: + podGroupTemplateRef: + workloadName: my-job + podGroupTemplateName: trainer + schedulingPolicy: + gang: + minCount: 8 +``` + +Each Pod will be associated with its respective PodGroup: + +```yaml +# Initializer Pods (dataset-initializer, model-initializer) +spec: + schedulingGroup: + podGroupName: my-job-initializer- +--- +# Trainer Pod +spec: + schedulingGroup: + podGroupName: my-job-trainer- +``` + +## Design Details + +### Kubernetes Workload API Overview + +The Workload API introduces two new resource types: + +- **Workload**: A static template defining scheduling policies and `PodGroupTemplates`. +- **PodGroup**: Runtime instances representing actual pod groups with status tracking. + +The key design principle is that **the highest-level controller creates the Workload object**. +Since TrainJob is the top-level resource in Kubeflow Trainer, the TrainJob controller must create +the Workload object, not JobSet. + +### Workload Scheduling API + +We extend `PodGroupPolicySource` to include a Workload option: + +```go +// PodGroupPolicySource configures the gang-scheduling plugin. +// Only one of its members may be specified. +type PodGroupPolicySource struct { + // Coscheduling plugin from the Kubernetes scheduler-plugins for gang-scheduling. + Coscheduling *CoschedulingPodGroupPolicySource `json:"coscheduling,omitempty"` + + // Volcano plugin from the Volcano scheduler for gang-scheduling. + Volcano *VolcanoPodGroupPolicySource `json:"volcano,omitempty"` + + // Workload plugin using native Kubernetes Workload API for gang-scheduling + // Requires Kubernetes v1.35+ with GenericWorkload and GangScheduling feature gates enabled. + Workload *WorkloadPodGroupPolicySource `json:"workload,omitempty"` +} + +// WorkloadPodGroupPolicySource configures scheduling behavior using Kubernetes Workload API. +type WorkloadPodGroupPolicySource struct {} +``` + +The `WorkloadPodGroupPolicySource` struct is intentionally minimal for the initial implementation. +The `minCount` for gang scheduling is automatically derived from `mlPolicy.numNodes` or +`trainJob.spec.trainer.numNodes`. + +### Workload Runtime Plugin + +Similar to the Coscheduling and Volcano plugins, we implement a Workload plugin in +`pkg/runtime/framework/plugins/workload/workload.go`. This plugin implements the following interfaces +from the Pipeline Framework: + +#### Build Phase + +The plugin implements the `ComponentBuilder` interface to build the Workload object: + +```go +func (w *Workload) Build(ctx context.Context, info *runtime.Info, trainJob *trainv1alpha1.TrainJob) (client.Object, error) { + // 1. Extract numNodes from runtime info + // 2. Build Workload object with: + // - controllerRef pointing to TrainJob + // - podGroupTemplate with gang scheduling policy + // - minCount equal to numNodes + // 3. Build PodGroup objects. + // 4. Return Workload object for deployment +} +``` + +#### EnforcePodGroupPolicy Phase + +The plugin implements the `EnforcePodGroupPolicy` interface to configure the `schedulingGroup` field in Pod specs: + +```go +func (w *Workload) EnforcePodGroupPolicy(info *runtime.Info, trainJob *trainv1alpha1.TrainJob) error { + // 1. Set schedulingGroup.podGroupName in all Pod templates +} +``` + +#### WatchExtension Phase + +The plugin implements `WatchExtension` to watch Workload resources and trigger TrainJob reconciliation: + +```go +func (w *Workload) ReconcilerBuilders() []runtime.ReconcilerBuilder { + // 1. Watch Workload and PodGroup resources owned by TrainJob + // 2. Trigger reconciliation on PodGroup status changes +} +``` + +### Resource Lifecycle + +1. **Creation**: When a TrainJob is created with `podGroupPolicy.workload` configured, the Workload + plugin creates the Workload and PodGroup objects with `ownerReferences` pointing to the TrainJob. + +1. **Pod Association**: The plugin injects `schedulingGroup.podGroupName` into Pod specs, linking Pods to their PodGroup. + +1. **Scheduling**: The kube-scheduler uses the Workload Scheduling Cycle to process entire PodGroups + atomically, ensuring all Pods in a gang are scheduled together. + +1. **Suspension**: When the TrainJob is suspended, Workload and PodGroup resources are preserved. + In the future, we can re-create to ensure elastic TrainJob support. + +1. **Deletion**: When the TrainJob is deleted, Kubernetes garbage collection automatically cleans up the Workload object (and subsequently the PodGroup). + +The TrainJob controller requires permissions to manage Workload resources: + +```go +// +kubebuilder:rbac:groups=scheduling.k8s.io,resources=workloads,verbs=get;list;watch;create;update;patch +// +kubebuilder:rbac:groups=scheduling.k8s.io,resources=workloads/status,verbs=get +// +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups,verbs=get;list;watch;create;update;patch;list;watch +// +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups/status,verbs=get +``` + +### Feature Gate Dependencies + +The Workload API requires the following Kubernetes feature gates to be enabled: + +- `GenericWorkload`: Enables the Workload and PodGroup APIs + +The same feature gate is required to be enabled in TrainJob config: + +``` +GenericWorkload +``` + +## Test Plan + +- [x] I/we understand the owners of the involved components may require updates to + existing tests to make this code solid enough prior to committing the changes necessary + to implement this enhancement. + +### Unit Tests + +- Workload and PodGroup creation based on the Runtime spec +- Correct calculation of `minCount` +- Proper `controllerRef` and `ownerReferences` configuration +- Pod template injection of `schedulingGroup` field + +### Integration Tests + +- Workload and PodGroup creation by TrainJob controller +- Workload creation for TrainJob with MPI plugin +- Workload creation for TrainJob with initializer +- TrainJob deletion triggers Workload cleanup +- Suspended TrainJob behavior with Workload + +### E2E Tests + +Verify the Workload and PodGroup orchestration as part of Trainer E2E tests. + +## Future Plans + +We plan to integrate other features like DRA, Topology-Aware Scheduling, Preemption as part of +the Workload API in the future. + +For example, we can introduce TopologyConstraint under Workload API to allow users setting topology: + +```yaml +podGroupPolicy: + workload: + topologyConstraint: + level: topology.kubernetes.io/rack +``` + +## Implementation History + +- 2026-02-16: KEP Creation for gang-scheduling support + +## Alternatives + +### Set Workload Spec in the TrainJob API + +We can integrate the Workload spec in the TrainJob API directly. That might introduce in-consistency +with other PodGroupPolicy plugins. Alternatively, we can allow to set Workload parameters in the +Runtime and TrainJob spec, similarly to the Trainer/Initializer. + +```yaml +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: TrainJob +metadata: + name: my-job +spec: + runtimeRef: + name: torch-distributed + workloadSpec: {} +``` + +### Custom Setting for PodGroupTemplates + +We can allow users to override default behavior of Workload API for custom orchestration of TrainJob. +That will require us to extend `.workload` API as follows: + +```yaml +apiVersion: trainer.kubeflow.org/v1alpha1 +kind: ClusterTrainingRuntime +metadata: + name: custom-workload +spec: + podGroupPolicy: + workload: + podGroupTemplates: + - name: trainer + targetJobs: + - name: launcher + - name: node + schedulingPolicy: + gang: + minCount: 10 + - name: evaluator + targetJobs: + - name: evaluator + schedulingPolicy: + gang: + minCount: 2 +``` From 2f5f8a406c02df69e7268844ac51cde2e1f01d1d Mon Sep 17 00:00:00 2001 From: Andrey Velichkevich Date: Wed, 18 Feb 2026 14:45:42 +0000 Subject: [PATCH 2/9] Update docs/proposals/3015-workload-aware-scheduling/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich --- docs/proposals/3015-workload-aware-scheduling/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/proposals/3015-workload-aware-scheduling/README.md b/docs/proposals/3015-workload-aware-scheduling/README.md index a07992611d..d36fdfc50d 100644 --- a/docs/proposals/3015-workload-aware-scheduling/README.md +++ b/docs/proposals/3015-workload-aware-scheduling/README.md @@ -26,7 +26,7 @@ The Kubernetes community has recognized gang scheduling as a fundamental require native support through the Workload API. This API is becoming the standard way to express gang scheduling in Kubernetes, with integration planned for: -- **Job controller**: [KEP-5547](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4671-gang-scheduling): Automatic Workload/PodGroup creation for parallel jobs +- **Job controller**: [KEP-4671](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4671-gang-scheduling): Automatic Workload/PodGroup creation for parallel jobs - **JobSet**: [KEP-969](https://github.com/kubernetes-sigs/jobset/pull/1068): Gang scheduling support for job groups ### Goals From c79eedb3295bd8b2d8df8e1fd60984df39626655 Mon Sep 17 00:00:00 2001 From: Andrey Velichkevich Date: Wed, 18 Feb 2026 14:48:53 +0000 Subject: [PATCH 3/9] Update docs/proposals/3015-workload-aware-scheduling/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andrey Velichkevich --- docs/proposals/3015-workload-aware-scheduling/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/proposals/3015-workload-aware-scheduling/README.md b/docs/proposals/3015-workload-aware-scheduling/README.md index d36fdfc50d..572ab5228e 100644 --- a/docs/proposals/3015-workload-aware-scheduling/README.md +++ b/docs/proposals/3015-workload-aware-scheduling/README.md @@ -528,14 +528,14 @@ from the Pipeline Framework: The plugin implements the `ComponentBuilder` interface to build the Workload object: ```go -func (w *Workload) Build(ctx context.Context, info *runtime.Info, trainJob *trainv1alpha1.TrainJob) (client.Object, error) { +func (w *Workload) Build(ctx context.Context, info *runtime.Info, trainJob *trainv1alpha1.TrainJob) ([]runtime.ApplyConfiguration, error) { // 1. Extract numNodes from runtime info // 2. Build Workload object with: // - controllerRef pointing to TrainJob // - podGroupTemplate with gang scheduling policy // - minCount equal to numNodes // 3. Build PodGroup objects. - // 4. Return Workload object for deployment + // 4. Return a slice of apply configurations for the Workload and PodGroup objects } ``` From e41068c58c3a6a7071d076ba913888c06aeb16de Mon Sep 17 00:00:00 2001 From: Andrey Velichkevich Date: Wed, 18 Feb 2026 14:49:58 +0000 Subject: [PATCH 4/9] Fix feature gates Signed-off-by: Andrey Velichkevich --- docs/proposals/3015-workload-aware-scheduling/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/proposals/3015-workload-aware-scheduling/README.md b/docs/proposals/3015-workload-aware-scheduling/README.md index 572ab5228e..7ac39b144c 100644 --- a/docs/proposals/3015-workload-aware-scheduling/README.md +++ b/docs/proposals/3015-workload-aware-scheduling/README.md @@ -505,7 +505,7 @@ type PodGroupPolicySource struct { Volcano *VolcanoPodGroupPolicySource `json:"volcano,omitempty"` // Workload plugin using native Kubernetes Workload API for gang-scheduling - // Requires Kubernetes v1.35+ with GenericWorkload and GangScheduling feature gates enabled. + // Requires Kubernetes v1.35+ with GenericWorkload feature gates enabled. Workload *WorkloadPodGroupPolicySource `json:"workload,omitempty"` } @@ -580,7 +580,7 @@ The TrainJob controller requires permissions to manage Workload resources: ```go // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=workloads,verbs=get;list;watch;create;update;patch // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=workloads/status,verbs=get -// +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups,verbs=get;list;watch;create;update;patch;list;watch +// +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups,verbs=get;list;watch;create;update;patch // +kubebuilder:rbac:groups=scheduling.k8s.io,resources=podgroups/status,verbs=get ``` @@ -643,7 +643,7 @@ podGroupPolicy: ### Set Workload Spec in the TrainJob API -We can integrate the Workload spec in the TrainJob API directly. That might introduce in-consistency +We can integrate the Workload spec in the TrainJob API directly. That might introduce inconsistency with other PodGroupPolicy plugins. Alternatively, we can allow to set Workload parameters in the Runtime and TrainJob spec, similarly to the Trainer/Initializer. From e9b88f8914a8c0947912e210f02d2f7660ea9595 Mon Sep 17 00:00:00 2001 From: Andrey Velichkevich Date: Wed, 18 Feb 2026 15:00:10 +0000 Subject: [PATCH 5/9] Update Custom Setting for PodGroupTemplates alternative Signed-off-by: Andrey Velichkevich --- docs/proposals/3015-workload-aware-scheduling/README.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/proposals/3015-workload-aware-scheduling/README.md b/docs/proposals/3015-workload-aware-scheduling/README.md index 7ac39b144c..66da61ee06 100644 --- a/docs/proposals/3015-workload-aware-scheduling/README.md +++ b/docs/proposals/3015-workload-aware-scheduling/README.md @@ -660,8 +660,10 @@ spec: ### Custom Setting for PodGroupTemplates -We can allow users to override default behavior of Workload API for custom orchestration of TrainJob. -That will require us to extend `.workload` API as follows: +We could allow users to override the default Workload API behavior to enable custom orchestration for +TrainJob. To support this, we would need to extend the `.workload` API accordingly. +Alternatively, we could rely on [the JobSet integration](https://github.com/kubernetes-sigs/jobset/pull/1068) +to provide users with more fine-grained control over PodGroup configuration. ```yaml apiVersion: trainer.kubeflow.org/v1alpha1 From 5be4948318fc81bbf1219a0b7be80cfad256aa84 Mon Sep 17 00:00:00 2001 From: Andrey Velichkevich Date: Wed, 18 Feb 2026 15:08:58 +0000 Subject: [PATCH 6/9] Add Defaulting/Validation Signed-off-by: Andrey Velichkevich --- docs/proposals/3015-workload-aware-scheduling/README.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/proposals/3015-workload-aware-scheduling/README.md b/docs/proposals/3015-workload-aware-scheduling/README.md index 66da61ee06..aaafe6f4a2 100644 --- a/docs/proposals/3015-workload-aware-scheduling/README.md +++ b/docs/proposals/3015-workload-aware-scheduling/README.md @@ -596,6 +596,13 @@ The same feature gate is required to be enabled in TrainJob config: GenericWorkload ``` +## Defaulting/Validation + +- Workload API should not be configured in JobSet or Job templates +- Only one of `coscheduling`, `volcano`, or `workload` may be specified (CEL) +- `GenericWorkload` feature gate must be enabled (webhook) +- `spec.schedulingGroup` must not be manually set in Pod templates + ## Test Plan - [x] I/we understand the owners of the involved components may require updates to From 551c56796564a78893d040dcc8f826b9e88dbd03 Mon Sep 17 00:00:00 2001 From: Andrey Velichkevich Date: Mon, 23 Feb 2026 15:55:33 +0000 Subject: [PATCH 7/9] Add OwnerReferences Relationship section Signed-off-by: Andrey Velichkevich --- .../3015-workload-aware-scheduling/README.md | 32 +++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/docs/proposals/3015-workload-aware-scheduling/README.md b/docs/proposals/3015-workload-aware-scheduling/README.md index aaafe6f4a2..5edba06616 100644 --- a/docs/proposals/3015-workload-aware-scheduling/README.md +++ b/docs/proposals/3015-workload-aware-scheduling/README.md @@ -108,6 +108,7 @@ metadata: - apiVersion: trainer.kubeflow.org/v1alpha1 kind: TrainJob name: my-job + controller: true spec: controllerRef: apiVersion: trainer.kubeflow.org/v1alpha1 @@ -134,6 +135,7 @@ metadata: - apiVersion: trainer.kubeflow.org/v1alpha1 kind: TrainJob name: my-job + controller: true spec: podGroupTemplateRef: workloadName: my-job @@ -240,6 +242,7 @@ metadata: - apiVersion: trainer.kubeflow.org/v1alpha1 kind: TrainJob name: my-job + controller: true spec: controllerRef: apiVersion: trainer.kubeflow.org/v1alpha1 @@ -266,6 +269,7 @@ metadata: - apiVersion: trainer.kubeflow.org/v1alpha1 kind: TrainJob name: my-job + controller: true spec: podGroupTemplateRef: workloadName: my-job @@ -407,6 +411,7 @@ metadata: - apiVersion: trainer.kubeflow.org/v1alpha1 kind: TrainJob name: my-job + controller: true spec: controllerRef: apiVersion: trainer.kubeflow.org/v1alpha1 @@ -436,6 +441,7 @@ metadata: - apiVersion: trainer.kubeflow.org/v1alpha1 kind: TrainJob name: my-job + controller: true spec: podGroupTemplateRef: workloadName: my-job @@ -454,6 +460,7 @@ metadata: - apiVersion: trainer.kubeflow.org/v1alpha1 kind: TrainJob name: my-job + controller: true spec: podGroupTemplateRef: workloadName: my-job @@ -560,6 +567,31 @@ func (w *Workload) ReconcilerBuilders() []runtime.ReconcilerBuilder { } ``` +### OwnerReferences Relationship + +The ownerReferences relationship between `TrainJob`, `Workload`, `PodGroup`, and `Pod` is as follows: + +```mermaid +flowchart BT + Pod[Pod] + PodGroup[PodGroup] + Workload[Workload] + TrainJob[TrainJob] + + Workload -->|ownerRef| TrainJob + PodGroup -->|ownerRef| TrainJob + Pod -->|ownerRef| TrainJob + + PodGroup -->|ownerRef| Workload + Pod -->|ownerRef| PodGroup +``` + +- The `Workload` object has an ownerReference to the `TrainJob` object with `controller: true` in case it was created by the TrainJob controller +- The `PodGroup` object has an ownerReference to the `TrainJob` object with `controller: true` in case it was created by the TrainJob controller and another ownerReference to the `Workload` object +- The `Pod` object has an ownerReference to the `TrainJob` object with `controller: true` and another ownerReference to the `PodGroup` object + +By this ownerReferences relationship, garbage collection will remove objects accordingly that avoids orphaned Pods with a stale PodGroup reference. + ### Resource Lifecycle 1. **Creation**: When a TrainJob is created with `podGroupPolicy.workload` configured, the Workload From 3833e9575d698c371cf7288af1d6eb941972de50 Mon Sep 17 00:00:00 2001 From: Andrey Velichkevich Date: Wed, 22 Apr 2026 00:14:27 +0100 Subject: [PATCH 8/9] Fix k8s version to 1.36 Signed-off-by: Andrey Velichkevich --- docs/proposals/3015-workload-aware-scheduling/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/proposals/3015-workload-aware-scheduling/README.md b/docs/proposals/3015-workload-aware-scheduling/README.md index 5edba06616..7e4ab090ae 100644 --- a/docs/proposals/3015-workload-aware-scheduling/README.md +++ b/docs/proposals/3015-workload-aware-scheduling/README.md @@ -41,7 +41,7 @@ scheduling in Kubernetes, with integration planned for: 1. Replace existing Coscheduling or Volcano plugins - this is an additional scheduling option 1. Support all Workload API features immediately - focus on core gang scheduling first -1. Support Kubernetes versions < 1.35 - Workload API requires v1.35+ +1. Support Kubernetes versions < 1.35 - Workload API requires v1.36+ 1. Delegate Workload creation to JobSet - the TrainJob controller (as the highest-level controller) must create the Workload object @@ -168,6 +168,8 @@ metadata: labels: trainer.kubeflow.org/framework: deepspeed spec: + podGroupPolicy: + workload: {} mlPolicy: numNodes: 1 mpi: From f57da491c67dbf09447156168c860191bbc818f2 Mon Sep 17 00:00:00 2001 From: Andrey Velichkevich Date: Sat, 16 May 2026 00:04:46 +0100 Subject: [PATCH 9/9] Update docs/proposals/3015-workload-aware-scheduling/README.md Co-authored-by: Vassilis Vassiliadis Signed-off-by: Andrey Velichkevich --- docs/proposals/3015-workload-aware-scheduling/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/proposals/3015-workload-aware-scheduling/README.md b/docs/proposals/3015-workload-aware-scheduling/README.md index 7e4ab090ae..b607eb83c3 100644 --- a/docs/proposals/3015-workload-aware-scheduling/README.md +++ b/docs/proposals/3015-workload-aware-scheduling/README.md @@ -523,8 +523,8 @@ type WorkloadPodGroupPolicySource struct {} ``` The `WorkloadPodGroupPolicySource` struct is intentionally minimal for the initial implementation. -The `minCount` for gang scheduling is automatically derived from `mlPolicy.numNodes` or -`trainJob.spec.trainer.numNodes`. +The `minCount` for gang scheduling is automatically derived from`trainJob.spec.trainer.numNodes` +or `mlPolicy.numNodes`. ### Workload Runtime Plugin