Account for KubeRay Redis cleanup resources #11260

Open
nerdeveloper wants to merge 1 commit into kubernetes-sigs:main from nerdeveloper:fix-ray-gcs-cleanup-quota-10946
Conversation

@nerdeveloper
Member

What this PR does / why we need it

When KubeRay GCS fault tolerance is enabled, KubeRay creates a Redis cleanup Job during RayCluster deletion. Kueue previously built Workload PodSets only for the Ray head and workers, so the cleanup Job resources were not reserved.

This adds a synthetic redis-cleanup PodSet with the same 200m CPU and 256Mi memory requests used by KubeRay, and updates RayCluster/RayService PodSet count handling and validation.
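The PodSet construction described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `PodSet` is a simplified stand-in for Kueue's type, `buildPodSets` and `totalRequests` are hypothetical helpers, and a cleanup pod count of 1 is an assumption. Only the `redis-cleanup` name and the 200m / 256Mi requests come from the description.

```go
package main

// PodSet is a simplified, hypothetical stand-in for Kueue's PodSet type.
// The real type carries a full pod template instead of flat request fields.
type PodSet struct {
	Name      string
	Count     int64
	CPUMilli  int64 // per-pod CPU request in millicores
	MemoryMiB int64 // per-pod memory request in MiB
}

// Name and requests quoted in the PR description for the cleanup Job.
const (
	redisCleanupPodSetName       = "redis-cleanup"
	redisCleanupCPUMilli   int64 = 200
	redisCleanupMemoryMiB  int64 = 256
)

// buildPodSets returns the head and worker PodSets and, when GCS fault
// tolerance is enabled, appends a synthetic PodSet for the cleanup Job.
func buildPodSets(head PodSet, workers []PodSet, gcsFaultTolerance bool) []PodSet {
	podSets := make([]PodSet, 0, len(workers)+2)
	podSets = append(podSets, head)
	podSets = append(podSets, workers...)
	if gcsFaultTolerance {
		podSets = append(podSets, PodSet{
			Name:      redisCleanupPodSetName,
			Count:     1, // assumption: the cleanup Job runs a single pod
			CPUMilli:  redisCleanupCPUMilli,
			MemoryMiB: redisCleanupMemoryMiB,
		})
	}
	return podSets
}

// totalRequests sums count * per-pod request over all PodSets, which is the
// quantity a quota reservation would have to cover.
func totalRequests(podSets []PodSet) (cpuMilli, memoryMiB int64) {
	for _, ps := range podSets {
		cpuMilli += ps.Count * ps.CPUMilli
		memoryMiB += ps.Count * ps.MemoryMiB
	}
	return cpuMilli, memoryMiB
}
```

For illustration, a head and a single worker each requesting 1000m CPU and 512Mi (values assumed here, not taken from the PR) would total 2200m CPU and 1280Mi once the cleanup PodSet is added.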

Which issue(s) this PR fixes

Fixes #10946

Special notes for your reviewer

Verified with a local Kind repro that the Workload now includes head, workers, and redis-cleanup PodSets, with ClusterQueue reservation cpu=2200m and memory=1280Mi for the minimal repro.

Does this PR introduce a user-facing change?

NONE

Additional documentation

None

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label May 17, 2026
@netlify

netlify Bot commented May 17, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 71f1c08
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a09995c38314100087fc9fc

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nerdeveloper
Once this PR has been reviewed and has the lgtm label, please assign gabesaba for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 17, 2026
@nerdeveloper
Member Author

Hi @mimowo, could you please take a look when you have a chance?

@nerdeveloper nerdeveloper force-pushed the fix-ray-gcs-cleanup-quota-10946 branch from 10f92dc to 71f1c08 Compare May 17, 2026 10:32
}

func hasGCSFaultTolerance(rayClusterSpec *rayv1.RayClusterSpec) bool {
	return rayClusterSpec.GcsFaultToleranceOptions != nil
Contributor


Could you check against the KubeRay code whether this is a sufficient check? I remember there was also a knob on the KubeRay deployment that acts as a global enabler; if it is disabled, this check would be irrelevant.
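One way to picture the concern is a check that combines the per-cluster field with an operator-level switch. The sketch below is hypothetical: the types are local stand-ins rather than the KubeRay API, and `operatorCleanupEnabled` is an assumed parameter, since how Kueue would learn the operator's setting is not established in this thread.

```go
package main

// GcsFaultToleranceOptions and RayClusterSpec are minimal local stand-ins
// for the KubeRay API types, reduced to what this sketch needs.
type GcsFaultToleranceOptions struct {
	RedisAddress string
}

type RayClusterSpec struct {
	GcsFaultToleranceOptions *GcsFaultToleranceOptions
}

// needsRedisCleanupPodSet reports whether quota should be reserved for the
// cleanup Job. operatorCleanupEnabled is a hypothetical input standing in
// for the global enabler mentioned in the review comment.
func needsRedisCleanupPodSet(spec *RayClusterSpec, operatorCleanupEnabled bool) bool {
	if spec == nil {
		return false
	}
	return operatorCleanupEnabled && spec.GcsFaultToleranceOptions != nil
}
```

With this shape, a cluster that sets the fault tolerance options would still get no cleanup PodSet when the global switch is off.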

podSets = append(podSets, workerPodSet)
}

if hasGCSFaultTolerance(rayClusterSpec) {
Contributor


Since this is a delicate change, let's introduce a feature gate as a bailout option, say KubeRayAccountForRedisCleanup, which is Beta, and add a comment to GA it in Kueue 0.21.
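A rough sketch of how such a gate could guard the new PodSet is shown below. It uses a small map-based stand-in rather than Kueue's actual feature gate machinery, and it assumes a Beta gate defaults to enabled; only the gate name comes from the comment above, and `FeatureGates` and `shouldAddRedisCleanupPodSet` are hypothetical.

```go
package main

// Feature names a feature gate.
type Feature string

// KubeRayAccountForRedisCleanup is the gate name proposed in the review.
const KubeRayAccountForRedisCleanup Feature = "KubeRayAccountForRedisCleanup"

// FeatureGates is a hypothetical, simplified gate registry with defaults
// and explicit overrides.
type FeatureGates struct {
	defaults  map[Feature]bool
	overrides map[Feature]bool
}

// NewFeatureGates registers the gate as enabled by default.
// Assumption: Beta gates are on unless an operator turns them off.
func NewFeatureGates() *FeatureGates {
	return &FeatureGates{
		defaults:  map[Feature]bool{KubeRayAccountForRedisCleanup: true},
		overrides: map[Feature]bool{},
	}
}

// Set records an explicit override, e.g. to bail out of the new behavior.
func (g *FeatureGates) Set(f Feature, enabled bool) {
	g.overrides[f] = enabled
}

// Enabled returns the override if present, otherwise the default.
// Unknown features report false.
func (g *FeatureGates) Enabled(f Feature) bool {
	if v, ok := g.overrides[f]; ok {
		return v
	}
	return g.defaults[f]
}

// shouldAddRedisCleanupPodSet gates the synthetic PodSet on both the
// feature gate and the cluster's GCS fault tolerance setting.
func shouldAddRedisCleanupPodSet(g *FeatureGates, gcsFaultTolerance bool) bool {
	return g.Enabled(KubeRayAccountForRedisCleanup) && gcsFaultTolerance
}
```

Disabling the gate restores the previous head-and-workers-only accounting without a code change, which is the bailout the comment asks for.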


@mimowo mimowo left a comment



Development

Successfully merging this pull request may close these issues.

The Workload corresponding to RayService/RayCluster should reserve quota for RedisCleanup job
