Account for KubeRay Redis cleanup resources #11260
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: nerdeveloper.
Hi @mimowo, could you please take a look when you have a chance?
10f92dc to 71f1c08
}

func hasGCSFaultTolerance(rayClusterSpec *rayv1.RayClusterSpec) bool {
	return rayClusterSpec.GcsFaultToleranceOptions != nil
Could you check against the KubeRay code whether this is a sufficient check? I remember there was also a knob on the KubeRay deployment that acts as a global enabler; if it is disabled, this check would be irrelevant.
		podSets = append(podSets, workerPodSet)
	}

	if hasGCSFaultTolerance(rayClusterSpec) {
Since this is a delicate change, let's introduce a feature gate as a bailout option, say KubeRayAccountForRedisCleanup, which is Beta, and add a comment to GA it in Kueue 0.21.
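A rough sketch of how such a bailout gate could wrap the check, for illustration only. It does not use Kueue's actual feature-gate machinery: the `defaultGates` map and the `gateEnabled` and `accountForRedisCleanup` helpers are hypothetical, and the gate is assumed to default to enabled, as the "bailout" wording suggests.

```go
package main

import "fmt"

// KubeRayAccountForRedisCleanup is the gate name proposed in the review.
const KubeRayAccountForRedisCleanup = "KubeRayAccountForRedisCleanup"

// defaultGates is a hypothetical in-memory registry of gate defaults.
// Assumption: the Beta gate is enabled unless explicitly turned off.
var defaultGates = map[string]bool{KubeRayAccountForRedisCleanup: true}

// gateEnabled returns the override for name if one is set,
// otherwise the default value.
func gateEnabled(overrides map[string]bool, name string) bool {
	if v, ok := overrides[name]; ok {
		return v
	}
	return defaultGates[name]
}

// accountForRedisCleanup reports whether the redis-cleanup PodSet should be
// added: the gate must be on and GCS fault tolerance must be configured.
func accountForRedisCleanup(overrides map[string]bool, gcsFTEnabled bool) bool {
	return gateEnabled(overrides, KubeRayAccountForRedisCleanup) && gcsFTEnabled
}

func main() {
	// Gate left at its default.
	fmt.Println(accountForRedisCleanup(nil, true))
	// Gate explicitly disabled as a bailout.
	off := map[string]bool{KubeRayAccountForRedisCleanup: false}
	fmt.Println(accountForRedisCleanup(off, true))
}
```

With the gate switched off, the PodSet list would fall back to head and workers only, which is the pre-change behavior.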
mimowo
left a comment
cc @yaroslava-serdiuk ptal
What this PR does / why we need it
When KubeRay GCS fault tolerance is enabled, KubeRay creates a Redis cleanup Job during RayCluster deletion. Kueue previously built Workload PodSets only for the Ray head and workers, so the cleanup Job resources were not reserved.
This adds a synthetic redis-cleanup PodSet with the same 200m CPU and 256Mi memory requests used by KubeRay, and updates RayCluster/RayService PodSet count handling and validation.
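The shape of the change can be sketched roughly as follows. This is a simplified, standard-library-only illustration, not the code in this PR: `Resources`, `PodSet`, `RayClusterSpec`, and `buildPodSets` are hypothetical stand-ins for the real Kueue and KubeRay types, and a count of 1 for the cleanup PodSet is an assumption.

```go
package main

import "fmt"

// Resources is a simplified stand-in for a Kubernetes resource list.
type Resources struct {
	MilliCPU  int64 // CPU request in millicores
	MemoryMiB int64 // memory request in MiB
}

// PodSet is a hypothetical, trimmed-down Workload PodSet.
type PodSet struct {
	Name     string
	Count    int32
	Requests Resources
}

// RayClusterSpec models only what this sketch needs.
type RayClusterSpec struct {
	GcsFaultToleranceOptions *struct{} // non-nil when GCS fault tolerance is configured
	Head                     PodSet
	Workers                  []PodSet
}

// buildPodSets returns head and worker PodSets and, when GCS fault tolerance
// is configured, appends a synthetic redis-cleanup PodSet with the
// 200m CPU / 256Mi memory requests mentioned in the description.
func buildPodSets(spec *RayClusterSpec) []PodSet {
	podSets := []PodSet{spec.Head}
	podSets = append(podSets, spec.Workers...)
	if spec.GcsFaultToleranceOptions != nil {
		podSets = append(podSets, PodSet{
			Name:     "redis-cleanup",
			Count:    1, // assumption: the cleanup Job runs a single pod
			Requests: Resources{MilliCPU: 200, MemoryMiB: 256},
		})
	}
	return podSets
}

func main() {
	spec := &RayClusterSpec{
		GcsFaultToleranceOptions: &struct{}{},
		Head:                     PodSet{Name: "head", Count: 1},
		Workers:                  []PodSet{{Name: "workers", Count: 1}},
	}
	for _, ps := range buildPodSets(spec) {
		fmt.Println(ps.Name)
	}
}
```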
Which issue(s) this PR fixes
Fixes #10946
Special notes for your reviewer
Verified with a local Kind repro that the Workload now includes head, workers, and redis-cleanup PodSets, with ClusterQueue reservation cpu=2200m and memory=1280Mi for the minimal repro.
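As a sanity check on those totals, the reservation is the sum of count times request over all PodSets. The sketch below assumes the head and a single worker each request 1000m CPU and 512Mi memory; that split is a guess that happens to match the reported totals, not something stated in the repro. The `podSet` type and `totalRequests` helper are hypothetical.

```go
package main

import "fmt"

// podSet is a hypothetical, simplified PodSet with per-pod requests.
type podSet struct {
	name      string
	count     int64
	milliCPU  int64
	memoryMiB int64
}

// totalRequests sums count * per-pod request across all PodSets.
func totalRequests(podSets []podSet) (milliCPU, memoryMiB int64) {
	for _, ps := range podSets {
		milliCPU += ps.count * ps.milliCPU
		memoryMiB += ps.count * ps.memoryMiB
	}
	return milliCPU, memoryMiB
}

func main() {
	podSets := []podSet{
		{name: "head", count: 1, milliCPU: 1000, memoryMiB: 512},    // assumed split
		{name: "workers", count: 1, milliCPU: 1000, memoryMiB: 512}, // assumed split
		{name: "redis-cleanup", count: 1, milliCPU: 200, memoryMiB: 256},
	}
	cpu, mem := totalRequests(podSets)
	fmt.Printf("cpu=%dm memory=%dMi\n", cpu, mem) // prints "cpu=2200m memory=1280Mi"
}
```

Whatever the actual head and worker split, the cleanup PodSet accounts for 200m and 256Mi of the totals, leaving 2000m CPU and 1024Mi memory for head and workers combined.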
Does this PR introduce a user-facing change?
Additional documentation
None