Finish old workload slice only after new slice is admitted#11195
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
There was a problem hiding this comment.
Pull request overview
This PR fixes an elastic workload slice scale-up edge case by ensuring the old admitted WorkloadSlice is only finished after the replacement slice successfully reserves quota, preventing a “quota leak” scenario where the Job keeps running unsuspended with no admitted slice.
Changes:
- Move old-slice finishing into the scheduler admission success path (admit first, finish old after).
- Preserve MultiKueue placement by copying
Status.ClusterNamefrom the old slice onto the replacement before admission. - Add/adjust tests to validate the pending-replacement behavior and the resulting event ordering.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| test/integration/singlecluster/controller/jobs/job/job_controller_test.go | Adds an integration test asserting the old slice stays admitted/unfinished while the replacement slice remains pending. |
| pkg/workloadslicing/workloadslicing.go | Clarifies comment wording around explicit eviction causes for finishing an old slice. |
| pkg/scheduler/scheduler.go | Finishes old slice only after successful admission of the replacement; copies ClusterName pre-admission for MultiKueue continuity. |
| pkg/scheduler/scheduler_test.go | Updates expected event ordering to reflect finishing the old slice after admitting the new one. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if err := s.replaceWorkloadSlice(ctx, oldWorkloadSlice.WorkloadInfo.ClusterQueue, e.Obj, oldWorkloadSlice.WorkloadInfo.Obj.DeepCopy()); err != nil { | ||
| log.Error(err, "Failed to replace workload slice") | ||
| return err | ||
| log.Error(err, "Failed to finish old workload slice after admitting replacement; job reconciler will handle recovery") |
There was a problem hiding this comment.
I think this is ok, but IIUC if this fails, then we may observe double counting of the quota for the brief moment - before the JobReconciler finishes the old slices.
The quota bump may result in temporary excessive preemptions of other workloads, or blocking workloads that could be admitted otherwise.
So, I think ideally if the scheduler cache was aware of the fact that both slices are admitted at the same time, and only count capacity from the "max" of them, rather than "sum".
However, that quota bump is very rare (requires request failure), and the consequence is also limited (excessive preemptions for a brief moment), so my comment is mostly to confirm my understanding rather than to request changes.
We may re-evaluate if this is good enough or not before GA graduation.
There was a problem hiding this comment.
Yup, that's right. The double-count only happens if replaceOldWorkloadSlice fails after admission succeeds i.e. the old slice stays admitted in the cache alongside the new one until the job reconciler's EnsureWorkloadSlices finishes the old slice on its next reconcile. This is conservative (over-reports usage, blocks other admits temporarily) and resolves within one reconcile cycle. I agree it's fine for now, and the cache-level max-not-sum optimization is a good candidate for GA if it shows up in practice.
There was a problem hiding this comment.
We had discussed earlier that it could be helpful to "teach" the queue/cache how to exclude the unfinished old slice from quota accounting, so we don't observe an artificial quota increase in the case where updating the old slice fails.
It would be great if we could address that in this PR, but I also think it is reasonable as a follow-up.
In that case, since you now have all the context around this flow 🙂, would you mind creating a follow-up issue and linking it to this PR?
There was a problem hiding this comment.
How about I document this as a GA consideration in the ElasticJobsViaWorkloadSlices KEP instead? It's a pretty narrow failure path (API write failing right after a successful one) and @mimowo was also leaning toward revisiting at GA.
There was a problem hiding this comment.
+1 to updating the KEP, could you do it in this PR, please?
ichekrygin
left a comment
There was a problem hiding this comment.
Changes are LGTM overall, aside from two small nits, nothing blocking from my side.
Thank you for updating this!
| // assuming it in the cache. | ||
| // Note: this does not necessarily make the workload "admitted". | ||
| func (s *Scheduler) admit(ctx context.Context, e *entry, cq *schdcache.ClusterQueueSnapshot) error { | ||
| func (s *Scheduler) admit(ctx context.Context, e *entry, cq *schdcache.ClusterQueueSnapshot, oldWorkloadSlice *preemption.Target) error { |
There was a problem hiding this comment.
Nit: I understand why replaceOldWorkloadSlice needs to happen only after successful admission, but is there a reason it has to be inside admit rather than immediately after a successful admit call?
For example, we could keep admit generic and do:
if err := s.admit(ctx, e, cq); err != nil {
e.inadmissibleMsg = fmt.Sprintf("Failed to admit workload: %v", err)
return
}
if features.Enabled(features.ElasticJobsViaWorkloadSlices) && oldWorkloadSlice != nil {
s.replaceOldWorkloadSlice(ctx, log, e, oldWorkloadSlice)
}That would preserve the important ordering, finish the old slice only after the replacement was successfully admitted, while avoiding exposing admit to the workload-slice replacement concept. Most workloads are not elastic slice replacements, so passing oldWorkloadSlice through admit feels a bit leaky to me.
There was a problem hiding this comment.
I think admit() is async. The admissionRoutineWrapper.Run() launches a goroutine and admit() returns nil immediately before PatchAdmissionStatus completes. If we finish the old slice after admit() returns in processEntry(), we'd be finishing it before admission is actually confirmed. It's the same race we're fixing. That's why it has to live inside the goroutine's success path.
| // admitted. Called inside the admit success path so the old slice is only | ||
| // finished when the new one is confirmed. If this fails, the job reconciler's | ||
| // EnsureWorkloadSlices detects both slices admitted and finishes the old one. | ||
| func (s *Scheduler) replaceOldWorkloadSlice(ctx context.Context, log logr.Logger, e *entry, oldWorkloadSlice *preemption.Target) { |
There was a problem hiding this comment.
Nit: After the refactoring, replaceOldWorkloadSlice no longer appears to have meaningful control-flow value beyond local error reporting.
Would it make sense to make it fire-and-forget and move the error logging into the processEntry call site instead?
Something along the lines of:
if err := s.admit(ctx, e, cq); err != nil {
e.inadmissibleMsg = fmt.Sprintf("Failed to admit workload: %v", err)
return
}
if features.Enabled(features.ElasticJobsViaWorkloadSlices) && oldWorkloadSlice != nil {
if err := s.replaceOldWorkloadSlice(...); err != nil {
log.Error(err, "Failed to finish old workload slice after admitting replacement")
}
}This would keep admit generic and avoid exposing it to workload-slice replacement semantics for the common non-elastic workload path.
There was a problem hiding this comment.
Same reason as above — since this runs inside the async goroutine, there's no synchronous call site to bubble the error up to. Logging inside the function is the only option here.
| util.ExpectWorkloadsToBeAdmitted(ctx, k8sClient, oldWorkloadSlice) | ||
| util.ExpectJobUnsuspendedWithNodeSelectors(ctx, k8sClient, client.ObjectKeyFromObject(elasticJob), nil) | ||
|
|
||
| ginkgo.By("scaling the job beyond the remaining quota") |
There was a problem hiding this comment.
IFIUC If the newWorkload is beyond the remaining quota the newWorkload would not reserve the quota, and so Kueue won't try to admit the newWorkload and return in line 401.
So I don't think the added changes are tested here.
There was a problem hiding this comment.
The scheduler test framework can only fail all workload patches at once, not just the new slice's, so we can't simulate a selective admission failure here. The scheduler fix is structural: replaceOldWorkloadSlice only fires after PatchAdmissionStatus succeeds. This integration test covers the other side, making sure the job reconciler doesn't finish the old slice while the replacement is still pending.
There was a problem hiding this comment.
This integration test covers the other side, making sure the job reconciler doesn't finish the old slice while the replacement is still pending.
I don't think it's actually tested here, because the replacement doesn't happening here since the new workload is beyond the quota.
Additionally, the test passes when run without the changes.
There was a problem hiding this comment.
Thank you @yaroslava-serdiuk for spotting that. I think this test is wrong indeed. If the test is not testing the behavior change (passes before and after the change), then let's drop it.
We may consider a test which explicitly tests the invariant that OldSlice it not finished when transitioning the new slice to Admitted. This can be done with the watch pattern.
|
/release-note-edit |
|
/lgtm I reviewed and checked the discussions and I think we area good to merge as is. |
|
@mimowo: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
LGTM label has been added. DetailsGit tree hash: c282eb8306394e05439919f061fca9ee407be220 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ichekrygin, mimowo, sohankunkerkar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@mimowo: #11195 failed to apply on top of branch "release-0.17": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@mimowo: #11195 failed to apply on top of branch "release-0.16": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #9015
Special notes for your reviewer:
This implements the "admit first, finish old after" approach discussed by @ichekrygin and @mimowo in #9015. The old slice is now only finished inside the admission success path, so if admission fails the old slice retains its quota and the job keeps running. There is a brief window where both slices are in the cache (old not yet finished, new just admitted). This is conservative over-counting (blocks other admits, never allows over-commitment) and resolves within one API round-trip.
Does this PR introduce a user-facing change?