Finish old workload slice only after new slice is admitted by sohankunkerkar · Pull Request #11195 · kubernetes-sigs/kueue

sohankunkerkar · 2026-05-14T14:37:59Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

This implements the "admit first, finish old after" approach discussed by @ichekrygin and @mimowo in #9015. The old slice is now only finished inside the admission success path, so if admission fails the old slice retains its quota and the job keeps running. There is a brief window where both slices are in the cache (old not yet finished, new just admitted). This is conservative over-counting (blocks other admits, never allows over-commitment) and resolves within one API round-trip.

Does this PR introduce a user-facing change?

ElasticJobViaWorkloadSlices: Fix quota leak during elastic workload scale-up where old slice was finished before replacement slice was admitted.

netlify · 2026-05-14T14:38:06Z

✅ Deploy Preview for kubernetes-sigs-kueue ready!

Name	Link
🔨 Latest commit	`a50091f`
🔍 Latest deploy log	https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a05de4afc30890008c6594e
😎 Deploy Preview	https://deploy-preview-11195--kubernetes-sigs-kueue.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copilot

Pull request overview

This PR fixes an elastic workload slice scale-up edge case by ensuring the old admitted WorkloadSlice is only finished after the replacement slice successfully reserves quota, preventing a “quota leak” scenario where the Job keeps running unsuspended with no admitted slice.

Changes:

Move old-slice finishing into the scheduler admission success path (admit first, finish old after).
Preserve MultiKueue placement by copying Status.ClusterName from the old slice onto the replacement before admission.
Add/adjust tests to validate the pending-replacement behavior and the resulting event ordering.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
test/integration/singlecluster/controller/jobs/job/job_controller_test.go	Adds an integration test asserting the old slice stays admitted/unfinished while the replacement slice remains pending.
pkg/workloadslicing/workloadslicing.go	Clarifies comment wording around explicit eviction causes for finishing an old slice.
pkg/scheduler/scheduler.go	Finishes old slice only after successful admission of the replacement; copies `ClusterName` pre-admission for MultiKueue continuity.
pkg/scheduler/scheduler_test.go	Updates expected event ordering to reflect finishing the old slice after admitting the new one.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mimowo · 2026-05-14T15:26:56Z

@yaroslava-serdiuk @ichekrygin ptal

mimowo · 2026-05-14T17:46:56Z

 	if err := s.replaceWorkloadSlice(ctx, oldWorkloadSlice.WorkloadInfo.ClusterQueue, e.Obj, oldWorkloadSlice.WorkloadInfo.Obj.DeepCopy()); err != nil {
-		log.Error(err, "Failed to replace workload slice")
-		return err
+		log.Error(err, "Failed to finish old workload slice after admitting replacement; job reconciler will handle recovery")


I think this is ok, but IIUC if this fails, then we may observe double counting of the quota for the brief moment - before the JobReconciler finishes the old slices.

The quota bump may result in temporary excessive preemptions of other workloads, or blocking workloads that could be admitted otherwise.

So, I think ideally if the scheduler cache was aware of the fact that both slices are admitted at the same time, and only count capacity from the "max" of them, rather than "sum".

However, that quota bump is very rare (requires request failure), and the consequence is also limited (excessive preemptions for a brief moment), so my comment is mostly to confirm my understanding rather than to request changes.

We may re-evaluate if this is good enough or not before GA graduation.

Yup, that's right. The double-count only happens if replaceOldWorkloadSlice fails after admission succeeds i.e. the old slice stays admitted in the cache alongside the new one until the job reconciler's EnsureWorkloadSlices finishes the old slice on its next reconcile. This is conservative (over-reports usage, blocks other admits temporarily) and resolves within one reconcile cycle. I agree it's fine for now, and the cache-level max-not-sum optimization is a good candidate for GA if it shows up in practice.

We had discussed earlier that it could be helpful to "teach" the queue/cache how to exclude the unfinished old slice from quota accounting, so we don't observe an artificial quota increase in the case where updating the old slice fails.

It would be great if we could address that in this PR, but I also think it is reasonable as a follow-up.

In that case, since you now have all the context around this flow 🙂, would you mind creating a follow-up issue and linking it to this PR?

How about I document this as a GA consideration in the ElasticJobsViaWorkloadSlices KEP instead? It's a pretty narrow failure path (API write failing right after a successful one) and @mimowo was also leaning toward revisiting at GA.

+1 to updating the KEP, could you do it in this PR, please?

Added here: #11242

ichekrygin

Changes are LGTM overall, aside from two small nits, nothing blocking from my side.

Thank you for updating this!

ichekrygin · 2026-05-15T00:33:31Z

 // assuming it in the cache.
 // Note: this does not necessarily make the workload "admitted".
-func (s *Scheduler) admit(ctx context.Context, e *entry, cq *schdcache.ClusterQueueSnapshot) error {
+func (s *Scheduler) admit(ctx context.Context, e *entry, cq *schdcache.ClusterQueueSnapshot, oldWorkloadSlice *preemption.Target) error {


Nit: I understand why replaceOldWorkloadSlice needs to happen only after successful admission, but is there a reason it has to be inside admit rather than immediately after a successful admit call?

For example, we could keep admit generic and do:

if err := s.admit(ctx, e, cq); err != nil { e.inadmissibleMsg = fmt.Sprintf("Failed to admit workload: %v", err) return } if features.Enabled(features.ElasticJobsViaWorkloadSlices) && oldWorkloadSlice != nil { s.replaceOldWorkloadSlice(ctx, log, e, oldWorkloadSlice) }

That would preserve the important ordering, finish the old slice only after the replacement was successfully admitted, while avoiding exposing admit to the workload-slice replacement concept. Most workloads are not elastic slice replacements, so passing oldWorkloadSlice through admit feels a bit leaky to me.

I think admit() is async. The admissionRoutineWrapper.Run() launches a goroutine and admit() returns nil immediately before PatchAdmissionStatus completes. If we finish the old slice after admit() returns in processEntry(), we'd be finishing it before admission is actually confirmed. It's the same race we're fixing. That's why it has to live inside the goroutine's success path.

ichekrygin · 2026-05-15T00:38:14Z

+// admitted. Called inside the admit success path so the old slice is only
+// finished when the new one is confirmed. If this fails, the job reconciler's
+// EnsureWorkloadSlices detects both slices admitted and finishes the old one.
+func (s *Scheduler) replaceOldWorkloadSlice(ctx context.Context, log logr.Logger, e *entry, oldWorkloadSlice *preemption.Target) {


Nit: After the refactoring, replaceOldWorkloadSlice no longer appears to have meaningful control-flow value beyond local error reporting.

Would it make sense to make it fire-and-forget and move the error logging into the processEntry call site instead?

Something along the lines of:

if err := s.admit(ctx, e, cq); err != nil { e.inadmissibleMsg = fmt.Sprintf("Failed to admit workload: %v", err) return } if features.Enabled(features.ElasticJobsViaWorkloadSlices) && oldWorkloadSlice != nil { if err := s.replaceOldWorkloadSlice(...); err != nil { log.Error(err, "Failed to finish old workload slice after admitting replacement") } }

This would keep admit generic and avoid exposing it to workload-slice replacement semantics for the common non-elastic workload path.

Same reason as above — since this runs inside the async goroutine, there's no synchronous call site to bubble the error up to. Logging inside the function is the only option here.

yaroslava-serdiuk · 2026-05-15T11:58:18Z

+		util.ExpectWorkloadsToBeAdmitted(ctx, k8sClient, oldWorkloadSlice)
+		util.ExpectJobUnsuspendedWithNodeSelectors(ctx, k8sClient, client.ObjectKeyFromObject(elasticJob), nil)
+
+		ginkgo.By("scaling the job beyond the remaining quota")


IFIUC If the newWorkload is beyond the remaining quota the newWorkload would not reserve the quota, and so Kueue won't try to admit the newWorkload and return in line 401.
So I don't think the added changes are tested here.

The scheduler test framework can only fail all workload patches at once, not just the new slice's, so we can't simulate a selective admission failure here. The scheduler fix is structural: replaceOldWorkloadSlice only fires after PatchAdmissionStatus succeeds. This integration test covers the other side, making sure the job reconciler doesn't finish the old slice while the replacement is still pending.

This integration test covers the other side, making sure the job reconciler doesn't finish the old slice while the replacement is still pending.

I don't think it's actually tested here, because the replacement doesn't happening here since the new workload is beyond the quota.
Additionally, the test passes when run without the changes.

Thank you @yaroslava-serdiuk for spotting that. I think this test is wrong indeed. If the test is not testing the behavior change (passes before and after the change), then let's drop it.

We may consider a test which explicitly tests the invariant that OldSlice it not finished when transitioning the new slice to Admitted. This can be done with the watch pattern.

Opened the follow up: #11283

mimowo · 2026-05-18T09:06:01Z

/release-note-edit

ElasticJobViaWorkloadSlices: Fix quota leak during elastic workload scale-up where old slice was finished before replacement slice was admitted.

mimowo · 2026-05-18T09:07:28Z

/lgtm
/approve
I think this is a step in the right direction. I will also be good for ElasticJobs + TAS integration where we need to be sure we can copy the assignment from the "old" slice.

I reviewed and checked the discussions and I think we area good to merge as is.
/cherrypick release-0.17
/cherrypick release-0.16
Thank you 👍

k8s-infra-cherrypick-robot · 2026-05-18T09:07:32Z

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.16, release-0.17 in new PRs and assign them to you.

Details

In response to this:

/lgtm
/approve
I think this is a step in the right direction. I will also be good for ElasticJobs + TAS integration where we need to be sure we can copy the assignment from the "old" slice.

I reviewed and checked the discussions and I think we area good to merge as is.
/cherrypick release-0.17
/cherrypick release-0.16
Thank you 👍

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2026-05-18T09:07:38Z

LGTM label has been added.

Details

Git tree hash: c282eb8306394e05439919f061fca9ee407be220

k8s-ci-robot · 2026-05-18T09:07:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ichekrygin, mimowo, sohankunkerkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [mimowo]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-infra-cherrypick-robot · 2026-05-18T09:40:31Z

@mimowo: #11195 failed to apply on top of branch "release-0.17":

Applying: Finish old workload slice only after new slice is admitted
Using index info to reconstruct a base tree...
M	pkg/scheduler/scheduler.go
M	pkg/scheduler/scheduler_test.go
M	pkg/workloadslicing/workloadslicing.go
M	test/integration/singlecluster/controller/jobs/job/job_controller_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/integration/singlecluster/controller/jobs/job/job_controller_test.go
Auto-merging pkg/workloadslicing/workloadslicing.go
Auto-merging pkg/scheduler/scheduler_test.go
Auto-merging pkg/scheduler/scheduler.go
CONFLICT (content): Merge conflict in pkg/scheduler/scheduler.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 Finish old workload slice only after new slice is admitted

Details

In response to this:

/lgtm
/approve
I think this is a step in the right direction. I will also be good for ElasticJobs + TAS integration where we need to be sure we can copy the assignment from the "old" slice.

I reviewed and checked the discussions and I think we area good to merge as is.
/cherrypick release-0.17
/cherrypick release-0.16
Thank you 👍

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-infra-cherrypick-robot · 2026-05-18T09:41:06Z

@mimowo: #11195 failed to apply on top of branch "release-0.16":

Applying: Finish old workload slice only after new slice is admitted
Using index info to reconstruct a base tree...
M	pkg/scheduler/scheduler.go
M	pkg/scheduler/scheduler_test.go
M	pkg/workloadslicing/workloadslicing.go
M	test/integration/singlecluster/controller/jobs/job/job_controller_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/integration/singlecluster/controller/jobs/job/job_controller_test.go
Auto-merging pkg/workloadslicing/workloadslicing.go
Auto-merging pkg/scheduler/scheduler_test.go
Auto-merging pkg/scheduler/scheduler.go
CONFLICT (content): Merge conflict in pkg/scheduler/scheduler.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 Finish old workload slice only after new slice is admitted

Details

In response to this:

/lgtm
/approve
I think this is a step in the right direction. I will also be good for ElasticJobs + TAS integration where we need to be sure we can copy the assignment from the "old" slice.

I reviewed and checked the discussions and I think we area good to merge as is.
/cherrypick release-0.17
/cherrypick release-0.16
Thank you 👍

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Finish old workload slice only after new slice is admitted

a50091f

Copilot AI review requested due to automatic review settings May 14, 2026 14:38

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels May 14, 2026

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 14, 2026

k8s-ci-robot requested review from PBundyra and kannon92 May 14, 2026 14:38

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 14, 2026

Copilot started reviewing on behalf of sohankunkerkar May 14, 2026 14:38 View session

Copilot AI reviewed May 14, 2026

View reviewed changes

mimowo reviewed May 14, 2026

View reviewed changes

ichekrygin approved these changes May 15, 2026

View reviewed changes

yaroslava-serdiuk reviewed May 15, 2026

View reviewed changes

k8s-ci-robot assigned mimowo May 18, 2026

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 18, 2026

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2026

k8s-ci-robot merged commit 70894d0 into kubernetes-sigs:main May 18, 2026
43 of 44 checks passed

k8s-ci-robot added this to the v0.18 milestone May 18, 2026

mimowo mentioned this pull request May 18, 2026

Fix the integration test for the replacement contract for WorkloadSlices #11283

Closed

Conversation

sohankunkerkar commented May 14, 2026 • edited by k8s-ci-robot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Uh oh!

netlify Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for kubernetes-sigs-kueue ready!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

mimowo commented May 14, 2026

Uh oh!

mimowo May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ichekrygin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yaroslava-serdiuk May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mimowo commented May 18, 2026

Uh oh!

mimowo commented May 18, 2026

Uh oh!

k8s-infra-cherrypick-robot commented May 18, 2026

Uh oh!

k8s-ci-robot commented May 18, 2026

Uh oh!

k8s-ci-robot commented May 18, 2026

Uh oh!

Uh oh!

k8s-infra-cherrypick-robot commented May 18, 2026

Uh oh!

k8s-infra-cherrypick-robot commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

sohankunkerkar commented May 14, 2026 •

edited by k8s-ci-robot

Loading

netlify Bot commented May 14, 2026 •

edited

Loading

mimowo May 14, 2026 •

edited

Loading

yaroslava-serdiuk May 18, 2026 •

edited

Loading