Finish old workload slice only after new slice is admitted #11195

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main from sohankunkerkar:fix-elastic-slice-replacement-quota-gap on May 18, 2026
Conversation

@sohankunkerkar
Member

@sohankunkerkar sohankunkerkar commented May 14, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #9015

Special notes for your reviewer:

This implements the "admit first, finish old after" approach discussed by @ichekrygin and @mimowo in #9015. The old slice is now only finished inside the admission success path, so if admission fails the old slice retains its quota and the job keeps running. There is a brief window where both slices are in the cache (old not yet finished, new just admitted). This is conservative over-counting (blocks other admits, never allows over-commitment) and resolves within one API round-trip.
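The ordering described above can be sketched in isolation. This is a minimal illustration, not the scheduler code: the `slice` type, `replaceSlice`, and the `admitFn` hook are hypothetical names introduced only for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// slice is a hypothetical stand-in for a workload slice; it is not a Kueue type.
type slice struct {
	name     string
	admitted bool
	finished bool
}

// errQuota simulates an admission failure in this example.
var errQuota = errors.New("quota exceeded")

// replaceSlice admits newS first and finishes oldS only if that succeeds.
// admitFn is a hypothetical hook standing in for the admission API call.
func replaceSlice(oldS, newS *slice, admitFn func(*slice) error) error {
	if err := admitFn(newS); err != nil {
		// Admission failed: the old slice is untouched and keeps its quota.
		return fmt.Errorf("admitting %s: %w", newS.name, err)
	}
	newS.admitted = true
	// Only now is the old slice released.
	oldS.finished = true
	return nil
}

func main() {
	oldS := &slice{name: "old", admitted: true}
	newS := &slice{name: "new"}

	// Failed admission: the old slice must remain unfinished.
	err := replaceSlice(oldS, newS, func(*slice) error { return errQuota })
	fmt.Println("error:", err, "old finished:", oldS.finished)

	// Successful admission: the old slice is finished afterwards.
	err = replaceSlice(oldS, newS, func(*slice) error { return nil })
	fmt.Println("error:", err, "old finished:", oldS.finished)
}
```

The point of the sketch is that the failure branch returns before the old slice is touched, so a failed admission can never leave the job without an admitted slice.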

Does this PR introduce a user-facing change?

ElasticJobsViaWorkloadSlices: Fix quota leak during elastic workload scale-up where old slice was finished before replacement slice was admitted.

Copilot AI review requested due to automatic review settings May 14, 2026 14:38
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels May 14, 2026
@netlify

netlify Bot commented May 14, 2026

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit a50091f
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a05de4afc30890008c6594e
😎 Deploy Preview https://deploy-preview-11195--kubernetes-sigs-kueue.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 14, 2026
@k8s-ci-robot k8s-ci-robot requested review from PBundyra and kannon92 May 14, 2026 14:38
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 14, 2026

Copilot AI left a comment

Pull request overview

This PR fixes an elastic workload slice scale-up edge case by ensuring the old admitted WorkloadSlice is only finished after the replacement slice successfully reserves quota, preventing a “quota leak” scenario where the Job keeps running unsuspended with no admitted slice.

Changes:

  • Move old-slice finishing into the scheduler admission success path (admit first, finish old after).
  • Preserve MultiKueue placement by copying Status.ClusterName from the old slice onto the replacement before admission.
  • Add/adjust tests to validate the pending-replacement behavior and the resulting event ordering.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| test/integration/singlecluster/controller/jobs/job/job_controller_test.go | Adds an integration test asserting the old slice stays admitted/unfinished while the replacement slice remains pending. |
| pkg/workloadslicing/workloadslicing.go | Clarifies comment wording around explicit eviction causes for finishing an old slice. |
| pkg/scheduler/scheduler.go | Finishes old slice only after successful admission of the replacement; copies ClusterName pre-admission for MultiKueue continuity. |
| pkg/scheduler/scheduler_test.go | Updates expected event ordering to reflect finishing the old slice after admitting the new one. |

@mimowo
Contributor

mimowo commented May 14, 2026

@yaroslava-serdiuk @ichekrygin ptal

 if err := s.replaceWorkloadSlice(ctx, oldWorkloadSlice.WorkloadInfo.ClusterQueue, e.Obj, oldWorkloadSlice.WorkloadInfo.Obj.DeepCopy()); err != nil {
-    log.Error(err, "Failed to replace workload slice")
-    return err
+    log.Error(err, "Failed to finish old workload slice after admitting replacement; job reconciler will handle recovery")
Contributor

@mimowo mimowo May 14, 2026

I think this is OK, but IIUC, if this fails we may observe double counting of the quota for a brief moment, until the JobReconciler finishes the old slice.

The quota bump may result in temporary excessive preemptions of other workloads, or in blocking workloads that could otherwise be admitted.

So, I think ideally the scheduler cache would be aware that both slices are admitted at the same time, and would count only the "max" of their capacity rather than the "sum".

However, that quota bump is very rare (requires request failure), and the consequence is also limited (excessive preemptions for a brief moment), so my comment is mostly to confirm my understanding rather than to request changes.

We may re-evaluate if this is good enough or not before GA graduation.
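To make the "max" versus "sum" distinction concrete, here is a minimal sketch. The `sliceUsage` type, the function names, and the millicore figures are illustrative assumptions; they do not correspond to types in Kueue's scheduler cache.

```go
package main

import "fmt"

// sliceUsage is a hypothetical record of the CPU (in millicores) requested
// by the old and the replacement slice of the same job.
type sliceUsage struct {
	oldSlice int64
	newSlice int64
}

// sumUsage counts both slices, which is what happens while both are
// admitted in the cache at the same time.
func sumUsage(u sliceUsage) int64 {
	return u.oldSlice + u.newSlice
}

// maxUsage counts only the larger slice, on the assumption that the
// replacement supersedes the old slice rather than running next to it.
func maxUsage(u sliceUsage) int64 {
	if u.oldSlice > u.newSlice {
		return u.oldSlice
	}
	return u.newSlice
}

func main() {
	// Example: a job scaled up from 2 CPUs to 3 CPUs.
	u := sliceUsage{oldSlice: 2000, newSlice: 3000}
	fmt.Println("sum accounting:", sumUsage(u), "m")
	fmt.Println("max accounting:", maxUsage(u), "m")
}
```

With these numbers, sum accounting reports 5000m while max accounting reports 3000m; the 2000m difference is the temporary over-count discussed in this thread.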

Member Author

Yup, that's right. The double-count only happens if replaceOldWorkloadSlice fails after admission succeeds, i.e. the old slice stays admitted in the cache alongside the new one until the job reconciler's EnsureWorkloadSlices finishes the old slice on its next reconcile. This is conservative (over-reports usage, blocks other admits temporarily) and resolves within one reconcile cycle. I agree it's fine for now, and the cache-level max-not-sum optimization is a good candidate for GA if it shows up in practice.
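The recovery step mentioned here can be pictured roughly as follows. This is only a sketch of the idea, not the actual EnsureWorkloadSlices logic: the `slice` type and `ensureSlices` function are hypothetical.

```go
package main

import "fmt"

// slice is a hypothetical stand-in for a workload slice; it is not a Kueue type.
type slice struct {
	name     string
	admitted bool
	finished bool
}

// ensureSlices is a hypothetical reconcile step. If both slices are admitted
// and the old one is still unfinished, it finishes the old one and reports
// that it made a change. Otherwise it does nothing.
func ensureSlices(oldS, newS *slice) bool {
	if oldS.admitted && newS.admitted && !oldS.finished {
		oldS.finished = true
		return true
	}
	return false
}

func main() {
	// State left behind when finishing the old slice failed after admission.
	oldS := &slice{name: "old", admitted: true}
	newS := &slice{name: "new", admitted: true}

	fmt.Println("first reconcile changed state:", ensureSlices(oldS, newS))
	// A second pass is a no-op, so the step is safe to repeat.
	fmt.Println("second reconcile changed state:", ensureSlices(oldS, newS))
}
```

Because the step is idempotent, it does not matter whether the scheduler or the reconciler ends up finishing the old slice.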

Contributor

We had discussed earlier that it could be helpful to "teach" the queue/cache how to exclude the unfinished old slice from quota accounting, so we don't observe an artificial quota increase in the case where updating the old slice fails.

It would be great if we could address that in this PR, but I also think it is reasonable as a follow-up.

In that case, since you now have all the context around this flow 🙂, would you mind creating a follow-up issue and linking it to this PR?

Member Author

How about I document this as a GA consideration in the ElasticJobsViaWorkloadSlices KEP instead? It's a pretty narrow failure path (API write failing right after a successful one) and @mimowo was also leaning toward revisiting at GA.

Contributor

+1 to updating the KEP, could you do it in this PR, please?

Member Author

Added here: #11242

Contributor

@ichekrygin ichekrygin left a comment

Changes are LGTM overall, aside from two small nits, nothing blocking from my side.

Thank you for updating this!

 // assuming it in the cache.
 // Note: this does not necessarily make the workload "admitted".
-func (s *Scheduler) admit(ctx context.Context, e *entry, cq *schdcache.ClusterQueueSnapshot) error {
+func (s *Scheduler) admit(ctx context.Context, e *entry, cq *schdcache.ClusterQueueSnapshot, oldWorkloadSlice *preemption.Target) error {
Contributor

Nit: I understand why replaceOldWorkloadSlice needs to happen only after successful admission, but is there a reason it has to be inside admit rather than immediately after a successful admit call?

For example, we could keep admit generic and do:

if err := s.admit(ctx, e, cq); err != nil {
    e.inadmissibleMsg = fmt.Sprintf("Failed to admit workload: %v", err)
    return
}

if features.Enabled(features.ElasticJobsViaWorkloadSlices) && oldWorkloadSlice != nil {
    s.replaceOldWorkloadSlice(ctx, log, e, oldWorkloadSlice)
}

That would preserve the important ordering, finish the old slice only after the replacement was successfully admitted, while avoiding exposing admit to the workload-slice replacement concept. Most workloads are not elastic slice replacements, so passing oldWorkloadSlice through admit feels a bit leaky to me.

Member Author

I think admit() is async. The admissionRoutineWrapper.Run() launches a goroutine and admit() returns nil immediately before PatchAdmissionStatus completes. If we finish the old slice after admit() returns in processEntry(), we'd be finishing it before admission is actually confirmed. It's the same race we're fixing. That's why it has to live inside the goroutine's success path.
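A stripped-down illustration of why the call has to sit inside the goroutine is sketched below. The `scheduler` type, its `admit` signature, and the `patch` and `onSuccess` parameters are hypothetical simplifications for this example, not Kueue's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errPatch simulates a failed status patch in this example.
var errPatch = errors.New("patch failed")

// scheduler is a hypothetical, simplified stand-in; it is not Kueue's Scheduler.
type scheduler struct {
	wg  sync.WaitGroup
	mu  sync.Mutex
	log []string
}

// record appends an event to the log under a lock.
func (s *scheduler) record(ev string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.log = append(s.log, ev)
}

// admit starts the status patch in a goroutine and returns nil right away,
// so a nil return says nothing about whether the patch succeeded.
// onSuccess runs inside the goroutine, only after the patch is confirmed.
func (s *scheduler) admit(patch func() error, onSuccess func()) error {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		if err := patch(); err != nil {
			s.record("admission failed")
			return
		}
		s.record("new slice admitted")
		onSuccess()
	}()
	return nil
}

func main() {
	finishOld := func(s *scheduler) func() {
		return func() { s.record("old slice finished") }
	}

	ok := &scheduler{}
	_ = ok.admit(func() error { return nil }, finishOld(ok))
	ok.wg.Wait()
	fmt.Println(ok.log)

	failed := &scheduler{}
	_ = failed.admit(func() error { return errPatch }, finishOld(failed))
	failed.wg.Wait()
	fmt.Println(failed.log)
}
```

In both runs `admit` returns nil, so code placed after the call cannot tell the two cases apart. Only the callback inside the goroutine is guaranteed to run after a confirmed patch.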

// admitted. Called inside the admit success path so the old slice is only
// finished when the new one is confirmed. If this fails, the job reconciler's
// EnsureWorkloadSlices detects both slices admitted and finishes the old one.
func (s *Scheduler) replaceOldWorkloadSlice(ctx context.Context, log logr.Logger, e *entry, oldWorkloadSlice *preemption.Target) {
Contributor

Nit: After the refactoring, replaceOldWorkloadSlice no longer appears to have meaningful control-flow value beyond local error reporting.

Would it make sense to make it fire-and-forget and move the error logging into the processEntry call site instead?

Something along the lines of:

if err := s.admit(ctx, e, cq); err != nil {
    e.inadmissibleMsg = fmt.Sprintf("Failed to admit workload: %v", err)
    return
}

if features.Enabled(features.ElasticJobsViaWorkloadSlices) && oldWorkloadSlice != nil {
    if err := s.replaceOldWorkloadSlice(...); err != nil {
        log.Error(err, "Failed to finish old workload slice after admitting replacement")
    }
}

This would keep admit generic and avoid exposing it to workload-slice replacement semantics for the common non-elastic workload path.

Member Author

Same reason as above — since this runs inside the async goroutine, there's no synchronous call site to bubble the error up to. Logging inside the function is the only option here.

util.ExpectWorkloadsToBeAdmitted(ctx, k8sClient, oldWorkloadSlice)
util.ExpectJobUnsuspendedWithNodeSelectors(ctx, k8sClient, client.ObjectKeyFromObject(elasticJob), nil)

ginkgo.By("scaling the job beyond the remaining quota")
Contributor

IIUC, if the newWorkload is beyond the remaining quota, it would not reserve the quota, so Kueue won't try to admit it and will return at line 401.
So I don't think the added changes are tested here.

Member Author

The scheduler test framework can only fail all workload patches at once, not just the new slice's, so we can't simulate a selective admission failure here. The scheduler fix is structural: replaceOldWorkloadSlice only fires after PatchAdmissionStatus succeeds. This integration test covers the other side, making sure the job reconciler doesn't finish the old slice while the replacement is still pending.

Contributor

@yaroslava-serdiuk yaroslava-serdiuk May 18, 2026

> This integration test covers the other side, making sure the job reconciler doesn't finish the old slice while the replacement is still pending.

I don't think it's actually tested here, because the replacement doesn't happen here since the new workload is beyond the quota.
Additionally, the test passes when run without the changes.

Contributor

Thank you @yaroslava-serdiuk for spotting that. I think this test is wrong indeed. If the test is not testing the behavior change (passes before and after the change), then let's drop it.

We may consider a test which explicitly tests the invariant that the old slice is not finished when transitioning the new slice to Admitted. This can be done with the watch pattern.
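One way to picture such an invariant check is sketched below. It assumes the test has already collected an ordered list of observed state changes (for example from a watch on the workloads); the `event` type and `violatesOrdering` function are hypothetical and the event lists are hand-written for illustration.

```go
package main

import "fmt"

// event is a hypothetical record of one observed state change.
type event struct {
	slice string // "old" or "new"
	kind  string // "Admitted" or "Finished"
}

// violatesOrdering reports whether the old slice was observed as Finished
// before the new slice was observed as Admitted.
func violatesOrdering(events []event) bool {
	newAdmitted := false
	for _, ev := range events {
		switch {
		case ev.slice == "new" && ev.kind == "Admitted":
			newAdmitted = true
		case ev.slice == "old" && ev.kind == "Finished" && !newAdmitted:
			return true
		}
	}
	return false
}

func main() {
	// Ordering expected after this change.
	good := []event{{"old", "Admitted"}, {"new", "Admitted"}, {"old", "Finished"}}
	// Ordering that produced the original bug.
	bad := []event{{"old", "Admitted"}, {"old", "Finished"}, {"new", "Admitted"}}

	fmt.Println("good sequence violates invariant:", violatesOrdering(good))
	fmt.Println("bad sequence violates invariant:", violatesOrdering(bad))
}
```

A check of this shape would fail on the pre-fix ordering and pass on the post-fix ordering, which is the property the dropped integration test did not have.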

Contributor

Opened the follow up: #11283

@mimowo
Contributor

mimowo commented May 18, 2026

/release-note-edit

ElasticJobsViaWorkloadSlices: Fix quota leak during elastic workload scale-up where old slice was finished before replacement slice was admitted.

@mimowo
Contributor

mimowo commented May 18, 2026

/lgtm
/approve
I think this is a step in the right direction. It will also be good for ElasticJobs + TAS integration where we need to be sure we can copy the assignment from the "old" slice.

I reviewed and checked the discussions and I think we are good to merge as is.
/cherrypick release-0.17
/cherrypick release-0.16
Thank you 👍

@k8s-infra-cherrypick-robot
Contributor

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.16, release-0.17 in new PRs and assign them to you.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 18, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Details
Git tree hash: c282eb8306394e05439919f061fca9ee407be220

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ichekrygin, mimowo, sohankunkerkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2026
@k8s-ci-robot k8s-ci-robot merged commit 70894d0 into kubernetes-sigs:main May 18, 2026
43 of 44 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.18 milestone May 18, 2026
@k8s-infra-cherrypick-robot
Contributor

@mimowo: #11195 failed to apply on top of branch "release-0.17":

Applying: Finish old workload slice only after new slice is admitted
Using index info to reconstruct a base tree...
M	pkg/scheduler/scheduler.go
M	pkg/scheduler/scheduler_test.go
M	pkg/workloadslicing/workloadslicing.go
M	test/integration/singlecluster/controller/jobs/job/job_controller_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/integration/singlecluster/controller/jobs/job/job_controller_test.go
Auto-merging pkg/workloadslicing/workloadslicing.go
Auto-merging pkg/scheduler/scheduler_test.go
Auto-merging pkg/scheduler/scheduler.go
CONFLICT (content): Merge conflict in pkg/scheduler/scheduler.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 Finish old workload slice only after new slice is admitted

@k8s-infra-cherrypick-robot
Contributor

@mimowo: #11195 failed to apply on top of branch "release-0.16":

Applying: Finish old workload slice only after new slice is admitted
Using index info to reconstruct a base tree...
M	pkg/scheduler/scheduler.go
M	pkg/scheduler/scheduler_test.go
M	pkg/workloadslicing/workloadslicing.go
M	test/integration/singlecluster/controller/jobs/job/job_controller_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/integration/singlecluster/controller/jobs/job/job_controller_test.go
Auto-merging pkg/workloadslicing/workloadslicing.go
Auto-merging pkg/scheduler/scheduler_test.go
Auto-merging pkg/scheduler/scheduler.go
CONFLICT (content): Merge conflict in pkg/scheduler/scheduler.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 Finish old workload slice only after new slice is admitted

Development

Successfully merging this pull request may close these issues.

Elastic Workloads: During scale-up, old WorkloadSlice is finished but new WorkloadSlice is not admitted, leaving Job active and unsuspended

7 participants