
[Fix] Prevent remote wl creation with stale ClusterName #11378

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main from epam:flake/11062-mk-preemption-stuck on May 22, 2026

Conversation

@mszadkow
Contributor

@mszadkow mszadkow commented May 21, 2026

What type of PR is this?

/kind bug
/area multikueue

What this PR does / why we need it:

Fixed a race condition in the AllAtOnce MultiKueue dispatcher where a stale informer cache could cause remote workloads to be created before nomination was confirmed in etcd, leading to an infinite webhook rejection loop that prevented re-admission after worker eviction.

Which issue(s) this PR fixes:

Fixes #11062

Special notes for your reviewer:

Does this PR introduce a user-facing change?

MultiKueue: Fixed a bug in the AllAtOnce dispatcher where workloads evicted from a
worker cluster could fail to be re-admitted. Kueue now waits for the ongoing eviction to
complete before starting a new nomination and re-admission cycle.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. area/multikueue Issues or PRs related to MultiKueue labels May 21, 2026
@netlify

netlify Bot commented May 21, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit bb4cbb4
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a0ed04915b313000807ad0d

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 21, 2026
@mszadkow
Contributor Author

/cc @vladikkuzn @reruno

@k8s-ci-robot k8s-ci-robot requested review from reruno and vladikkuzn May 21, 2026 08:34
@mszadkow mszadkow force-pushed the flake/11062-mk-preemption-stuck branch from e9e5bf8 to 39cc9a5 Compare May 21, 2026 08:42
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 21, 2026
@mszadkow mszadkow force-pushed the flake/11062-mk-preemption-stuck branch from 39cc9a5 to bb4cbb4 Compare May 21, 2026 09:28
 }
-if group.local.Status.ClusterName == nil && !equality.Semantic.DeepEqual(group.local.Status.NominatedClusterNames, nominatedWorkers) {
+
+if !equality.Semantic.DeepEqual(group.local.Status.NominatedClusterNames, nominatedWorkers) {
Contributor

I have some questions:

  0. Could you describe the state transitions for the happy vs. unhappy test executions, so that we can better understand the scenario? That would also help with the follow-up questions.
  1. Is this fix only needed for the AllAtOnce dispatcher, or does the issue exist for all of them? I'm basically wondering whether the fix should be inside the "if" for the dispatcher type, or more generic, say before that, by detecting "Eviction ongoing".
  2. Is this state self-healing through follow-up requests, or does the system get stuck in a wrong state? It seems very fragile that proceeding with the patch request messes up the system state. I would expect the ongoing eviction to fix the system state by resetting status.ClusterName and status.NominatedClusterNames to null.

Contributor Author

I left the analysis in the issue - #11062 (comment)
Just copying it here too so we can talk about it.

1. Stale cache: ClusterName = "worker1", NominatedClusterNames = nil
2. Nomination guard skipped - worker2 created unconditionally
3. Worker2 admitted → syncReservingRemoteState tries SSA: ClusterName = "worker2", NominatedClusterNames = nil
4. Webhook: oldObj.NominatedClusterNames = nil → doesn't contain "worker2" → rejected forever
  1. The issue is limited to AllAtOnce, because that dispatcher is coupled with the controller.
    The controller both decides on the nomination and immediately uses it to create remote workloads.
    If a controller only updates NominatedClusterNames and does not create remote workloads in the same reconcile, then the stale ClusterName race cannot directly produce the bad side effect.

    The worst outcome would be that the nomination update is skipped or delayed because the cache still shows ClusterName != nil.
    That is what the incremental dispatcher does: it bails out when ClusterName is set, and its only status write is the nomination patch in the later nomination path.

  2. This is designed to self-heal and not get stuck.
    Eviction/reservation eventually clears ClusterName and NominatedClusterNames, and reconcile waits for the informer cache to reflect that before patching nominations or creating remote workloads.
    The risky window was the stale cache.
    If you patch while ClusterName still appears set, you can act on old state and create the wrong remotes.
    The early return prevents that until the next reconcile after the cache catches up.

For Incremental or External nomination-only controllers, ClusterName != nil is already a conservative stop condition.
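
To make the four steps and the early return concrete, here is a minimal Go sketch. It is not the Kueue code: `workloadStatus`, `validateClusterName` and `shouldDeferNomination` are hypothetical stand-ins, and the webhook rule is reduced to the single check described in step 4.

```go
package main

import "fmt"

// workloadStatus is a hypothetical, trimmed-down stand-in for the two
// Workload status fields discussed in this thread.
type workloadStatus struct {
	ClusterName           *string
	NominatedClusterNames []string
}

// validateClusterName mimics the webhook rule from step 4 (assumed shape):
// a new ClusterName is accepted only if the old object nominated it.
func validateClusterName(oldStatus, newStatus workloadStatus) error {
	if newStatus.ClusterName == nil {
		return nil
	}
	for _, name := range oldStatus.NominatedClusterNames {
		if name == *newStatus.ClusterName {
			return nil
		}
	}
	return fmt.Errorf("clusterName %q is not among nominated clusters %v",
		*newStatus.ClusterName, oldStatus.NominatedClusterNames)
}

// shouldDeferNomination models the early return: while the (possibly stale)
// cached object still shows a ClusterName, nothing is nominated or created.
func shouldDeferNomination(cached workloadStatus) bool {
	return cached.ClusterName != nil
}

func main() {
	worker1, worker2 := "worker1", "worker2"

	// Step 1: the cache still shows the old assignment and no nominations.
	stale := workloadStatus{ClusterName: &worker1}
	// Assumed state of the stored object once the eviction cleared both fields.
	stored := workloadStatus{}

	// Steps 2-4 without the guard: worker2 admits, but it was never nominated
	// in the stored object, so every retry of the update is rejected.
	fmt.Println("without guard:", validateClusterName(stored, workloadStatus{ClusterName: &worker2}))

	// With the guard, the reconcile on the stale object does nothing.
	fmt.Println("defer on stale cache:", shouldDeferNomination(stale))

	// Once the nomination is persisted first, the same update is accepted.
	stored.NominatedClusterNames = []string{worker1, worker2}
	fmt.Println("after nomination:", validateClusterName(stored, workloadStatus{ClusterName: &worker2}))
}
```

The point of the sketch is the ordering: the rejection in step 4 disappears only when the nomination reaches the stored object before any remote workload can be admitted.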

Contributor Author

That means moving AllAtOnce to a separate controller should be safer in those terms.

Contributor

I'm not sure I follow the steps or the meaning of "stale" here, but I suppose there is something to it: the AllAtOnce dispatcher may be going directly to workload selection before the eviction completes.

The eviction is triggered by setting status.admissionChecks.state=Retry, which makes the workload controller set status.clusterName=nil and status.nominatedClusterNames=nil. Only at this point is it safe to start another round of nomination.

So the fix should be ok by making sure that status.clusterName=nil before proceeding, but this is a pretty indirect verification. A more direct approach would be to check whether status.admissionChecks.state is Pending (no longer Retry), as the incremental dispatcher is doing: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/workloaddispatcher/incrementaldispatcher.go#L103-L106

Moreover, I think we should check this condition for safety independently of the dispatcher, and just defer the nomination phase.
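
A minimal sketch of that more direct, dispatcher-independent check, assuming simplified local types rather than the real Kueue API. `checkState`, `admissionCheckStatus` and `canStartNomination` are hypothetical names; only the state names Pending and Retry come from the discussion above.

```go
package main

import "fmt"

// checkState is a local stand-in for the admission check state.
type checkState string

const (
	statePending checkState = "Pending"
	stateRetry   checkState = "Retry"
)

// admissionCheckStatus is a hypothetical, reduced view of one entry in
// status.admissionChecks.
type admissionCheckStatus struct {
	Name  string
	State checkState
}

// canStartNomination reports whether a new nomination round may begin:
// only when the named check exists and is back in Pending. While it is
// still in Retry, the eviction is considered ongoing and nomination waits.
func canStartNomination(checks []admissionCheckStatus, checkName string) bool {
	for _, c := range checks {
		if c.Name == checkName {
			return c.State == statePending
		}
	}
	// Unknown check: be conservative and do not nominate.
	return false
}

func main() {
	evicting := []admissionCheckStatus{{Name: "multikueue", State: stateRetry}}
	settled := []admissionCheckStatus{{Name: "multikueue", State: statePending}}

	fmt.Println("eviction ongoing:", canStartNomination(evicting, "multikueue"))
	fmt.Println("eviction finished:", canStartNomination(settled, "multikueue"))
}
```

Placing such a check before the dispatcher-specific branch would defer nomination for every dispatcher, which is the generic variant suggested here.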

So, I think the PR is fixing the issue, but in a somewhat indirect way. I'm running the test 200 times in a loop to confirm:

  1. without the fix as "control": WIP: Experiment1 for https://github.com/kubernetes-sigs/kueue/pull/11378 #11401
  2. with the fix as "test": WIP: Experiment2 for https://github.com/kubernetes-sigs/kueue/pull/11378 #11402

I also suspect we may sometimes be getting unnecessary updates to the list of nominatedClusterNames from the AllAtOnce dispatcher due to map-based ordering of the clusters. This is probably another issue, but I'm also testing it here: #11403

I'm pretty much ok with this fix as is. It will be refactored anyway as part of #10937

Contributor

Ok, I couldn't repro the flake despite 200 attempts, but I think the analysis makes sense. Certainly the check is harmless, as we shouldn't run nomination while a clusterName is already assigned.

We could work on a more generic solution, but I think we can leave it for a follow-up.

Contributor

Opened: #11452

@mimowo
Contributor

mimowo commented May 21, 2026

cc @olekzabl ptal

@mimowo
Contributor

mimowo commented May 22, 2026

Thank you for fixing the flake, and a user-facing issue at the same time. I still believe there exists a more generic solution by just skipping nomination for all dispatchers, but let's consider this a follow up cleanup. I will open an issue.

/lgtm
/approve
/cherrypick release-0.17
/cherrypick release-0.16

@k8s-infra-cherrypick-robot
Contributor

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.16, release-0.17 in new PRs and assign them to you.


@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 22, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 19f95325e933e053c7c4f54515def611fab2c36b

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, mszadkow


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 22, 2026
@k8s-ci-robot k8s-ci-robot merged commit 7cb7df0 into kubernetes-sigs:main May 22, 2026
39 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.18 milestone May 22, 2026
@k8s-infra-cherrypick-robot
Contributor

@mimowo: new pull request created: #11461


@k8s-infra-cherrypick-robot
Contributor

@mimowo: new pull request created: #11462


@mimowo
Contributor

mimowo commented May 22, 2026

/release-note-edit

MultiKueue: Fixed a bug in the AllAtOnce dispatcher where workloads evicted from a
worker cluster could fail to be re-admitted. Kueue now waits for the ongoing eviction to
complete before starting a new nomination and re-admission cycle.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/multikueue Issues or PRs related to MultiKueue cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

MultiKueue when Preemption with a multikueue admission check Should re-do admission process when workload gets evicted in the worker
