[Fix] Prevent remote wl creation with stale ClusterName #11378
/cc @vladikkuzn @reruno
}
if group.local.Status.ClusterName == nil && !equality.Semantic.DeepEqual(group.local.Status.NominatedClusterNames, nominatedWorkers) {
if !equality.Semantic.DeepEqual(group.local.Status.NominatedClusterNames, nominatedWorkers) {
I have some questions:
0. Could you describe the state transitions for the happy vs. unhappy test executions, so that we can better understand the scenario? (That may also help with the follow-up questions.)
- Is this fix only needed for the AllAtOnce dispatcher, or does the issue exist for all of them? I'm basically wondering whether the fix should be inside the "if" for the dispatcher type, or more generic, say before that, by detecting "Eviction ongoing".
- Is this state self-healing through follow-up requests, or does the system get stuck in a wrong state? It seems very fragile that proceeding with the patch request corrupts the system state. I would expect the ongoing eviction to fix the system state by resetting status.ClusterName and status.NominatedClusterNames to null.
I left the analysis in the issue - #11062 (comment)
Just copying it here too so we can talk about it.
1. Stale cache: ClusterName = "worker1", NominatedClusterNames = nil
2. Nomination guard skipped - worker2 created unconditionally
3. Worker2 admitted → syncReservingRemoteState tries SSA: ClusterName = "worker2", NominatedClusterNames = nil
4. Webhook: oldObj.NominatedClusterNames = nil → doesn't contain "worker2" → rejected forever
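The rejection in step 4 can be illustrated with a minimal Go sketch. It assumes the webhook rule is roughly "a newly set ClusterName must appear in the old object's NominatedClusterNames", as the steps above describe. The `status` type and `validateClusterName` function are hypothetical stand-ins, not the real Kueue API or webhook code.

```go
package main

import (
	"fmt"
	"slices"
)

// status is a simplified, hypothetical stand-in for the two workload status
// fields discussed above; it is not the real Kueue API type.
type status struct {
	ClusterName           *string
	NominatedClusterNames []string
}

// validateClusterName mimics the rule from step 4 (an assumption based on the
// analysis above): a ClusterName set by the update must be one of the clusters
// nominated in the old object.
func validateClusterName(oldObj, newObj status) error {
	if newObj.ClusterName == nil {
		return nil // nothing to validate when ClusterName is being cleared or left unset
	}
	if !slices.Contains(oldObj.NominatedClusterNames, *newObj.ClusterName) {
		return fmt.Errorf("clusterName %q is not among nominated clusters %v",
			*newObj.ClusterName, oldObj.NominatedClusterNames)
	}
	return nil
}

func main() {
	worker2 := "worker2"

	// The nomination patch was skipped, so the stored object has no nominations.
	stored := status{}
	update := status{ClusterName: &worker2}
	fmt.Println("without nomination:", validateClusterName(stored, update))

	// Had the nomination been persisted first, the same update would pass.
	stored.NominatedClusterNames = []string{"worker1", "worker2"}
	fmt.Println("with nomination:", validateClusterName(stored, update))
}
```

Because the stored object never gains a nomination containing "worker2" in this scenario, every retry of the same update fails the same way, which matches the "rejected forever" outcome described above.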
- The issue is limited to AllAtOnce, because that dispatcher is coupled with the controller.
The controller both decides on the nomination and immediately uses it to create remote workloads.
If a controller only updates NominatedClusterNames and does not create remote workloads in the same reconcile, the stale ClusterName race cannot directly produce the bad side effect.
The worst outcome is that the nomination update is skipped or delayed because the cache still shows ClusterName != nil.
That is what the incremental dispatcher does: it bails out when ClusterName is set, and its only status write is the nomination patch in the later nomination path.
- This is designed to self-heal and not get stuck.
Eviction/reservation eventually clears ClusterName and NominatedClusterNames, and reconcile waits for the informer cache to reflect that before patching nominations or creating remote workloads.
The risky window was the stale cache.
If you patch while ClusterName still appears set, you can act on old state and create the wrong remotes.
The early return prevents that until the next reconcile after the cache catches up.
For Incremental or External nomination-only controllers, ClusterName != nil is already a conservative stop condition.
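The early return described above can be sketched as a small decision function. This is only an illustration under stated assumptions: `workloadStatus`, `action`, and `nextAction` are hypothetical names, and `slices.Equal` stands in for the `equality.Semantic.DeepEqual` call used in the actual code.

```go
package main

import (
	"fmt"
	"slices"
)

// workloadStatus is a hypothetical, simplified view of the cached local
// workload status; the real type lives in the Kueue API.
type workloadStatus struct {
	ClusterName           *string
	NominatedClusterNames []string
}

// action names what a reconcile pass would do next in this sketch.
type action string

const (
	waitForCache     action = "wait for cache"
	patchNominations action = "patch nominations"
	createRemotes    action = "create remote workloads"
)

// nextAction returns early while the cached ClusterName is still set, so that
// neither the nomination patch nor remote creation acts on stale state.
func nextAction(cached workloadStatus, nominated []string) action {
	if cached.ClusterName != nil {
		return waitForCache // possibly stale: eviction has not been observed yet
	}
	if !slices.Equal(cached.NominatedClusterNames, nominated) {
		return patchNominations // persist the nomination before creating remotes
	}
	return createRemotes
}

func main() {
	worker1 := "worker1"
	nominated := []string{"worker2"}

	stale := workloadStatus{ClusterName: &worker1}
	fmt.Println(nextAction(stale, nominated))

	fresh := workloadStatus{}
	fmt.Println(nextAction(fresh, nominated))

	fresh.NominatedClusterNames = nominated
	fmt.Println(nextAction(fresh, nominated))
}
```

The point of the ordering is that remote workloads are only created once the cache shows both a cleared ClusterName and the nomination that was just patched.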
That means moving AllAtOnce to a separate controller should be safer in that respect.
I'm not sure I follow the steps or the meaning of "stale" here, but I suppose there is something to it: the AllAtOnce dispatcher may be going directly to workload selection before the eviction completes.
The eviction is triggered by setting status.admissionChecks.state=Retry, which causes the workload controller to set status.clusterName=nil and status.nominatedClusterNames=nil. Only at this point is it safe to start another round of nomination.
So the fix should be ok, since it makes sure that status.clusterName=nil before proceeding, but this is a fairly indirect verification. A more direct approach would be to check whether status.admissionChecks.state is Pending (no longer Retry), as the incremental dispatcher does: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/workloaddispatcher/incrementaldispatcher.go#L103-L106
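A rough sketch of the more direct check suggested here, under the assumption that nomination should only start once the MultiKueue admission check is back in Pending. The types, constants, and the `safeToNominate` helper below are hypothetical and simplified; they do not reproduce the linked incremental dispatcher code.

```go
package main

import "fmt"

// checkState and admissionCheck are simplified, hypothetical stand-ins for the
// admission check status entries of a workload.
type checkState string

const (
	statePending checkState = "Pending"
	stateRetry   checkState = "Retry"
	stateReady   checkState = "Ready"
)

type admissionCheck struct {
	Name  string
	State checkState
}

// safeToNominate reports whether the named admission check is in Pending,
// meaning any eviction triggered through Retry has already been processed.
// A missing check is treated as "not safe" to stay conservative.
func safeToNominate(checks []admissionCheck, checkName string) bool {
	for _, c := range checks {
		if c.Name == checkName {
			return c.State == statePending
		}
	}
	return false
}

func main() {
	checks := []admissionCheck{{Name: "multikueue", State: stateRetry}}
	fmt.Println("during eviction:", safeToNominate(checks, "multikueue"))

	checks[0].State = statePending
	fmt.Println("after eviction:", safeToNominate(checks, "multikueue"))
}
```

A check of this shape does not depend on the dispatcher type, which is why it could sit in front of the nomination phase for all dispatchers.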
Moreover, I think we should check this condition for safety independently of the dispatcher, and just defer the nomination phase.
So, I think the PR fixes the issue, but in a somewhat indirect way. I'm running the test 200 times in a loop to confirm:
- without the fix as "control": WIP: Experiment1 for https://github.com/kubernetes-sigs/kueue/pull/11378 #11401
- with the fix as "test": WIP: Experiment2 for https://github.com/kubernetes-sigs/kueue/pull/11378 #11402
I also suspect that we may sometimes get unnecessary updates to the list of nominatedClusterNames from the AllAtOnce dispatcher due to the map-based ordering of the clusters. This is probably a separate issue, but I am also testing it here: #11403
I'm pretty much ok with this fix as is. It will be refactored anyway as part of #10937
Ok, I couldn't repro the flake despite 200 attempts, but I think the analysis makes sense. The check is certainly harmless, as we shouldn't run nomination while a clusterName is already assigned.
We could work on a more generic solution, but I think we can leave that for a follow-up.
cc @olekzabl ptal
Thank you for fixing the flake, and a user-facing issue at the same time. I still believe there is a more generic solution, namely skipping nomination for all dispatchers, but let's treat that as a follow-up cleanup. I will open an issue. /lgtm
LGTM label has been added. Git tree hash: 19f95325e933e053c7c4f54515def611fab2c36b
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mimowo, mszadkow
@mimowo: new pull request created: #11461
@mimowo: new pull request created: #11462
/release-note-edit
What type of PR is this?
/kind bug
/area multikueue
What this PR does / why we need it:
Fixed a race condition in the AllAtOnce MultiKueue dispatcher where a stale informer cache could cause remote workloads to be created before nomination was confirmed in etcd, leading to an infinite webhook rejection loop that prevented re-admission after worker eviction.
Which issue(s) this PR fixes:
Fixes #11062
Special notes for your reviewer:
Does this PR introduce a user-facing change?