Merged
9 changes: 8 additions & 1 deletion pkg/controller/admissionchecks/multikueue/workload.go
@@ -763,7 +763,14 @@ func (w *wlReconciler) nominateAndSynchronizeWorkers(ctx context.Context, group
 	for workerName := range group.remotes {
 		nominatedWorkers = append(nominatedWorkers, workerName)
 	}
-	if group.local.Status.ClusterName == nil && !equality.Semantic.DeepEqual(group.local.Status.NominatedClusterNames, nominatedWorkers) {
+
+	if !equality.Semantic.DeepEqual(group.local.Status.NominatedClusterNames, nominatedWorkers) {
Contributor


I have some questions:

0. Could you describe the state transitions for the happy vs. unhappy test executions, so that we can better understand the scenario? That would also help with the follow-up questions.
1. Is this fix needed only for the AllAtOnce dispatcher, or does the issue exist for all of them? I'm wondering whether the fix should sit inside the "if" for the dispatcher type or be more generic, say by detecting "eviction ongoing" before that check.
2. Is this state self-healing through follow-up requests, or does the system get stuck in a wrong state? It seems very fragile that proceeding with the patch request corrupts the system state. I would expect the ongoing eviction to fix the state by resetting status.ClusterName and status.NominatedClusterNames to null.

Contributor Author


I left the analysis in the issue - #11062 (comment)
Just copying it here too so we can talk about it.

1. Stale cache: ClusterName = "worker1", NominatedClusterNames = nil
2. Nomination guard skipped - worker2 created unconditionally
3. Worker2 admitted → syncReservingRemoteState tries SSA: ClusterName = "worker2", NominatedClusterNames = nil
4. Webhook: oldObj.NominatedClusterNames = nil → doesn't contain "worker2" → rejected forever
On question 1: the issue is limited to AllAtOnce because that dispatcher is coupled with the controller. The controller both decides on the nomination and immediately uses it to create remote workloads. If a controller only updates NominatedClusterNames and does not create remote workloads in the same reconcile, the stale ClusterName race cannot directly produce the bad side effect.

The worst outcome would be that the nomination update is skipped or delayed because the cache still shows ClusterName != nil. That is what the incremental dispatcher does: it bails out when ClusterName is set, and its only status write is the nomination patch in the later nomination path.

On question 2: this is designed to self-heal rather than get stuck. Eviction/reservation eventually clears ClusterName and NominatedClusterNames, and the reconcile waits for the informer cache to reflect that before patching nominations or creating remote workloads. The risky window was the stale cache: if you patch while ClusterName still appears set, you can act on old state and create the wrong remotes. The early return prevents that until the next reconcile after the cache catches up.

For Incremental or External nomination-only controllers, ClusterName != nil is already a conservative stop condition.
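
To make the four steps and the early return concrete, here is a minimal, self-contained Go sketch. It is not Kueue code: `localStatus`, `validateClusterName`, and `shouldDeferNomination` are hypothetical names, and the webhook rule is modeled only as far as step 4 describes it (the new ClusterName must already appear in the old object's NominatedClusterNames).

```go
package main

import (
	"fmt"
	"slices"
)

// localStatus is a hypothetical, trimmed-down stand-in for the two
// workload status fields discussed in this thread.
type localStatus struct {
	ClusterName           *string
	NominatedClusterNames []string
}

// validateClusterName models the webhook rule from step 4 (an assumption
// based on this thread, not the real webhook): a newly set ClusterName must
// be one of the clusters nominated in the old object.
func validateClusterName(oldStatus, newStatus localStatus) error {
	if newStatus.ClusterName == nil {
		return nil
	}
	if !slices.Contains(oldStatus.NominatedClusterNames, *newStatus.ClusterName) {
		return fmt.Errorf("clusterName %q is not among the nominated clusters %v",
			*newStatus.ClusterName, oldStatus.NominatedClusterNames)
	}
	return nil
}

// shouldDeferNomination mirrors the idea of the early return: if the cached
// object still shows a ClusterName, wait for a later reconcile instead of
// nominating or creating remote workloads.
func shouldDeferNomination(cached localStatus) bool {
	return cached.ClusterName != nil
}

func main() {
	worker1, worker2 := "worker1", "worker2"

	// Step 1: the informer cache still shows the old assignment.
	stale := localStatus{ClusterName: &worker1}
	// Server-side state after eviction: both fields cleared, and no
	// nomination was ever persisted.
	server := localStatus{}
	// Step 3: the status the controller tries to apply once worker2 admits.
	attempted := localStatus{ClusterName: &worker2}

	fmt.Println("defer nomination:", shouldDeferNomination(stale))
	fmt.Println("webhook result:", validateClusterName(server, attempted))
}
```

With the guard in place the reconcile returns at step 1, so the rejected update in step 4 is never attempted.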

Contributor Author


That means moving AllAtOnce to a separate controller should be safer in that respect.

Contributor


I'm not sure I follow the steps or the meaning of "stale" here, but I suppose there is something to it: the AllAtOnce dispatcher may be going directly for workload selection before the eviction completes.

The eviction is triggered by setting status.admissionChecks.state=Retry, which makes the workload controller set status.clusterName=nil and status.nominatedClusterNames=nil. Only at that point is it safe to start another round of nomination.

So the fix should be OK, since it makes sure that status.clusterName=nil before proceeding, but this is a fairly indirect verification. A more direct approach would be to check whether status.admissionChecks.state is Pending (no longer Retry), as the incremental dispatcher does: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/workloaddispatcher/incrementaldispatcher.go#L103-L106

Moreover, I think we should check this condition for safety independently of the dispatcher, and just defer the nomination phase.
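
As a rough illustration of the more direct check suggested here, the sketch below gates nomination on the admission check state. Everything in it is hypothetical (`checkState`, `workloadView`, `canStartNomination`); it does not reproduce the incremental dispatcher's code, and combining the state check with the ClusterName check is an assumption made for the example.

```go
package main

import "fmt"

// checkState is a hypothetical stand-in for the admission check states
// named in this thread.
type checkState string

const (
	statePending checkState = "Pending"
	stateRetry   checkState = "Retry"
)

// workloadView is a hypothetical, flattened view of the fields the guard
// would need to read from the local workload.
type workloadView struct {
	CheckState  checkState
	ClusterName *string
}

// canStartNomination reports whether a new nomination round may begin.
// Assumption: nomination waits until the check is back to Pending (the
// eviction triggered by Retry has finished) and no cluster is assigned.
func canStartNomination(wl workloadView) bool {
	return wl.CheckState == statePending && wl.ClusterName == nil
}

func main() {
	assigned := "worker1"
	fmt.Println(canStartNomination(workloadView{CheckState: stateRetry, ClusterName: &assigned}))
	fmt.Println(canStartNomination(workloadView{CheckState: statePending}))
}
```

A guard of this shape does not depend on the dispatcher mode, so it could run before the dispatcher-specific branch.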

So I think the PR fixes the issue, though in a somewhat indirect way. I'm running the test 200 times in a loop to confirm:

  1. without the fix as "control": WIP: Experiment1 for https://github.com/kubernetes-sigs/kueue/pull/11378 #11401
  2. with the fix as "test": WIP: Experiment2 for https://github.com/kubernetes-sigs/kueue/pull/11378 #11402

I also suspect that we sometimes get unnecessary updates to the list of nominatedClusterNames from the AllAtOnce dispatcher due to the map-based ordering of the clusters. This is probably a separate issue, but I'm also testing it here: #11403
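
The ordering concern comes from Go's unspecified map iteration order: a slice built by ranging over a map can differ between reconciles even when the set of clusters is unchanged, and an order-sensitive slice comparison then reports a change. One possible way to avoid that, sketched here with a hypothetical helper `nominatedFromRemotes` (not necessarily what #11403 does), is to sort the names before comparing:

```go
package main

import (
	"fmt"
	"maps"
	"slices"
)

// nominatedFromRemotes is a hypothetical helper that returns the map keys
// in sorted order, so repeated calls on the same set of clusters always
// produce the same slice.
func nominatedFromRemotes[V any](remotes map[string]V) []string {
	return slices.Sorted(maps.Keys(remotes))
}

func main() {
	remotes := map[string]struct{}{"worker2": {}, "worker1": {}, "worker3": {}}
	current := []string{"worker1", "worker2", "worker3"}

	next := nominatedFromRemotes(remotes)
	// An order-sensitive comparison now sees no difference, so no patch
	// would be needed.
	fmt.Println(next, slices.Equal(current, next)) // [worker1 worker2 worker3] true
}
```

This sketch uses `slices.Sorted` and the iterator form of `maps.Keys`, which require Go 1.23 or newer.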

I'm pretty much ok with this fix as is. It will be refactored anyway as part of #10937

Contributor


OK, I couldn't repro the flake despite 200 attempts, but I think the analysis makes sense. The check is certainly harmless, as we shouldn't run nomination while a clusterName is already assigned.

We could work on a more generic solution, but I think we can leave that for a follow-up.

Contributor


Opened: #11452

+		// ClusterName != nil indicates possibly stale cache (eviction just cleared ClusterName
+		// but the informer hasn't caught up yet). Avoid creating remote workloads without a
+		// confirmed nomination — wait for the cache to sync.
+		if group.local.Status.ClusterName != nil {
+			return reconcile.Result{}, nil
+		}
 		if err := workload.PatchAdmissionStatus(ctx, w.client, group.local, w.clock, func(wl *kueue.Workload) (bool, error) {
 			wl.Status.NominatedClusterNames = nominatedWorkers
 			return true, nil
9 changes: 9 additions & 0 deletions pkg/controller/admissionchecks/multikueue/workload_test.go
@@ -1721,6 +1721,7 @@ func TestNominateAndSynchronizeWorkers_MoreCases(t *testing.T) {
 	dispatcherMode string
 	remotes map[string]*kueue.Workload
 	nominatedWorkers []string
+	localClusterName *string
 	cond *metav1.Condition
 	createErr error
 	wantCreated []string
@@ -1738,6 +1739,13 @@ func TestNominateAndSynchronizeWorkers_MoreCases(t *testing.T) {
 		remotes: map[string]*kueue.Workload{remoteNames[0]: {}, remoteNames[1]: {}},
 		wantCreated: nil,
 	},
+	{
+		name: "AllClusters: stale cache ClusterName set, nominations not confirmed — no remote workloads created",
+		dispatcherMode: config.MultiKueueDispatcherModeAllAtOnce,
+		remotes: map[string]*kueue.Workload{remoteNames[0]: nil, remoteNames[1]: nil},
+		localClusterName: new(remoteNames[0]),
+		wantCreated: nil,
+	},
 	// Incremental dispatcher tests were moved to a separate file.
 	{
 		name: "External controller: no nominated workers, nothing created",
@@ -1770,6 +1778,7 @@ func TestNominateAndSynchronizeWorkers_MoreCases(t *testing.T) {
 	Status: kueue.WorkloadStatus{
 		Conditions: make([]metav1.Condition, 0, 1),
 		NominatedClusterNames: tt.nominatedWorkers,
+		ClusterName: tt.localClusterName,
 	},
 }
