Move AllAtOnce MultiKueue dispatcher to a dedicated controller by andrewseif · Pull Request #10937 · kubernetes-sigs/kueue

andrewseif · 2026-05-04T22:17:47Z

What type of PR is this?

/kind cleanup
/area multikueue

What this PR does / why we need it:

Move AllAtOnce MultiKueue dispatcher to a dedicated controller

Which issue(s) this PR fixes:

Fixes #6803

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

netlify · 2026-05-04T22:17:53Z

✅ Deploy Preview for kubernetes-sigs-kueue ready!

Name	Link
🔨 Latest commit	`5ea862e`
🔍 Latest deploy log	https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a0ee9c7c210940008b07a7c
😎 Deploy Preview	https://deploy-preview-10937--kubernetes-sigs-kueue.netlify.app

To edit notification comments on pull requests, go to your Netlify project configuration.

k8s-ci-robot · 2026-05-04T22:17:58Z

Hi @andrewseif. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2026-05-04T22:17:58Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andrewseif
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tenzen-y · 2026-05-05T05:40:07Z

/ok-to-test

mimowo

Thank you for the effort 👍

andrewseif · 2026-05-05T10:40:29Z

I had to add some logic to watch the config and clusters, since the inline version got these for free, but the logic itself is mostly the same.

andrewseif · 2026-05-06T23:47:55Z

/retest

mimowo · 2026-05-07T19:14:08Z

cc @Singularity23x0 @olekzabl @kshalot ptal

mimowo · 2026-05-07T19:16:09Z

+			if workload.IsEvicted(remoteWl) {
+				log.V(3).Info("Preserving evicted remote workload to allow eviction-recovery sync", "remote", rem)
+				continue
+			}


Interesting. Does this mean we have a bug in the Incremental dispatcher, which is already extracted? If so, maybe we could start by demonstrating the bug and fixing it here. Splitting it out would also make the PRs more digestible.

I wouldn't say it was a bug, but there was a race window that became apparent when I extracted the AllAtOnce MK dispatcher. I'll make sure to post the findings in wg-batch so that more experienced eyes can verify them.

I have trouble following the issue description in the comment here.

Mostly because, in the handling of Evicted condition which you mentioned (I suppose, here), the only explicit call to SyncJob is here, i.e. in the case of a manager-originating eviction, while your case seems to be the other one (the worker-originating eviction, dealt with here), given that you care whether the manager will notice.

This might be not yet contradictory; maybe you've found a longer path (a ping-pong across a few reconcilers?) leading to calling SyncJob also in the worker-originating case?

But anyway, my bottom line is:

+1 to documenting this as a separate issue #N

and then, instead of summarizing that issue in a comment here, I'd just leave a link to #N, because:

even a several-lines summary can be hard to follow (as I'm right now experiencing)

an even longer summary would not feel fitting in this place

#N will act as a place where we can further discuss (while such comments are more "frozen")

I have sent you my investigation findings; maybe you can verify them. I believe moving the AllAtOnce dispatcher created this, so the fix should be part of this PR, as without it the program won't function properly.

I'm wondering if this is related to the problem that #11378 also attempts to fix.

cc @mszadkow wdyt?

I have looked at what @mszadkow did. It's in the same test suite, and I think both changes target the same race condition, which was showing up in two different tests.

edit: I think it should solve #11115 as well?

I wouldn't say it was a bug, but there was a race window that became apparent when I extracted the AllAtOnce MK dispatcher. I'll make sure to post the findings in wg-batch so that more experienced eyes can verify them.

@andrewseif actually most "races" are bugs, so I would like to understand which test is failing and how. It may very well be that you have discovered a bug that we should extract into a separate preparatory bugfix PR and cherry-pick.

To help us understand the race, I would recommend that you temporarily revert (or comment out) the code so that we can see the failure and analyze it. Then we can make an informed decision on whether this is a separate bugfix or part of this PR.

mimowo

cc @mszadkow, who previously worked on extracting / adding the incremental dispatcher. Ptal

mimowo · 2026-05-08T16:23:05Z

@andrewseif I'm planning to conclude the review next week. Overall it looks great, but another pair of eyes from MK experts (@olekzabl or @kshalot) would be great. Thank you for the effort once again 👍

olekzabl

First of all, thank you for doing this!

Then, even before going into detailed comments, I feel I should start with a "provocative" question:
The incremental dispatcher is currently being parametrized with step size (#10877).
Given that, what if we "implemented" the "externalized" AllAtOnce just as a special case of that?
standaloneAllAtOnce := incrementalDispatcher{stepSize: 1000000}

You could say that's too simplistic, wasting performance etc., because the incremental dispatcher contains some bits of logic that we don't need. Perhaps.

But even if so, I'd like to ask how much of the logic the two can have in common. Maybe by extracting some shared pieces. Or maybe via a central shared entry point with injectable per-case callbacks. IDK yet.

I'm just intuitively afraid of nearly-duplicating ~200 lines of code which may then diverge without a good reason. (And, in my eyes, this "unjustified divergence" shows up already in this PR. See my detailed comments).

I haven't yet read everything but must pause now. Will come back later.
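
To make the "special case" idea concrete, here is a minimal sketch. It is not the Kueue implementation: `nextNominations`, its signature, and the name-ordered selection are all assumptions made for illustration. It only shows that a step size larger than the number of clusters degenerates into AllAtOnce.

```go
package main

import (
	"fmt"
	"sort"
)

// nextNominations is a hypothetical helper (not a Kueue function). Given all
// candidate worker clusters and the set already nominated, it returns the
// nominated set after one more round, adding at most stepSize new clusters
// in name order.
func nextNominations(candidates []string, nominated map[string]bool, stepSize int) []string {
	sorted := append([]string(nil), candidates...)
	sort.Strings(sorted)

	result := make([]string, 0, len(sorted))
	added := 0
	for _, c := range sorted {
		switch {
		case nominated[c]:
			// Already nominated in an earlier round: keep it.
			result = append(result, c)
		case added < stepSize:
			// Not yet nominated and the round still has capacity.
			result = append(result, c)
			added++
		}
	}
	return result
}

func main() {
	clusters := []string{"worker-c", "worker-a", "worker-b", "worker-d"}

	// Incremental behaviour: a small step nominates a few clusters per round.
	fmt.Println(nextNominations(clusters, nil, 2))

	// AllAtOnce behaviour: a step larger than the cluster count nominates
	// every cluster in the first round.
	fmt.Println(nextNominations(clusters, nil, 1000000))
}
```

Whether the remaining incremental logic (round timers and the like) is harmless in this mode is exactly the open question raised above.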

olekzabl · 2026-05-08T22:49:13Z

+			if workload.IsEvicted(remoteWl) {
+				log.V(3).Info("Preserving evicted remote workload to allow eviction-recovery sync", "remote", rem)
+				continue
+			}


I have trouble following the issue description in the comment here.

Mostly because, in the handling of Evicted condition which you mentioned (I suppose, here), the only explicit call to SyncJob is here, i.e. in the case of a manager-originating eviction, while your case seems to be the other one (the worker-originating eviction, dealt with here), given that you care whether the manager will notice.

This might be not yet contradictory; maybe you've found a longer path (a ping-pong across a few reconcilers?) leading to calling SyncJob also in the worker-originating case?

But anyway, my bottom line is:

+1 to documenting this as a separate issue #N

and then, instead of summarizing that issue in a comment here, I'd just leave a link to #N, because:

even a several-lines summary can be hard to follow (as I'm right now experiencing)

an even longer summary would not feel fitting in this place

#N will act as a place where we can further discuss (while such comments are more "frozen")

olekzabl · 2026-05-08T23:01:12Z

+		return reconcile.Result{}, nil
+	}
+
+	// The workload is already assigned to a cluster, no need to nominate workers.


Nit: this comment feels not very useful, given that it's duplicated in the log text just below.
(Though I'm aware that it looks so also in incrementaldispatcher.go).

olekzabl · 2026-05-08T23:26:15Z

+// filterActiveClusters returns the subset of remoteClusters whose MultiKueueCluster
+// has the MultiKueueClusterActive condition set to True. Clusters that are missing
+// or not active are excluded so they are not nominated for workload placement.
+func (r *AllAtOnceDispatcherReconciler) filterActiveClusters(ctx context.Context, remoteClusters sets.Set[string]) (sets.Set[string], error) {
+	active := sets.New[string]()
+	for clusterName := range remoteClusters {
+		cluster := &kueue.MultiKueueCluster{}
+		if err := r.client.Get(ctx, types.NamespacedName{Name: clusterName}, cluster); err != nil {
+			if client.IgnoreNotFound(err) != nil {
+				return nil, err
+			}
+			// Missing cluster: skip.
+			continue
+		}
+		if apimeta.IsStatusConditionTrue(cluster.Status.Conditions, kueue.MultiKueueClusterActive) {
+			active.Insert(clusterName)
+		}
+	}
+	return active, nil
+}


This whole filterActiveClusters seems to be mostly an optimization that you've added "by the way"?

Looking at the original logic, IIUC, the set of nominated workers was based just on group.remotes (here), which in turn was based on this call. Digging deeper, I didn't see anything like checking MultiKueueClusterActive. I guess it makes sense, true, but I'd vote for separating optimizations from refactors (I mean, into separate PRs).

For this refactoring PR, I'd consider ways to inject wlReconciler into this reconciler and just call its remoteClientsForAc method. (There are precedents, e.g. wlReconciler knows the clustersReconciler, here). Not necessarily strictly this way, but something like this, to reduce code duplication.

Then, in a follow-up PR, you're welcome to add this smarter filtering. But maybe not only here? Maybe the other dispatchers could also benefit from that?

WDYT?

This was not intended as smart filtering.

It was added because the integration test test/integration/multikueue/setup_test.go:L752 was failing on the refactor branch. That test ("Should properly detect insecure kubeconfig of MultiKueueClusters and remove remote client") explicitly asserts that an inactive cluster does not appear in Status.NominatedClusterNames.

The old code passed this test structurally because nominations came from group.remotes (in-process map maintained by clustersReconciler), which never contains a disconnected cluster. The new dispatcher pulls from admissioncheck.GetRemoteClusters() which returns configured clusters regardless of activity, so without the filter that test fails.

That's correct, it only proves how embedded AllAtOnce was, good catch @andrewseif
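
The difference between the two sources of nominations can be illustrated with a small, self-contained sketch. The `cluster` type, the `store` map, and `filterActive` are hypothetical stand-ins for MultiKueueCluster objects and the API client; they are not the types used in the PR.

```go
package main

import "fmt"

// cluster is a hypothetical stand-in for a MultiKueueCluster: a name plus
// whether its Active condition is currently True.
type cluster struct {
	name   string
	active bool
}

// filterActive keeps only the configured names that exist in the store and
// are active. Missing or inactive clusters are skipped rather than treated
// as errors.
func filterActive(configured []string, store map[string]cluster) []string {
	active := make([]string, 0, len(configured))
	for _, name := range configured {
		c, found := store[name]
		if !found || !c.active {
			continue
		}
		active = append(active, name)
	}
	return active
}

func main() {
	store := map[string]cluster{
		"worker1": {name: "worker1", active: true},
		"worker2": {name: "worker2", active: false}, // e.g. insecure kubeconfig
	}
	// The configured list names every cluster, regardless of activity.
	// worker3 is configured but does not exist in the store.
	configured := []string{"worker1", "worker2", "worker3"}

	fmt.Println("without filter:", configured)
	fmt.Println("with filter:   ", filterActive(configured, store))
}
```

With the in-process map of connected remotes, the inactive cluster was never a candidate in the first place, which is why the old code passed the test without an explicit filter.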

olekzabl · 2026-05-08T23:37:49Z

+}
+
+func (r *AllAtOnceDispatcherReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
+	log := ctrl.LoggerFrom(ctx)


AFAICS the first 40 lines of this method are almost identical to those in incrementaldispatcher.go.
This raises 2 questions:

Can this be unified?
The differences seem not very blocking - they're only about r.clearRoundStartTime in the incr dispatcher; this could be passed as a callback "what should the common code do with a troubling error".

The first real difference is your newly-added special handling of eviction.
Though then - maybe it'd make sense to add it to the incremental dispatcher as well?
(Hence, again, I'd prefer to deal with it in a dedicated issue, and a dedicated fixing PR, separate from refactoring).
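
A rough sketch of the callback idea, assuming a shared preamble that both dispatchers call first. The `workload` struct, `commonPreamble`, and the shape of the callback are illustrative assumptions, not code from either dispatcher.

```go
package main

import (
	"errors"
	"fmt"
)

// workload is a hypothetical, reduced view of what the preamble inspects.
type workload struct {
	name        string
	finished    bool
	clusterName string // non-empty once a worker cluster has been assigned
}

var errLookup = errors.New("lookup failed")

// commonPreamble holds the checks both dispatchers would share. It reports
// whether the caller should continue with nomination. onError lets each
// dispatcher react to a failure in its own way before the error is returned.
func commonPreamble(get func() (*workload, error), onError func(error)) (bool, error) {
	wl, err := get()
	if err != nil {
		onError(err)
		return false, err
	}
	if wl == nil || wl.finished || wl.clusterName != "" {
		// Nothing to nominate: missing, finished, or already assigned.
		return false, nil
	}
	return true, nil
}

func main() {
	failing := func() (*workload, error) { return nil, errLookup }

	// Incremental: reset round bookkeeping on error (a stand-in for
	// r.clearRoundStartTime from the discussion).
	roundCleared := false
	_, err := commonPreamble(failing, func(error) { roundCleared = true })
	fmt.Println("incremental:", err, "round cleared:", roundCleared)

	// AllAtOnce: no per-round state, so the callback does nothing.
	_, err = commonPreamble(failing, func(error) {})
	fmt.Println("all-at-once:", err)
}
```

The eviction handling would then be the only per-dispatcher difference left after the preamble, which makes it easier to decide separately whether it belongs in both.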

andrewseif · 2026-05-21T09:54:31Z

@mimowo I sent my investigation to both @olekzabl and @mszadkow, but I haven't gotten any review/feedback from them yet.

And I am not sure what would be missing here, if any.
the current branch is just pending rebase, and that's it.

olekzabl · 2026-05-21T10:05:54Z

I apologize @andrewseif , I must declare bankruptcy on this PR, at least until Wednesday.
I haven't managed to look at your investigation yet.

And I am not sure what would be missing here, if any.
the current branch is just pending rebase, and that's it.

Well... there still are some comments from me which you haven't responded to?

mimowo · 2026-05-21T10:16:31Z

/test pull-kueue-priority-booster-test-integration-main
/test pull-kueue-verify-main
Checking if these are flakes, but pull-kueue-verify-main seems to be a permanent failure

andrewseif · 2026-05-21T10:27:46Z

[pull-kueue-verify-main] needs a small linter fix that I still have to push. As for the open comments, I think the investigation answers most of them, except the architectural one, specifically this:

AFAICS the first 40 lines of this method are almost identical to those in incrementaldispatcher.go.
This raises 2 questions:

Can this be unified?
The differences seem not very blocking - they're only about r.clearRoundStartTime in the incr dispatcher; this could be passed as a callback "what should the common code do with a troubling error".

The first real difference is your newly-added special handling of eviction.
Though then - maybe it'd make sense to add it to the incremental dispatcher as well?
(Hence, again, I'd prefer to deal with it in a dedicated issue, and a dedicated fixing PR, separate from refactoring).

I think this might need to be discussed in our wg-batch meeting.

I can answer from a software design perspective, but I am not sure I can answer from a kueue architecture direction, which I think @olekzabl is referring to


olekzabl · 2026-05-21T14:48:59Z

I can answer from a software design perspective, but I am not sure I can answer from a kueue architecture direction, which I think @olekzabl is referring to

My intent basically is to reduce the divergences between AllAtOnce and Incremental, because such divergences - especially if not clear at the first glance - feel like a risk of having some issues on one of the sides.
(See this comment.
BTW I still think it could be valuable to start writing "external AllAtOnce" by taking "Incremental with N = 1000000" at least as a starting point. Or, if it won't work because Incremental has its own issues, it'll be great to know that).

I consider it a healthy software engineering practice, rather than "Kueue architecture".

mimowo · 2026-05-21T18:08:28Z

I can answer from a software design perspective, but I am not sure I can answer from a kueue architecture direction, which I think @olekzabl is referring to

My intent basically is to reduce the divergences between AllAtOnce and Incremental, because such divergences - especially if not clear at the first glance - feel like a risk of having some issues on one of the sides. (See this comment. BTW I still think it could be valuable to start writing "external AllAtOnce" by taking "Incremental with N = 1000000" at least as a starting point. Or, if it won't work because Incremental has its own issues, it'll be great to know that).

I consider it a healthy software engineering practice, rather than "Kueue architecture".

I totally agree there is a lot of duplication between the two dispatchers, and we should commonize the code. Additionally, the new dispatcher seems to do a much better job by tracking changes to MultiKueueConfig, which the previous one didn't do. I think this is a bug in the incrementaldispatcher which we just didn't notice. We could get that bug fixed for free in the incremental dispatcher if we commonized the code.

On "how to" commonize the code, I'm wondering what is better architecturally, and I'm not sure:

1. a single "GenericDispatcher" with two modes, AllAtOnce and Incremental (single Reconciler)
2. two dispatchers which call a commonized function like GenericReconcile (two Reconcilers)

I'm leaning towards (1.) as this avoids duplicating the event handlers. I thought an argument for (2.) could be load separation. So this is similar to what @olekzabl suggested, except that I wouldn't frame it as "Incremental with N = 1000000" but as a GenericDispatcher which supports both modes.

Let me know @andrewseif if this makes sense.
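
For illustration only, option (1.) could look roughly like the sketch below: an explicit mode replaces the large step size. The `genericDispatcher` type, its fields, and the `nominate` method are hypothetical names and do not exist in the codebase.

```go
package main

import "fmt"

// mode selects the dispatching strategy of the hypothetical genericDispatcher.
type mode int

const (
	modeAllAtOnce mode = iota
	modeIncremental
)

// genericDispatcher is a hypothetical single reconciler covering both modes.
type genericDispatcher struct {
	mode     mode
	stepSize int // only consulted in modeIncremental
}

// batchSize returns how many not-yet-nominated clusters to add this round.
func (d genericDispatcher) batchSize(remaining int) int {
	if d.mode == modeAllAtOnce {
		return remaining
	}
	if d.stepSize < 0 {
		return 0
	}
	if d.stepSize > remaining {
		return remaining
	}
	return d.stepSize
}

// nominate returns the clusters already nominated plus the next batch taken
// from the front of remaining.
func (d genericDispatcher) nominate(nominated, remaining []string) []string {
	n := d.batchSize(len(remaining))
	out := append([]string(nil), nominated...)
	return append(out, remaining[:n]...)
}

func main() {
	remaining := []string{"worker1", "worker2", "worker3", "worker4"}

	allAtOnce := genericDispatcher{mode: modeAllAtOnce}
	fmt.Println(allAtOnce.nominate(nil, remaining))

	incremental := genericDispatcher{mode: modeIncremental, stepSize: 3}
	fmt.Println(incremental.nominate(nil, remaining))
}
```

In this shape the event handlers and watches would be registered once on the single reconciler, which is the duplication argument made for (1.) above.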

mszadkow · 2026-05-22T10:01:27Z

@andrewseif I read through your investigation, it appears to be correct.
I don't think it's a bug necessarily, but rather the effect of how embedded the AllAtOnce is into the MK Workload Reconciler.

k8s-ci-robot · 2026-05-22T10:09:25Z

PR needs rebase.


k8s-ci-robot · 2026-05-22T21:15:17Z

@andrewseif: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kueue-test-e2e-main-1-36	`5ea862e`	link	true	`/test pull-kueue-test-e2e-main-1-36`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. area/multikueue Issues or PRs related to MultiKueue labels May 4, 2026

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 4, 2026

k8s-ci-robot requested review from mimowo and pajakd May 4, 2026 22:17

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 4, 2026

andrewseif changed the title ~~add allAtOne dispatcher files, and fix wiring~~ Move AllAtOnce MultiKueue dispatcher to a dedicated controller May 4, 2026

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 5, 2026

mszadkow reviewed May 5, 2026


Comment thread pkg/controller/workloaddispatcher/allatoncedispatcher_test.go

mszadkow reviewed May 5, 2026


Comment thread pkg/controller/workloaddispatcher/allatonedispatcher.go Outdated

andrewseif force-pushed the Issue-6803-move-AllAtOnce-to-Controller branch from 109b30d to fb7c981 Compare May 5, 2026 08:54

mimowo reviewed May 5, 2026


k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 5, 2026

mimowo reviewed May 7, 2026


Comment thread pkg/controller/workloaddispatcher/incrementaldispatcher.go Outdated

mimowo reviewed May 7, 2026


Comment thread pkg/controller/admissionchecks/multikueue/workload.go

olekzabl reviewed May 8, 2026


andrewseif and others added 16 commits May 21, 2026 13:34

add allAtOne dispatcher files, and fix wiring

04889ce

add cluster and config watchers

806ff84

fix typo, rename files

5a25784

add new dispatcher in integration tests

fc68bcc

add an explicit IsEvicted(wl) skip to the new dispatcher, and the old…

c0e38ce

… incremental one.

update nominateAndSynchronizeWorkers to leave evicted workload alone …

582e556

…until job reconciler runs

use actual workload instead of cache to avoid cache staleness and avo…

7f55da7

…id race condition windows

add feature gate, and update the code

1c6850e

remove eviction gate from incremental dispatcher

88a077f

remove eviction gate guard from nominateAndSynchronizeWorkers, to rep…

414274c

…licate the log error

add eviction gate guard to nominateAndSynchronizeWorkers

0c63bb6

Update pkg/controller/admissionchecks/multikueue/workload.go

d4f4e06

Co-authored-by: Olek Zabłocki <olekz@google.com>

Update pkg/controller/admissionchecks/multikueue/workload.go

79d1837

Co-authored-by: Olek Zabłocki <olekz@google.com>

Update pkg/controller/workloaddispatcher/allatoncedispatcher.go

c1538c6

Co-authored-by: Olek Zabłocki <olekz@google.com>

Update pkg/controller/admissionchecks/multikueue/workload.go

5369cde

Co-authored-by: Olek Zabłocki <olekz@google.com>

use list instead of slice in allatoncedispatcher.go

83f6a2b

andrewseif force-pushed the Issue-6803-move-AllAtOnce-to-Controller branch from 4b13c9b to 83f6a2b Compare May 21, 2026 10:35

andrewseif added 2 commits May 21, 2026 13:37

fix kueue verify

ce9abaa

add all at once to feature list

5ea862e

mszadkow mentioned this pull request May 21, 2026

[Fix] Prevent remote wl creation with stale ClusterName #11378

Merged

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 22, 2026
