Move AllAtOnce MultiKueue dispatcher to a dedicated controller#10937

Open
andrewseif wants to merge 18 commits into kubernetes-sigs:main from andrewseif:Issue-6803-move-AllAtOnce-to-Controller

Conversation

@andrewseif
Contributor

What type of PR is this?

/kind cleanup
/area multikueue

What this PR does / why we need it:

Move AllAtOnce MultiKueue dispatcher to a dedicated controller

Which issue(s) this PR fixes:

Fixes #6803

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. area/multikueue Issues or PRs related to MultiKueue labels May 4, 2026
@netlify

netlify Bot commented May 4, 2026

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit 5ea862e
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a0ee9c7c210940008b07a7c
😎 Deploy Preview https://deploy-preview-10937--kubernetes-sigs-kueue.netlify.app

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 4, 2026
@k8s-ci-robot
Contributor

Hi @andrewseif. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andrewseif
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from mimowo and pajakd May 4, 2026 22:17
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 4, 2026
@andrewseif andrewseif changed the title add allAtOne dispatcher files, and fix wiring Move AllAtOnce MultiKueue dispatcher to a dedicated controller May 4, 2026
@tenzen-y
Member

tenzen-y commented May 5, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 5, 2026
Comment thread pkg/controller/workloaddispatcher/allatoncedispatcher_test.go
Comment thread pkg/controller/workloaddispatcher/allatonedispatcher.go Outdated
@andrewseif andrewseif force-pushed the Issue-6803-move-AllAtOnce-to-Controller branch from 109b30d to fb7c981 Compare May 5, 2026 08:54
Contributor

@mimowo mimowo left a comment

Thank you for the effort 👍

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 5, 2026
@andrewseif
Contributor Author

andrewseif commented May 5, 2026

I had to add some logic to watch the config and clusters, as these came for free in the inline version, but the logic itself is mostly the same.

@andrewseif
Contributor Author

/retest

@mimowo
Contributor

mimowo commented May 7, 2026

cc @Singularity23x0 @olekzabl @kshalot ptal

Comment on lines +786 to +789
if workload.IsEvicted(remoteWl) {
	log.V(3).Info("Preserving evicted remote workload to allow eviction-recovery sync", "remote", rem)
	continue
}
Contributor

Interesting. Does it mean we have a bug in the Incremental dispatcher, which is already extracted? If this is the case, maybe we could start by showing the bug and fixing it here. The split would also make the PRs more digestible.

Contributor Author

I wouldn't say it was a bug, but there was a race window that became apparent when I extracted the AllAtOnce MK dispatcher. I'll make sure to post the findings in wg-batch so that more experienced eyes can verify them.

Contributor

@olekzabl olekzabl May 8, 2026

I have trouble following the issue description in the comment here.

Mostly because, in the handling of Evicted condition which you mentioned (I suppose, here), the only explicit call to SyncJob is here, i.e. in the case of a manager-originating eviction, while your case seems to be the other one (the worker-originating eviction, dealt with here), given that you care whether the manager will notice.

This might be not yet contradictory; maybe you've found a longer path (a ping-pong across a few reconcilers?) leading to calling SyncJob also in the worker-originating case?

But anyway, my bottom line is:

  • +1 to documenting this as a separate issue #N
  • and then, instead of summarizing that issue in a comment here, I'd just leave a link to #N, because:
    • even a several-lines summary can be hard to follow (as I'm right now experiencing)
    • an even longer summary does not feel fit in this place
    • #N will act as a place where we can further discuss (while such comments are more "frozen")

Contributor Author

I have sent you my investigation findings; maybe you can verify them. I believe moving the AllAtOnce dispatcher exposed this, and the fix should be part of this PR, as without it the dispatcher won't function properly.

Contributor

I'm wondering if this is related to this problem which is also attempted to be fixed here: #11378

cc @mszadkow wdyt?

Contributor Author

@andrewseif andrewseif May 21, 2026

I have looked at what @mszadkow did. It's in the same test suite, and I think both changes target the same race condition, which was showing up in two different tests.

Edit: I think it should solve #11115 as well?

Contributor

> I wouldn't say it was a bug, but there was a race window that was apparent, when I extracted the AllAtOnce MK dispatcher, ill make sure to post the findings for verification in wg-batch for more experienced eyes to verify my findings

@andrewseif actually most "races" are bugs, so I would like to understand which test is failing and how. It may very well be that you have discovered a bug that we should extract into a separate preparatory bugfix PR and cherry-pick.

To let us understand what the race is, I would recommend that you temporarily revert (or comment out) the code so that we can see the failure and analyze it. Then we can make an informed decision on whether this is a separate bugfix or part of this PR.

Comment thread pkg/controller/workloaddispatcher/incrementaldispatcher.go Outdated
Contributor

@mimowo mimowo left a comment

cc @mszadkow who previously worked on extracting / adding the incremental dispatcher. Ptal

Comment thread pkg/controller/admissionchecks/multikueue/workload.go
@mimowo
Contributor

mimowo commented May 8, 2026

@andrewseif I'm planning to conclude the review next week. Overall it looks great, but another pair of eyes from MK experts (@olekzabl or @kshalot) would be great. Thank you for the effort once again 👍

Contributor

@olekzabl olekzabl left a comment

First of all, thank you for doing this!

Then, even before going into detailed comments, I'm feeling I should start from a "provoking" question:
The incremental dispatcher is currently being parametrized with step size (#10877).
Given that, what if we "implemented" the "externalized" AllAtOnce just as a special case of that?
standaloneAllAtOnce := incrementalDispatcher{stepSize: 1000000}

You could say that's too simplistic, wasting performance etc. Because incremental dispatcher contains some bits of logic that we don't need. Perhaps.

But even if so, I'd like to ask how much of them we can have in common. Maybe extracting some shared pieces. Or maybe a central shared entry point with injectable per-case callbacks. IDK yet.

I'm just intuitively afraid of nearly-duplicating ~200 lines of code which may then diverge without a good reason. (And, in my eyes, this "unjustified divergence" shows up already in this PR. See my detailed comments).

I haven't yet read everything but must pause now. Will come back later.
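
To make the "special case" idea above concrete, here is a minimal sketch. It is not Kueue code: `rounds` is a hypothetical helper, and it assumes that an incremental dispatcher nominates worker clusters in fixed-size steps, ignoring the timing between rounds. Under that assumption, a step size larger than the number of clusters collapses into a single round, which is what AllAtOnce does.

```go
package main

import "fmt"

// rounds is a hypothetical helper (not part of Kueue). It splits the
// candidate clusters into successive nomination rounds of at most
// stepSize clusters each.
func rounds(candidates []string, stepSize int) [][]string {
	if stepSize <= 0 {
		stepSize = 1 // assumption: treat a non-positive step as 1
	}
	var out [][]string
	for start := 0; start < len(candidates); start += stepSize {
		end := start + stepSize
		if end > len(candidates) {
			end = len(candidates)
		}
		out = append(out, candidates[start:end])
	}
	return out
}

func main() {
	clusters := []string{"w1", "w2", "w3", "w4", "w5"}

	// Incremental: several rounds of two clusters each.
	fmt.Println(rounds(clusters, 2))

	// "AllAtOnce as a special case": one round with every cluster.
	fmt.Println(rounds(clusters, 1000000))
}
```

The sketch only covers the nomination order. Whether the rest of the incremental logic (round timeouts, per-round state) is cheap enough to carry along is exactly the open question raised in the comment.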

Comment thread pkg/controller/admissionchecks/multikueue/workload.go Outdated
Comment thread pkg/controller/admissionchecks/multikueue/workload.go Outdated
Comment thread pkg/controller/workloaddispatcher/allatoncedispatcher.go Outdated
Comment thread pkg/controller/admissionchecks/multikueue/workload.go Outdated
return reconcile.Result{}, nil
}

// The workload is already assigned to a cluster, no need to nominate workers.
Contributor

Nit: this comment feels not very useful, given that it's duplicated in the log text just below.
(Though I'm aware that it looks so also in incrementaldispatcher.go).

Comment on lines +138 to +157
// filterActiveClusters returns the subset of remoteClusters whose MultiKueueCluster
// has the MultiKueueClusterActive condition set to True. Clusters that are missing
// or not active are excluded so they are not nominated for workload placement.
func (r *AllAtOnceDispatcherReconciler) filterActiveClusters(ctx context.Context, remoteClusters sets.Set[string]) (sets.Set[string], error) {
	active := sets.New[string]()
	for clusterName := range remoteClusters {
		cluster := &kueue.MultiKueueCluster{}
		if err := r.client.Get(ctx, types.NamespacedName{Name: clusterName}, cluster); err != nil {
			if client.IgnoreNotFound(err) != nil {
				return nil, err
			}
			// Missing cluster: skip.
			continue
		}
		if apimeta.IsStatusConditionTrue(cluster.Status.Conditions, kueue.MultiKueueClusterActive) {
			active.Insert(clusterName)
		}
	}
	return active, nil
}
Contributor

This whole filterActiveClusters seems to be mostly an optimization that you've added "by the way"?

Looking at the original logic, IIUC, the set of nominated workers was based just on group.remotes (here), which is in turn based on this call. Digging deeper, I didn't see anything like checking MultiKueueClusterActive. I guess the check makes sense, true, but I'd vote for separating optimizations from refactors (I mean, into separate PRs).

For this refactoring PR, I'd consider ways to inject wlReconciler into this reconciler and just call its remoteClientsForAc method. (There are precedents, e.g. wlReconciler knows the clustersReconciler, here). Not necessarily strictly this way, but sth like this, to reduce duplication of code.

Then, in a follow-up PR, you're welcome to add this smarter filtering. But maybe not only here? Maybe the other dispatchers could also benefit from that?

WDYT?

Contributor Author

This was not intended as smart filtering.

It was added because the integration test test/integration/multikueue/setup_test.go:L752 was failing on the refactor branch. That test ("Should properly detect insecure kubeconfig of MultiKueueClusters and remove remote client") explicitly asserts that an inactive cluster does not appear in Status.NominatedClusterNames.

The old code passed this test structurally because nominations came from group.remotes (in-process map maintained by clustersReconciler), which never contains a disconnected cluster. The new dispatcher pulls from admissioncheck.GetRemoteClusters() which returns configured clusters regardless of activity, so without the filter that test fails.
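
The difference described above can be illustrated with a small standalone sketch. It is not the PR's code: `cluster`, `filterActive` and the worker names are hypothetical stand-ins, with `Active` playing the role of the MultiKueueClusterActive condition. The point is only that a list of configured names needs an explicit activity filter, whereas an in-process map of connected clients excludes disconnected clusters by construction.

```go
package main

import (
	"fmt"
	"sort"
)

// cluster is a hypothetical stand-in for a MultiKueueCluster. Active
// mirrors the MultiKueueClusterActive condition being True.
type cluster struct {
	Name   string
	Active bool
}

// filterActive keeps only the configured names whose cluster object
// exists and is active. Missing clusters are skipped, not treated as errors.
func filterActive(configured []string, known map[string]cluster) []string {
	var active []string
	for _, name := range configured {
		c, ok := known[name]
		if !ok || !c.Active {
			continue
		}
		active = append(active, name)
	}
	sort.Strings(active)
	return active
}

func main() {
	// What a config lookup would return: every configured cluster,
	// regardless of connectivity.
	configured := []string{"worker1", "worker2", "worker3"}

	known := map[string]cluster{
		"worker1": {Name: "worker1", Active: true},
		"worker2": {Name: "worker2", Active: false}, // e.g. insecure kubeconfig
		// worker3 is configured but its cluster object is missing.
	}

	fmt.Println(filterActive(configured, known))
}
```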

Contributor

That's correct, it only proves how embedded AllAtOnce was, good catch @andrewseif

}

func (r *AllAtOnceDispatcherReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log := ctrl.LoggerFrom(ctx)
Contributor

@olekzabl olekzabl May 8, 2026

AFAICS the first 40 lines of this method are almost identical to those in incrementaldispatcher.go.
This raises 2 questions:

  1. Can this be unified?
    The differences seem not very blocking - they're only about r.clearRoundStartTime in the incr dispatcher; this could be passed as a callback "what should the common code do with a troubling error".

  2. The first real difference is your newly-added special handling of eviction.
    Though then - maybe it'd make sense to add it to the incremental dispatcher as well?
    (Hence, again, I'd prefer to deal with it in a dedicated issue, and a dedicated fixing PR, separate from refactoring).
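
A rough sketch of the callback idea from point 1, under loud assumptions: `workload`, `errorHook`, `commonPrelude` and `errNotFound` are hypothetical names invented for illustration, and the real reconcilers do more (and handle not-found errors differently). It only shows the shape: the shared prelude reports errors to a per-dispatcher hook, so the incremental dispatcher can clear its round state while AllAtOnce passes no hook at all.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel used only by this sketch.
var errNotFound = errors.New("workload not found")

// workload is a trimmed-down, hypothetical stand-in for a Kueue Workload.
type workload struct {
	Name            string
	AssignedCluster string
}

// errorHook lets each dispatcher decide what the shared code should do
// when it hits an error. A nil hook means "do nothing extra".
type errorHook func(name string, err error)

// commonPrelude is the part both dispatchers could share. It returns the
// workload and whether the caller should go on to nominate workers.
func commonPrelude(name string, get func(string) (*workload, error), onErr errorHook) (*workload, bool, error) {
	wl, err := get(name)
	if err != nil {
		if onErr != nil {
			onErr(name, err)
		}
		return nil, false, err
	}
	if wl.AssignedCluster != "" {
		// Already assigned to a cluster: nothing to nominate.
		return wl, false, nil
	}
	return wl, true, nil
}

func main() {
	store := map[string]*workload{
		"wl-a": {Name: "wl-a"},
		"wl-b": {Name: "wl-b", AssignedCluster: "worker1"},
	}
	get := func(name string) (*workload, error) {
		if wl, ok := store[name]; ok {
			return wl, nil
		}
		return nil, errNotFound
	}

	// Incremental flavour: forget the per-workload round state on error.
	roundStart := map[string]int{"wl-missing": 42}
	clearRound := func(name string, _ error) { delete(roundStart, name) }

	for _, name := range []string{"wl-a", "wl-b", "wl-missing"} {
		_, proceed, err := commonPrelude(name, get, clearRound)
		fmt.Println(name, "proceed:", proceed, "err:", err)
	}
	fmt.Println("remaining round entries:", len(roundStart))
}
```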

Comment thread pkg/controller/workloaddispatcher/allatoncedispatcher.go Outdated
@andrewseif
Contributor Author

@mimowo I sent my investigation to both @olekzabl and @mszadkow, but I haven't gotten any review/feedback from them yet.

And I am not sure what would be missing here, if anything.
The current branch is just pending a rebase, and that's it.

@olekzabl
Contributor

I apologize @andrewseif, I must declare bankruptcy on this PR, at least until Wednesday.
I haven't managed to look at your investigation yet.

> And I am not sure what would be missing here, if any.
> the current branch is just pending rebase, and that's it.

Well... there still are some comments from me which you haven't responded to?

@mimowo
Contributor

mimowo commented May 21, 2026

/test pull-kueue-priority-booster-test-integration-main
/test pull-kueue-verify-main
Checking if these are flakes, but the pull-kueue-verify-main failure seems to be permanent.

@andrewseif
Contributor Author

[pull-kueue-verify-main] is a small linter fix I still need to push. As for the open comments, I think the investigation answers most of them, except the architectural one, specifically this:

> AFAICS the first 40 lines of this method are almost identical as in incrementaldispatcher.go.
> This raises 2 questions:
>
> 1. Can this be unified?
>    The differences seem not very blocking - they're only about r.clearRoundStartTime in the incr dispatcher; this could be passed as a callback "what should the common code do with a troubling error".
>
> 2. The first real difference is your newly-added special handling of eviction.
>    Though then - maybe it'd make sense to add it to the incremental dispatcher as well?
>    (Hence, again, I'd prefer to deal with it in a dedicated issue, and a dedicated fixing PR, separate from refactoring).

I think this might need to be discussed in our wg-batch meeting.

I can answer from a software design perspective, but I am not sure I can answer from a Kueue architecture perspective, which I think is what @olekzabl is referring to.

@andrewseif andrewseif force-pushed the Issue-6803-move-AllAtOnce-to-Controller branch from 4b13c9b to 83f6a2b Compare May 21, 2026 10:35
@olekzabl
Contributor

olekzabl commented May 21, 2026

> I can answer from a software design perspective, but I am not sure I can answer from a kueue architecture direction, which I think @olekzabl is referring to

My intent basically is to reduce the divergences between AllAtOnce and Incremental, because such divergences - especially if not clear at the first glance - feel like a risk of having some issues on one of the sides.
(See this comment.
BTW I still think it could be valuable to start writing "external AllAtOnce" by taking "Incremental with N = 1000000" at least as a starting point. Or, if it won't work because Incremental has its own issues, it'll be great to know that).

I consider it a healthy software engineering practice, rather than "Kueue architecture".

@mimowo
Contributor

mimowo commented May 21, 2026

>> I can answer from a software design perspective, but I am not sure I can answer from a kueue architecture direction, which I think @olekzabl is referring to
>
> My intent basically is to reduce the divergences between AllAtOnce and Incremental, because such divergences - especially if not clear at the first glance - feel like a risk of having some issues on one of the sides. (See this comment. BTW I still think it could be valuable to start writing "external AllAtOnce" by taking "Incremental with N = 1000000" at least as a starting point. Or, if it won't work because Incremental has its own issues, it'll be great to know that).
>
> I consider it as a software engineering healthy practice, rather than "Kueue architecture".

I totally agree there is a lot of duplication between the two dispatchers, and we should commonize the code. Additionally, the new dispatcher seems to do a much better job by tracking changes to MultiKueueConfig, which the previous one didn't do. I think this is a bug in the incremental dispatcher which we just didn't notice. We could get that bug fixed for free if we commonized the code.

On "how to" commonize the code I'm wondering architecturally what is better - and I'm not sure:

  1. single "GenericDispatcher" with two modes AllAtOnce and Incremental (single Reconciler)
  2. two dispatchers which call a commonized function like GenericReconcile (two Reconcilers)

I'm leaning towards (1.), as this avoids duplicating the event handlers. An argument for (2.) could be load separation. So this is similar to what @olekzabl suggested, except that I wouldn't frame it as "Incremental with N = 1000000", but as a GenericDispatcher which supports both modes.

Let me know @andrewseif if this makes sense.
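
For discussion, the two options could look roughly like this. Everything here is a hypothetical sketch rather than a proposal for the actual types: `nominateFunc`, `newGenericDispatcher`, `genericReconcile` and the short mode strings are made-up names, and the nomination strategies are simplified to "which prefix of the candidate list is nominated so far" (with `nominated` assumed to be non-negative).

```go
package main

import "fmt"

// nominateFunc returns the clusters that should be nominated, given the
// full candidate list and how many were nominated in earlier rounds.
type nominateFunc func(candidates []string, nominated int) []string

// allAtOnce nominates every candidate immediately.
func allAtOnce(candidates []string, _ int) []string { return candidates }

// incremental extends the nominated prefix by step clusters per round.
func incremental(step int) nominateFunc {
	return func(candidates []string, nominated int) []string {
		end := nominated + step
		if end > len(candidates) {
			end = len(candidates)
		}
		return candidates[:end]
	}
}

// Option 1: a single dispatcher (one reconciler, one set of event
// handlers) that selects its strategy from a mode.
type genericDispatcher struct {
	nominate nominateFunc
}

func newGenericDispatcher(mode string, step int) (genericDispatcher, error) {
	switch mode {
	case "AllAtOnce":
		return genericDispatcher{nominate: allAtOnce}, nil
	case "Incremental":
		if step <= 0 {
			return genericDispatcher{}, fmt.Errorf("step must be positive, got %d", step)
		}
		return genericDispatcher{nominate: incremental(step)}, nil
	default:
		return genericDispatcher{}, fmt.Errorf("unknown mode %q", mode)
	}
}

// Option 2: two separate reconcilers that both delegate to a shared
// function and differ only in the strategy they pass in.
func genericReconcile(candidates []string, nominated int, nominate nominateFunc) []string {
	return nominate(candidates, nominated)
}

func main() {
	clusters := []string{"w1", "w2", "w3", "w4", "w5"}

	all, err := newGenericDispatcher("AllAtOnce", 0)
	if err != nil {
		panic(err)
	}
	inc, err := newGenericDispatcher("Incremental", 3)
	if err != nil {
		panic(err)
	}

	fmt.Println(all.nominate(clusters, 0))                     // option 1, AllAtOnce
	fmt.Println(inc.nominate(clusters, 0))                     // option 1, first incremental round
	fmt.Println(genericReconcile(clusters, 3, incremental(3))) // option 2, second round
}
```

In both shapes the strategy is the only per-mode piece, so the choice between them mostly comes down to whether event handlers and watches are registered once or twice.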

@mszadkow
Contributor

@andrewseif I read through your investigation, it appears to be correct.
I don't think it's a bug necessarily, but rather the effect of how embedded the AllAtOnce is into the MK Workload Reconciler.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 22, 2026
@k8s-ci-robot
Contributor

PR needs rebase.


@k8s-ci-robot
Contributor

@andrewseif: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kueue-test-e2e-main-1-36 | 5ea862e | link | true | /test pull-kueue-test-e2e-main-1-36 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.



Labels

area/multikueue: Issues or PRs related to MultiKueue
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.
ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
release-note-none: Denotes a PR that doesn't merit a release note.
size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Move AllAtOnce MultiKueue dispatcher to a dedicated controller

6 participants