WIP: [KEP] composable multikueue dispatcher by amy · Pull Request #10784 · kubernetes-sigs/kueue

amy · 2026-04-26T04:04:55Z

What type of PR is this?

THIS IS VERY MUCH A WIP. I opened this PR so that people know it's being worked on.

/kind kep

What this PR does / why we need it:

We need a composable dispatcher that allows for different routing policies. The overall goal is to treat clusters like nodes, with concepts like filtering and scoring, as well as affinity/anti-affinity.

Which issue(s) this PR partially fixes:

Part of: #10766

Special notes for your reviewer:

It's a WIP.

Does this PR introduce a user-facing change?

NONE

netlify · 2026-04-26T04:05:02Z

✅ Deploy Preview for kubernetes-sigs-kueue canceled.

Name	Link
🔨 Latest commit	`4ec7a75`
🔍 Latest deploy log	https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/69ed8fa3c8a2bb0008d7671c

k8s-ci-robot · 2026-04-26T04:05:08Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: amy
Once this PR has been reviewed and has the lgtm label, please assign gabesaba for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

amy · 2026-04-26T04:13:24Z

/area multikueue

k8s-ci-robot · 2026-04-26T04:22:17Z

@amy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kueue-verify-main	`4ec7a75`	link	true	`/test pull-kueue-verify-main`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

mwielgus · 2026-04-27T10:33:29Z

+  round timeout. Cluster order is unrelated to any workload or cluster property, so
+  placement is essentially arbitrary.
+
+Neither strategy allows workloads to express placement requirements ("run only on GPU clusters


Playing devil's advocate a bit: in theory there can be a completely custom, user-provided dispatcher. I'm wondering why we need fine-grained extension points, which would require some controller anyway. Maybe we should just provide a reference implementation and encourage users to fork it?

amy · 2026-04-27

I need a cloud-agnostic framework that users can use across clouds, and then be able to provide custom configuration recommendations. The first pass of this would be an in-tree dispatcher anyway. See: https://github.com/kubernetes-sigs/kueue/pull/10784/changes/BASE..4ec7a758f51f1527cfda7c9e527cc2f307db2f3b#diff-ab13a7d3e9220a7d4c936bdf44adc4aaffdd91eecc1d524da90bac251b82ba5aR106

I need filtering and scoring semantics for basic affinity/anti-affinity. Sure, people could write a new dispatcher that has no policy configuration and embeds the policy in the code, but that is a bad experience and not reusable.

This is just a third dispatcher option that sits alongside Incremental and AllAtOnce.

olekzabl

Hello @amy, my apologies for reading it so late.

I generally like the ideas here, and I think the topic is essential for MultiKueue.

I didn't manage to read very deeply - but let me leave a few early questions.

olekzabl · 2026-05-20T13:30:57Z

+| `clusterSelector` | `kueue.x-k8s.io/cluster-selector` (JSON `map[string]string`) |
+| `clusterAffinity` | `kueue.x-k8s.io/cluster-affinity` (JSON-encoded `ClusterAffinity`) |
+| `clusterTolerations` | `kueue.x-k8s.io/cluster-tolerations` (JSON array of `ClusterToleration`) |


I'm anxious about these JSON encodings.
AFAICS JSON encoding does not happen for "pod placement fields", nor elsewhere in Kueue; even in the K8s ecosystem it seems rather rare (and, when present, not fully standardized).

For the user, this would be typo-prone. (Cf. item 1 in your Risks, but not just label values, also JSON syntax).
I understand it's webhook-enforceable, still unpleasant.

I do see where it comes from - we want these specified for the original jobs (not Workloads), which can have various types; hence annotations; hence flat string values.

Still, maybe there are alternatives?
E.g. define a new structured CRD to hold all these fields and just refer to an instance of this CRD in a job annotation value?
(We could also accept annotations for Alpha, for more flexible experimentation, and then plan to standardize on a CRD for Beta).
This could win some type safety, but at the cost of the inconvenience of spreading the spec across resources.

WDYT?
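
To make the typo concern concrete, here is a minimal sketch of decoding the `kueue.x-k8s.io/cluster-selector` annotation. The helper `parseClusterSelector` is hypothetical and not part of Kueue; only the annotation key and the `map[string]string` shape come from the quoted table. A missing closing quote in the value is only caught when the JSON is decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clusterSelectorAnnotation is the annotation key quoted from the KEP table.
const clusterSelectorAnnotation = "kueue.x-k8s.io/cluster-selector"

// parseClusterSelector is a hypothetical helper (an assumption, not Kueue code).
// It decodes the JSON-encoded map[string]string carried in a job annotation.
func parseClusterSelector(annotations map[string]string) (map[string]string, error) {
	raw, ok := annotations[clusterSelectorAnnotation]
	if !ok {
		return nil, nil // annotation absent: no selector requested
	}
	var selector map[string]string
	if err := json.Unmarshal([]byte(raw), &selector); err != nil {
		return nil, fmt.Errorf("invalid %s annotation: %w", clusterSelectorAnnotation, err)
	}
	return selector, nil
}

func main() {
	good := map[string]string{clusterSelectorAnnotation: `{"accelerator":"gpu"}`}
	// Missing closing quote after gpu: easy to type, rejected only at decode time.
	bad := map[string]string{clusterSelectorAnnotation: `{"accelerator":"gpu}`}

	selector, err := parseClusterSelector(good)
	fmt.Println(selector, err)

	_, err = parseClusterSelector(bad)
	fmt.Println("malformed annotation rejected:", err != nil)
}
```

A webhook could run the same decoding at admission time, which moves the failure earlier but does not remove the need to hand-write JSON inside a string.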

olekzabl · 2026-05-20T14:00:08Z

+- `ClusterFeasibility` filter fully implemented (requires [kubernetes-sigs/kueue#10105](https://github.com/kubernetes-sigs/kueue/issues/10105)).
+- `CapacityScore` plugin implemented (also requires [kubernetes-sigs/kueue#10105](https://github.com/kubernetes-sigs/kueue/issues/10105)).


How do these 2 differ?
Is it filter ("required") vs. score ("preferred")? Or something more?

olekzabl · 2026-05-20T14:04:32Z

+    // Only evaluated when a MultiKueue AdmissionCheck is active on the workload's
+    // ClusterQueue.
+    // +optional
+    ClusterSelector map[string]string `json:"clusterSelector,omitempty"`


Nit: do I understand correctly that the value (even after JSON-decoding) will not be of the ClusterSelector type defined below, and will instead be a standard label selector?
If so, this may be confusing.

olekzabl · 2026-05-20T19:11:13Z

+- Implement `ClusterFeasibility` filter: rejects clusters that cannot admit the workload
+  based on quota headroom visible in `MultiKueueCluster.Status`. Blocked on
+  [kubernetes-sigs/kueue#10105](https://github.com/kubernetes-sigs/kueue/issues/10105)


What is meant by "quota headroom" - is it "unreserved capacity"?
If so, defining a strict filter based on that may be too strict - even if a workload "seems not to fit", it could still borrow or preempt.

I imagine we could strictly reject a workload if its requests exceed the worker's total capacity.
"Unreserved capacity" feels like valuable information - though ideally only for soft scoring.
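
A rough sketch of that split, under simplifying assumptions: a single resource, and a hypothetical `clusterCapacity` type that is not taken from the KEP or from `MultiKueueCluster.Status`. The filter rejects only requests that exceed total capacity, while unreserved capacity influences the score alone.

```go
package main

import "fmt"

// clusterCapacity is a hypothetical, simplified view of one worker cluster,
// tracking a single resource (for example, GPUs).
type clusterCapacity struct {
	Name       string
	Total      int64 // whole capacity of the worker
	Unreserved int64 // capacity not currently reserved
}

// fitsTotalCapacity is the strict filter: reject only when the request could
// never fit, even after borrowing or preemption inside the worker.
func fitsTotalCapacity(c clusterCapacity, request int64) bool {
	return request <= c.Total
}

// headroomScore is the soft score in [0, 100]: the percentage of the request
// covered by currently unreserved capacity.
func headroomScore(c clusterCapacity, request int64) int64 {
	if request <= 0 || c.Unreserved >= request {
		return 100
	}
	if c.Unreserved <= 0 {
		return 0
	}
	return c.Unreserved * 100 / request
}

// pickCluster filters first, then returns the highest-scoring survivor.
// On ties, the earlier cluster in the slice wins.
func pickCluster(clusters []clusterCapacity, request int64) (string, bool) {
	best, bestScore, found := "", int64(-1), false
	for _, c := range clusters {
		if !fitsTotalCapacity(c, request) {
			continue
		}
		if s := headroomScore(c, request); s > bestScore {
			best, bestScore, found = c.Name, s, true
		}
	}
	return best, found
}

func main() {
	clusters := []clusterCapacity{
		{Name: "small", Total: 4, Unreserved: 4},
		{Name: "busy", Total: 16, Unreserved: 2},
		{Name: "idle", Total: 16, Unreserved: 12},
	}
	name, ok := pickCluster(clusters, 8)
	fmt.Println(name, ok)
}
```

With this shape, a busy cluster stays eligible and merely scores lower, so a workload that could still borrow or preempt there is not ruled out up front.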

olekzabl · 2026-05-20T20:24:13Z

+}
+```
+
+### Workload Annotation: Dispatch Mode


I can't understand the rationale for BestEffort.
I'm not sure where the misunderstanding is, so let me try to nail it down with the following questions and remarks:

1. In BestEffort, when a workload is dispatched to a worker and the AC is marked Ready, what will be the workload status on the manager and on the selected worker?

   A: Admitted on manager / QuotaReserved on worker?
   B: Admitted on both (even though actually not running yet on the worker)?

   I assumed A but my AI friend claimed it's B. Hence asking.

2. If it's 1B ("Admitted on both"), doesn't it break some contracts?
   What if a worker has its own AdmissionChecks - would we just magically skip them?

3. "If the remote cluster later evicts the workload, the eviction propagates back to the manager"
   You say this for BestEffort - but why wouldn't it hold also for MustBeAdmitted?

4. "If the remote does not admit within workerLostTimeout, the dispatcher retries with the next-best cluster."
   You say this for MustBeAdmitted - but why wouldn't it hold also for BestEffort?
   (At least assuming 1A = "QuotaReserved on worker" - this state could last for a long time, even in BestEffort).

5. You mention workerLostTimeout, but I feel it doesn't fit in this story.

   IIUC the intent behind workerLostTimeout is to handle cases when we lost connection to a worker cluster on which the remote workload had already started; see this comment and the description of [multikueue] Manage worker cluster unavailability #1681.
   In that case, we want to speculatively assume for some time that the workload is still running there - to avoid re-dispatching which could later turn out to have been wasteful.

   The case in this KEP seems very different. Here (IIUC) we can still see the worker cluster, but the workload failed to start on it.
   This feels more like waitForPodsReady (though admittedly it can't be directly expressed by that).

I can't help the feeling that the choice between BestEffort / MustBeAdmitted:

- will not affect the actual status (running / not running) of the workload on the 1st selected worker, or the timeline of that status
- ideally should not affect the timeline of the manager's reaction to any worker-side complications
- and hence, by induction, ideally should have no effect on "placement latency" at all
- by definition, affects local workload statuses - but this feels like just abstract labeling?

Historically, the early MultiKueue behavior was "mark the workload Admitted on the manager as soon as it got QuotaReserved on some worker" - but then, it was considered imperfect and evolved towards "let manager-side admission follow worker-side admission" (MultiKueue: workloads from worker clusters are deleted prematurely #8585).
In this context, introducing BestEffort seems to be a step in the opposite direction (and even more so making it default).
So I'd really like to understand the reason for this.

olekzabl · 2026-05-20T20:27:52Z

+- **Top-K nomination.** The current dispatcher nominates exactly one cluster per cycle
+  (top-1). A future KEP can add parallel nomination of K clusters simultaneously as an
+  opt-in `nominationMode: TopK` field on `ClusterDispatcherProfile`, reducing tail latency
+  when the top-scoring cluster is slow to admit. Top-1 remains the default.


Or, without introducing "nomination modes", just expose an *int for K, where nil means 1.
Cf. KEP-9270 which proposes a similar thing for the incremental dispatcher.
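
A minimal sketch of that alternative, assuming a hypothetical `NominationCount` field (the name and the surrounding struct are illustrative, not from the KEP): a nil pointer keeps today's top-1 behavior, and no separate mode enum is needed.

```go
package main

import "fmt"

// dispatcherProfileSpec is a hypothetical fragment of a profile spec; the
// field name NominationCount is an assumption made for illustration.
type dispatcherProfileSpec struct {
	// NominationCount is the number of clusters nominated per cycle (K).
	// nil means 1.
	NominationCount *int `json:"nominationCount,omitempty"`
}

// effectiveNominationCount applies the "nil means 1" default. Values below 1
// are clamped to 1 here; a real API would more likely reject them in validation.
func effectiveNominationCount(spec dispatcherProfileSpec) int {
	if spec.NominationCount == nil || *spec.NominationCount < 1 {
		return 1
	}
	return *spec.NominationCount
}

// nominate returns the first K entries of an already ranked cluster list,
// or the whole list when it holds fewer than K clusters.
func nominate(ranked []string, spec dispatcherProfileSpec) []string {
	k := effectiveNominationCount(spec)
	if k > len(ranked) {
		k = len(ranked)
	}
	return ranked[:k]
}

func main() {
	ranked := []string{"cluster-a", "cluster-b", "cluster-c"}
	two := 2

	fmt.Println(nominate(ranked, dispatcherProfileSpec{}))
	fmt.Println(nominate(ranked, dispatcherProfileSpec{NominationCount: &two}))
}
```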

BTW, I'd reconsider whether adding this would be valuable right away.
How well can we predict if a worker will actually run the workload?
I think not too well, given preemptions, borrowing, ProvReqs etc.
(I mean, even with #10105 in place. That feature is just about some rough quota / usage stats; it won't solve any of the aspects mentioned above).

amy · 2026-05-20T22:05:58Z

"Hello @amy, my apologies for reading it so late.
I generally like the ideas here, and I think the topic is essential for MultiKueue.
I didn't manage to read very deeply - but let me leave a few early questions."

@olekzabl No worries! I just plopped it here, even though it's very much a WIP, because I wanted to let people know I'm thinking about it and really need it. I'll probably engage with this deeply towards the end of the 0.19 / start of the 0.20 release.

Also, tbh, I need to do more work on understanding the gaps that need to be filled for this KEP to be possible.

k8s-ci-robot added the release-note-none (Denotes a PR that doesn't merit a release note.) and do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) labels Apr 26, 2026

k8s-ci-robot requested review from kshalot and sohankunkerkar April 26, 2026 04:05

k8s-ci-robot added the size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.) and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) labels Apr 26, 2026

amy force-pushed the composable branch 2 times, most recently from 05b8f48 to dbde49d Compare April 26, 2026 04:06

composable multikueue dispatcher

4ec7a75

amy force-pushed the composable branch from dbde49d to 4ec7a75 Compare April 26, 2026 04:08

amy mentioned this pull request Apr 26, 2026

[MultiKueue] Composable dispatcher KEP #10766

Open

3 tasks

k8s-ci-robot added the area/multikueue (Issues or PRs related to MultiKueue) label Apr 26, 2026

mwielgus reviewed Apr 27, 2026

olekzabl reviewed May 20, 2026

This was referenced May 21, 2026

[MultiKueue] Cross-cluster preemption — reclaim cohort quota from sibling-cluster borrowers #11375

Open

KEP-11375: MultiKueue Cross-Cluster Preemption (alpha) #11376

Open
