WIP: [KEP] composable multikueue dispatcher #10784

Open: amy wants to merge 1 commit into kubernetes-sigs:main from amy:composable

@amy amy commented Apr 26, 2026

What type of PR is this?

THIS IS VERY MUCH A WIP. Made this PR so that people know that it's being worked on.

/kind kep

What this PR does / why we need it:

We need a composable dispatcher that allows for different routing policies. The overall goal is to treat clusters like nodes, with concepts like filtering and scoring, and also node affinity/anti-affinity.

Which issue(s) this PR partially fixes:

Part of: #10766

Special notes for your reviewer:

It's a WIP.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot added the labels `release-note-none` (denotes a PR that doesn't merit a release note) and `do-not-merge/work-in-progress` (indicates that a PR should not merge because it is a work in progress) on Apr 26, 2026

netlify Bot commented Apr 26, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | 4ec7a75 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/69ed8fa3c8a2bb0008d7671c |

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: amy
Once this PR has been reviewed and has the lgtm label, please assign gabesaba for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the labels `size/XL` (denotes a PR that changes 500-999 lines, ignoring generated files) and `cncf-cla: yes` (indicates the PR's author has signed the CNCF CLA) on Apr 26, 2026
@amy force-pushed the composable branch 2 times, most recently from 05b8f48 to dbde49d, on April 26, 2026 04:06

amy commented Apr 26, 2026

/area multikueue

@k8s-ci-robot added the label `area/multikueue` (issues or PRs related to MultiKueue) on Apr 26, 2026

@k8s-ci-robot

@amy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-kueue-verify-main | 4ec7a75 | link | true | `/test pull-kueue-verify-main` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

round timeout. Cluster order is unrelated to any workload or cluster property, so
placement is essentially arbitrary.

Neither strategy allows workloads to express placement requirements ("run only on GPU clusters

Contributor

Playing devil's advocate a bit: in theory there can be a completely custom, user-provided dispatcher. I'm wondering why we need fine-grained extension points, which would require some controller anyway. Maybe we should just provide a reference implementation and encourage users to fork it?

Contributor Author

I need a cloud-agnostic framework that users can use across clouds, and then the ability to offer custom configuration recommendations. The first pass of this would be an in-tree dispatcher anyway. See: https://github.com/kubernetes-sigs/kueue/pull/10784/changes/BASE..4ec7a758f51f1527cfda7c9e527cc2f307db2f3b#diff-ab13a7d3e9220a7d4c936bdf44adc4aaffdd91eecc1d524da90bac251b82ba5aR106

I need filtering and scoring semantics for basic affinity/anti-affinity. Sure, people could write a new dispatcher that has no policy configuration and has the policy embedded in the code. But that is a bad experience and not reusable.

This is just another dispatcher option that sits alongside Incremental and AllAtOnce as a third option.
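
To make the filter-then-score idea concrete, here is a minimal Go sketch. Every name in it (`Cluster`, `Workload`, `FilterPlugin`, `ScorePlugin`, `Rank`) is hypothetical and chosen for illustration only; it is not the API proposed in the KEP, just one way the scheduler-like semantics could be composed.

```go
package main

import (
	"fmt"
	"sort"
)

// Cluster is a hypothetical, simplified view of a worker cluster.
type Cluster struct {
	Name   string
	Labels map[string]string
}

// Workload is a hypothetical workload with required and preferred cluster labels.
type Workload struct {
	ClusterSelector map[string]string // hard requirement
	PreferredLabels map[string]string // soft preference
}

// FilterPlugin rejects clusters that must not receive the workload.
type FilterPlugin interface {
	Filter(w Workload, c Cluster) bool
}

// ScorePlugin ranks clusters that passed filtering; higher is better.
type ScorePlugin interface {
	Score(w Workload, c Cluster) int
}

type selectorFilter struct{}

// Filter passes only clusters whose labels match every selector entry.
func (selectorFilter) Filter(w Workload, c Cluster) bool {
	for k, v := range w.ClusterSelector {
		if c.Labels[k] != v {
			return false
		}
	}
	return true
}

type preferenceScore struct{}

// Score counts how many preferred labels the cluster carries.
func (preferenceScore) Score(w Workload, c Cluster) int {
	score := 0
	for k, v := range w.PreferredLabels {
		if c.Labels[k] == v {
			score++
		}
	}
	return score
}

// Rank returns feasible cluster names by descending total score, ties broken by name.
func Rank(w Workload, clusters []Cluster, filters []FilterPlugin, scorers []ScorePlugin) []string {
	type scored struct {
		name  string
		score int
	}
	var feasible []scored
	for _, c := range clusters {
		ok := true
		for _, f := range filters {
			if !f.Filter(w, c) {
				ok = false
				break
			}
		}
		if !ok {
			continue
		}
		total := 0
		for _, s := range scorers {
			total += s.Score(w, c)
		}
		feasible = append(feasible, scored{c.Name, total})
	}
	sort.Slice(feasible, func(i, j int) bool {
		if feasible[i].score != feasible[j].score {
			return feasible[i].score > feasible[j].score
		}
		return feasible[i].name < feasible[j].name
	})
	names := make([]string, 0, len(feasible))
	for _, f := range feasible {
		names = append(names, f.name)
	}
	return names
}

func main() {
	clusters := []Cluster{
		{Name: "cpu-east", Labels: map[string]string{"accelerator": "none", "region": "east"}},
		{Name: "gpu-east", Labels: map[string]string{"accelerator": "gpu", "region": "east"}},
		{Name: "gpu-west", Labels: map[string]string{"accelerator": "gpu", "region": "west"}},
	}
	w := Workload{
		ClusterSelector: map[string]string{"accelerator": "gpu"},
		PreferredLabels: map[string]string{"region": "west"},
	}
	fmt.Println(Rank(w, clusters, []FilterPlugin{selectorFilter{}}, []ScorePlugin{preferenceScore{}}))
}
```

The point of the split is that the policy lives in the plugin list passed to `Rank`, not in the dispatcher loop, so a different routing policy is a different list rather than a fork.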

@olekzabl olekzabl left a comment

Hello @amy, my apologies for reading it so late.

I generally like the ideas here, and I think the topic is essential for MultiKueue.

I didn't manage to read very deeply - but let me leave a few early questions.

Comment on lines +406 to +408
| `clusterSelector` | `kueue.x-k8s.io/cluster-selector` (JSON `map[string]string`) |
| `clusterAffinity` | `kueue.x-k8s.io/cluster-affinity` (JSON-encoded `ClusterAffinity`) |
| `clusterTolerations` | `kueue.x-k8s.io/cluster-tolerations` (JSON array of `ClusterToleration`) |

I'm anxious about these JSON encodings.
AFAICS JSON encoding does not happen for "pod placement fields", nor elsewhere in Kueue; even in K8s ecosystem it seems rather rare (and, when present, not fully standardized).

For the user, this would be typo-prone. (Cf. item 1 in your Risks, but not just label values, also JSON syntax).
I understand it's webhook-enforceable, still unpleasant.

I do see where it comes from - we want these specified for the original jobs (not Workloads), which can have various types; hence annotations; hence flat string values.

Still, maybe there are alternatives?
E.g. define a new structured CRD to hold all these fields and just refer to an instance of this CRD in a job annotation value?
(We could also accept annotations for Alpha, for more flexible experimentation, and then plan to standardize on a CRD for Beta.)
This could win some type safety, but at the cost of spreading the spec across resources.

WDYT?
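
To illustrate the typo concern, here is a minimal Go sketch of what decoding one of these annotations could look like. The annotation key is taken from the table quoted above; the helper `parseClusterSelector` is hypothetical and not part of the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Annotation key as quoted from the KEP table above.
const clusterSelectorAnnotation = "kueue.x-k8s.io/cluster-selector"

// parseClusterSelector (hypothetical helper) decodes the JSON map stored in the
// annotation value. It returns (nil, nil) when the annotation is absent.
func parseClusterSelector(annotations map[string]string) (map[string]string, error) {
	raw, ok := annotations[clusterSelectorAnnotation]
	if !ok {
		return nil, nil
	}
	var sel map[string]string
	if err := json.Unmarshal([]byte(raw), &sel); err != nil {
		return nil, fmt.Errorf("invalid %s annotation: %w", clusterSelectorAnnotation, err)
	}
	return sel, nil
}

func main() {
	good := map[string]string{clusterSelectorAnnotation: `{"accelerator":"gpu"}`}
	// A single missing brace is only detected when the value is decoded.
	bad := map[string]string{clusterSelectorAnnotation: `{"accelerator":"gpu"`}

	sel, err := parseClusterSelector(good)
	fmt.Println(sel, err)

	_, err = parseClusterSelector(bad)
	fmt.Println("bad annotation rejected:", err != nil)
}
```

With a flat string value, syntax errors surface only at decode time (in a webhook or in the dispatcher), whereas a structured CRD field would be rejected by API-server schema validation.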

Comment on lines +726 to +727
- `ClusterFeasibility` filter fully implemented (requires [kubernetes-sigs/kueue#10105](https://github.com/kubernetes-sigs/kueue/issues/10105)).
- `CapacityScore` plugin implemented (also requires [kubernetes-sigs/kueue#10105](https://github.com/kubernetes-sigs/kueue/issues/10105)).

How do these two differ?
Is it filter ("required") vs. score ("preferred")? Or something more?

// Only evaluated when a MultiKueue AdmissionCheck is active on the workload's
// ClusterQueue.
// +optional
ClusterSelector map[string]string `json:"clusterSelector,omitempty"`

Nit: do I understand correctly that the value (even after JSON-decoding) will not be of the ClusterSelector type defined below? (Instead, it'll be a standard label selector)?
If so, this may be confusing.

Comment on lines +98 to +100
- Implement `ClusterFeasibility` filter: rejects clusters that cannot admit the workload
based on quota headroom visible in `MultiKueueCluster.Status`. Blocked on
[kubernetes-sigs/kueue#10105](https://github.com/kubernetes-sigs/kueue/issues/10105)

What is meant by "quota headroom" - is it "unreserved capacity"?
If so, defining a strict filter based on that may be too strict - even if a workload "seems not to fit", it could still borrow or preempt.

I imagine we could strictly reject a workload if its requests exceed worker (whole) capacity.
"Unreserved capacity" feels like valuable information - though ideally only for soft scoring.
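
The distinction suggested here can be sketched in Go as a hard filter on whole capacity plus a soft score on unreserved capacity. The type `ClusterCapacity`, the single-resource model, and the 0 to 100 score range are all assumptions for illustration; the KEP and #10105 may expose this information differently.

```go
package main

import "fmt"

// ClusterCapacity is a hypothetical summary of one resource on a worker cluster.
type ClusterCapacity struct {
	Name     string
	Total    int64 // whole capacity of the worker
	Reserved int64 // capacity currently reserved by admitted workloads
}

// fitsAtAll is the hard filter: reject only when the request exceeds the
// worker's whole capacity, since borrowing or preemption cannot help then.
func fitsAtAll(request int64, c ClusterCapacity) bool {
	return request <= c.Total
}

// headroomScore is the soft score in [0, 100]: the fraction of the request
// covered by currently unreserved capacity.
func headroomScore(request int64, c ClusterCapacity) int64 {
	unreserved := c.Total - c.Reserved
	if unreserved <= 0 {
		return 0
	}
	if unreserved >= request {
		return 100
	}
	return unreserved * 100 / request
}

func main() {
	const request = 8
	clusters := []ClusterCapacity{
		{Name: "small", Total: 4, Reserved: 0},
		{Name: "busy", Total: 16, Reserved: 12},
		{Name: "idle", Total: 16, Reserved: 2},
	}
	for _, c := range clusters {
		if !fitsAtAll(request, c) {
			fmt.Printf("%s: filtered out\n", c.Name)
			continue
		}
		fmt.Printf("%s: score %d\n", c.Name, headroomScore(request, c))
	}
}
```

Under this split, a busy cluster is ranked lower but stays eligible, so the workload can still land there and borrow or preempt.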

}
```

### Workload Annotation: Dispatch Mode

I can't understand the rationale for BestEffort.
I'm not sure where the misunderstanding is, so let me try to nail it down with the following questions / remarks:

  1. In BestEffort, when a workload is dispatched to a worker and the AC is marked Ready, what will be the workload status on manager & selected worker?

    A: Admitted on manager / QuotaReserved on worker?
    B: Admitted on both (even though actually not running yet on the worker)?

    I assumed A but my AI friend claimed it's B. Hence asking.

  2. If it's 1B ("Admitted on both"), doesn't it break some contracts?
    What if a worker has its own AdmissionChecks - would we just magically skip them?

  3. "If the remote cluster later evicts the workload, the eviction propagates back to the manager"
    You say this for BestEffort - but why wouldn't it hold also for MustBeAdmitted?

  4. "If the remote does not admit within workerLostTimeout, the dispatcher retries with the next-best cluster."
    You say this for MustBeAdmitted - but why wouldn't it hold also for BestEffort?
    (At least assuming 1A = "QuotaReserved on worker" - this could stay for long, even in BestEffort).

  5. You mention workerLostTimeout but I feel it doesn't fit in this story.

    • IIUC the intent behind workerLostTimeout is to handle cases when we lost connection to a worker cluster on which the remote workload had already started; see this comment and the description of [multikueue] Manage worker cluster unavailability #1681.
      In that case, we want to speculatively assume for some time that the workload is still running there - to avoid re-dispatching which could later turn out to have been wasteful.
    • The case in this KEP seems very different. Here (IIUC) we still can see the worker cluster but the workload failed to start on it.
      This feels more like waitForPodsReady (though admittedly can't be directly expressed by that).
  6. I can't help the feeling that the choice between BestEffort / MustBeAdmitted:

    • will not affect the actual status (running / not running) of the workload on the 1st selected worker, or the timeline of that status
    • ideally should not affect the timeline of manager's reaction to any worker-side complications
    • and hence, by induction, ideally should have no effect on "placement latency" at all
    • by definition, it affects local workload statuses - but this feels just abstract labeling?
  7. Historically, the early MultiKueue behavior was "mark the workload Admitted on the manager as soon as it got QuotaReserved on some worker" - but then, it was considered imperfect and evolved towards "let manager-side admission follow worker-side admission" (MultiKueue: workloads from worker clusters are deleted prematurely #8585).
    In this context, introducing BestEffort seems to be a step in the opposite direction (and even more so making it default).
    So I'd really like to understand the reason for this.

Comment on lines +778 to +781
- **Top-K nomination.** The current dispatcher nominates exactly one cluster per cycle
(top-1). A future KEP can add parallel nomination of K clusters simultaneously as an
opt-in `nominationMode: TopK` field on `ClusterDispatcherProfile`, reducing tail latency
when the top-scoring cluster is slow to admit. Top-1 remains the default.

@olekzabl olekzabl May 20, 2026

Or, without introducing "nomination modes", just expose an *int for K, where nil means 1.
Cf. KEP-9270 which proposes a similar thing for the incremental dispatcher.

BTW I'd reconsider if adding this wouldn't be valuable right away.
How well can we predict if a worker will actually run the workload?
I think not too well, given preemptions, borrowing, ProvReqs etc.
(I mean, even with #10105 in place. That feature is just about some rough quota / usage stats; it won't solve any of the aspects mentioned above).
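
A minimal Go sketch of the "`*int` where nil means 1" suggestion follows. The helper names `effectiveK` and `nominate` are hypothetical, and clamping values below 1 up to 1 is an assumption of this sketch (an API would more likely reject them in validation).

```go
package main

import "fmt"

// effectiveK resolves the optional K setting: nil means 1.
// Assumption: values below 1 are clamped to 1 instead of being rejected.
func effectiveK(k *int) int {
	if k == nil || *k < 1 {
		return 1
	}
	return *k
}

// nominate returns the first K entries of an already-ranked cluster list,
// or the whole list when it has fewer than K entries.
func nominate(ranked []string, k *int) []string {
	n := effectiveK(k)
	if n > len(ranked) {
		n = len(ranked)
	}
	return ranked[:n]
}

func main() {
	ranked := []string{"gpu-west", "gpu-east", "cpu-east"}

	fmt.Println(nominate(ranked, nil)) // default: top-1

	k := 2
	fmt.Println(nominate(ranked, &k)) // opt-in: top-2
}
```

An optional integer keeps top-1 as the default without a separate mode field, since the unset value and `K = 1` behave identically.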

amy commented May 20, 2026

> Hello @amy, my apologies for reading it so late.
> I generally like the ideas here, and I think the topic is essential for MultiKueue.
> I didn't manage to read very deeply - but let me leave a few early questions.

@olekzabl No worries! I just plopped it here, even though it's very much a WIP, because I wanted to let people know I'm thinking about it and really need it. I'll probably engage with this deeply towards the end of the 0.19 / start of the 0.20 release.

Also, tbh I need to do more work on understanding the gaps that need to be filled in order for this KEP to be possible.


4 participants