WIP: [KEP] composable multikueue dispatcher#10784
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: amy The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
05b8f48 to
dbde49d
Compare
|
/area multikueue |
|
@amy: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| round timeout. Cluster order is unrelated to any workload or cluster property, so | ||
| placement is essentially arbitrary. | ||
|
|
||
| Neither strategy allows workloads to express placement requirements ("run only on GPU clusters |
There was a problem hiding this comment.
Playing a devil's advocate a bit. In theory there can be a completely custom, user-provided dispatcher. I'm wondering why do we need fine grain extensions points which would require some controller anyway. Maybe we should just provide a reference implementation and encourage users to fork it?
There was a problem hiding this comment.
I need a cloud agnostic framework that users use across clouds. And then be able to have custom configuration recommendations. The first pass of this would be a dispatcher that's in tree anyways. See: https://github.com/kubernetes-sigs/kueue/pull/10784/changes/BASE..4ec7a758f51f1527cfda7c9e527cc2f307db2f3b#diff-ab13a7d3e9220a7d4c936bdf44adc4aaffdd91eecc1d524da90bac251b82ba5aR106
Like I need to be able to have filtering and scoring semantics for basic affinity/anti-affinity stuff. So sure you could have people write a new one that has no policy configuration but have the policy thats embedded into the code. But that... is a bad experience and not reusable
This is just another dispatcher option that sits alongside Incremental and AllAtOnce as a third option.
| | `clusterSelector` | `kueue.x-k8s.io/cluster-selector` (JSON `map[string]string`) | | ||
| | `clusterAffinity` | `kueue.x-k8s.io/cluster-affinity` (JSON-encoded `ClusterAffinity`) | | ||
| | `clusterTolerations` | `kueue.x-k8s.io/cluster-tolerations` (JSON array of `ClusterToleration`) | |
There was a problem hiding this comment.
I'm anxious about these JSON encodings.
AFAICS JSON encoding does not happen for "pod placement fields", nor elsewhere in Kueue; even in K8s ecosystem it seems rather rare (and, when present, not fully standardized).
For the user, this would be typo-prone. (Cf. item 1 in your Risks, but not just label values, also JSON syntax).
I understand it's webhook-enforceable, still unpleasant.
I do see where it comes from - we want these specified for the original jobs (not Workloads), which can have various types; hence annotations; hence flat string values.
Still, maybe there are alternatives?
E.g. define a new structured CRD to hold all these fields and just refer an instance of this CRD in a job annotation value?
(We could also accept annotations for Alpha, for more elastic experimenting, and then plan for standardizing into a CRD for Beta).
This could win some type safety, but at the cost of inconvenience of spreading spec across resources.
WDYT?
| - `ClusterFeasibility` filter fully implemented (requires [kubernetes-sigs/kueue#10105](https://github.com/kubernetes-sigs/kueue/issues/10105)). | ||
| - `CapacityScore` plugin implemented (also requires [kubernetes-sigs/kueue#10105](https://github.com/kubernetes-sigs/kueue/issues/10105)). |
There was a problem hiding this comment.
How do these 2 differ?
Is it filter ("required") vs. score ("preferred")? Or sth more?
| // Only evaluated when a MultiKueue AdmissionCheck is active on the workload's | ||
| // ClusterQueue. | ||
| // +optional | ||
| ClusterSelector map[string]string `json:"clusterSelector,omitempty"` |
There was a problem hiding this comment.
Nit: do I understand correctly that the value (even after JSON-decoding) will not be of the ClusterSelector type defined below? (Instead, it'll be a standard label selector)?
If so, this may be confusing.
| - Implement `ClusterFeasibility` filter: rejects clusters that cannot admit the workload | ||
| based on quota headroom visible in `MultiKueueCluster.Status`. Blocked on | ||
| [kubernetes-sigs/kueue#10105](https://github.com/kubernetes-sigs/kueue/issues/10105) |
There was a problem hiding this comment.
What is meant by "quota headroom" - is it "unreserved capacity"?
If so, defining a strict filter based on that may be too strict - even if a workload "seems not to fit", it could still borrow or preempt.
I imagine we could strictly reject a workload if its requests exceed worker (whole) capacity.
"Unreserved capacity" feels like valuable information - though ideally only for soft scoring.
| } | ||
| ``` | ||
|
|
||
| ### Workload Annotation: Dispatch Mode |
There was a problem hiding this comment.
I can't understand the rationale for BestEffort.
I'm not sure when the misunderstanding is - so let me try nail it down with the following questions / remarks:
-
In
BestEffort, when a workload is dispatched to a worker and the AC is markedReady, what will be the workload status on manager & selected worker?A:
Admittedon manager /QuotaReservedon worker?
B:Admittedon both (even though actually not running yet on the worker)?I assumed A but my AI friend claimed it's B. Hence asking.
-
If it's 1B ("
Admittedon both"), doesn't it break some contracts?
What if a worker has its own AdmissionChecks - would we just magically skip them? -
If the remote cluster later evicts the workload, the eviction propagates back to the manager
You say this forBestEffort- but why wouldn't it hold also forMustBeAdmitted? -
If the remote does not admit within
workerLostTimeout, the dispatcher retries with the next-best cluster.
You say this forMustBeAdmitted- but why wouldn't it hold also forBestEffort?
(At least assuming 1A = "QuotaReservedon worker" - this could stay for long, even inBestEffort). -
You mention
workerLostTimeoutbut I'm feeling it doesn't fit in this story.- IIUC the intent behind
workerLostTimeoutis to handle cases when we lost connection to a worker cluster on which the remote workload had already started; see this comment and the description of [multikueue] Manage worker cluster unavailability #1681.
In that case, we want to speculatively assume for some time that the workload is still running there - to avoid re-dispatching which could later turn out to have been wasteful. - The case in this KEP seems very different. Here (IIUC) we still can see the worker cluster but the workload failed to start on it.
This feels more likewaitForPodsReady(though admittedly can't be directly expressed by that).
- IIUC the intent behind
-
I can't help the feeling that the choice between
BestEffort/MustBeAdmitted:- will not affect the actual status (running / not running) of the workload on the 1st selected worker, or the timeline of that status
- ideally should not affect the timeline of manager's reaction to any worker-side complications
- and hence, by induction, ideally should have no effect on "placement latency" at all
- by definition, it affects local workload statuses - but this feels just abstract labeling?
-
Historically, the early MultiKueue behavior was "mark the workload
Admittedon the manager as soon as it gotQuotaReservedon some worker" - but then, it was considered imperfect and evolved towards "let manager-side admission follow worker-side admission" (MultiKueue: workloads from worker clusters are deleted prematurely #8585).
In this context, introducingBestEffortseems to be a step in the opposite direction (and even more so making it default).
So I'd really like to understand the reason for this.
| - **Top-K nomination.** The current dispatcher nominates exactly one cluster per cycle | ||
| (top-1). A future KEP can add parallel nomination of K clusters simultaneously as an | ||
| opt-in `nominationMode: TopK` field on `ClusterDispatcherProfile`, reducing tail latency | ||
| when the top-scoring cluster is slow to admit. Top-1 remains the default. |
There was a problem hiding this comment.
Or, without introducing "nomination modes", just expose an *int for K, where nil means 1.
Cf. KEP-9270 which proposes a similar thing for the incremental dispatcher.
BTW I'd reconsider if adding this wouldn't be valuable right away.
How well can we predict if a worker will actually run the workload?
I think not too well, given preemptions, borrowing, ProvReqs etc.
(I mean, even with #10105 in place. That feature is just about some rough quota / usage stats; it won't solve any of the aspects mentioned above).
@olekzabl No worries! Just plopped it here even though it's very much a wip bc I wanted to let people know I'm thinking about it and really need it. I'll probably engage with this deeply towards the end of 0.19 / start of 0.20 release. Also, tbh I need to do more work on understanding the gaps that need to be filled in order for this kep to be possible. |
What type of PR is this?
THIS IS VERY MUCH A WIP. Made this PR so that people know that its being worked on.
/kind kepWhat this PR does / why we need it:
We need a composable dispatcher that allows for different routing policies. The overall goal is to treat clusters like nodes with concepts like filtering and scoring, and also node anti/affinity.
Which issue(s) this PR partially fixes:
Part of: #10766
Special notes for your reviewer:
its a WIP
Does this PR introduce a user-facing change?