Move AllAtOnce MultiKueue dispatcher to a dedicated controller#10937
Move AllAtOnce MultiKueue dispatcher to a dedicated controller#10937andrewseif wants to merge 18 commits into
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
Hi @andrewseif. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: andrewseif The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/ok-to-test |
109b30d to
fb7c981
Compare
mimowo
left a comment
There was a problem hiding this comment.
Thank you for the effort 👍
|
I had to add some logic to watch the config and clusters, as these were free in the inline version, but the logic itself is mostly similar |
|
/retest |
|
cc @Singularity23x0 @olekzabl @kshalot ptal |
| if workload.IsEvicted(remoteWl) { | ||
| log.V(3).Info("Preserving evicted remote workload to allow eviction-recovery sync", "remote", rem) | ||
| continue | ||
| } |
There was a problem hiding this comment.
Interesting does it mean we have a bug for the Increamental dispatcher which is already extracted? If this is the case maybe we could start by showing the bug, and fixing it here. It will also make the PRs more dijestable by the split.
There was a problem hiding this comment.
I wouldn't say it was a bug, but there was a race window that was apparent, when I extracted the AllAtOnce MK dispatcher, ill make sure to post the findings for verification in wg-batch for more experienced eyes to verify my findings
There was a problem hiding this comment.
I have trouble following the issue description in the comment here.
Mostly because, in the handling of Evicted condition which you mentioned (I suppose, here), the only explicit call to SyncJob is here, i.e. in the case of a manager-originating eviction, while your case seems to be the other one (the worker-originating eviction, dealt with here), given that you care whether the manager will notice.
This might be not yet contradictory; maybe you've found a longer path (a ping-pong across a few reconcilers?) leading to calling SyncJob also in the worker-originating case?
But anyway, my bottom line is:
- +1 to documenting this as a separate issue #N
- and then, instead of summarizing that issue in a comment here, I'd just leave a link to #N, because:
- even a several-lines summary can be hard to follow (as I'm right now experiencing)
- an even longer summary does not feel fit in this place
- #N will act as a place where we can further discuss (while such comments are more "frozen")
There was a problem hiding this comment.
I have sent you my investigation findings, maybe you can verify them, I believe moving the AllAtOnce dispatcher created this, and it should be part of the PR, as without it the program won't function properly.
There was a problem hiding this comment.
I wouldn't say it was a bug, but there was a race window that was apparent, when I extracted the AllAtOnce MK dispatcher, ill make sure to post the findings for verification in wg-batch for more experienced eyes to verify my findings
@andrewseif actually most of "races" are bugs, so I would like to understand which test and how is failing. It very well might be that you have discovered a bug we should extract to a separate preparatory bugfix PR we should cherrypick.
In order to let us understand what is the race I would recommend that you temporarily revert (or comment out the code) so that we can see what is the failure, and analyze it. Then we can make an informed decision if this is a separate bugfix or part of this PR.
|
@andrewseif I 'm planning to conclude the review next week, overall it looks great, but another pair of eyes from MK experts (@olekzabl or @kshalot ) would be great. Thank you for the effort once again 👍 |
olekzabl
left a comment
There was a problem hiding this comment.
First of all, thank you for doing this!
Then, even before going into detailed comments, I'm feeling I should start from a "provoking" question:
The incremental dispatcher is currently being parametrized with step size (#10877).
Given that, what if we "implemented" the "externalized" AllAtOnce just as a special case of that?
standaloneAllAtOnce := incrementalDispatcher{stepSize: 1000000}
You could say that's too simplistic, wasting performance etc. Because incremental dispatcher contains some bits of logic that we don't need. Perhaps.
But even if so, I'd like to ask how much of them we can have in common. Maybe extracting some shared pieces. Or maybe a central shared entry point with injectable per-case callbacks. IDK yet.
I'm just intuitively afraid of nearly-duplicating ~200 lines of code which may then diverge without a good reason. (And, in my eyes, this "unjustified divergence" shows up already in this PR. See my detailed comments).
I haven't yet read everything but must pause now. Will come back later.
| if workload.IsEvicted(remoteWl) { | ||
| log.V(3).Info("Preserving evicted remote workload to allow eviction-recovery sync", "remote", rem) | ||
| continue | ||
| } |
There was a problem hiding this comment.
I have trouble following the issue description in the comment here.
Mostly because, in the handling of Evicted condition which you mentioned (I suppose, here), the only explicit call to SyncJob is here, i.e. in the case of a manager-originating eviction, while your case seems to be the other one (the worker-originating eviction, dealt with here), given that you care whether the manager will notice.
This might be not yet contradictory; maybe you've found a longer path (a ping-pong across a few reconcilers?) leading to calling SyncJob also in the worker-originating case?
But anyway, my bottom line is:
- +1 to documenting this as a separate issue #N
- and then, instead of summarizing that issue in a comment here, I'd just leave a link to #N, because:
- even a several-lines summary can be hard to follow (as I'm right now experiencing)
- an even longer summary does not feel fit in this place
- #N will act as a place where we can further discuss (while such comments are more "frozen")
| return reconcile.Result{}, nil | ||
| } | ||
|
|
||
| // The workload is already assigned to a cluster, no need to nominate workers. |
There was a problem hiding this comment.
Nit: this comment feels not very useful, given that it's duplicated in the log text just below.
(Though I'm aware that it looks so also in incrementaldispatcher.go).
| // filterActiveClusters returns the subset of remoteClusters whose MultiKueueCluster | ||
| // has the MultiKueueClusterActive condition set to True. Clusters that are missing | ||
| // or not active are excluded so they are not nominated for workload placement. | ||
| func (r *AllAtOnceDispatcherReconciler) filterActiveClusters(ctx context.Context, remoteClusters sets.Set[string]) (sets.Set[string], error) { | ||
| active := sets.New[string]() | ||
| for clusterName := range remoteClusters { | ||
| cluster := &kueue.MultiKueueCluster{} | ||
| if err := r.client.Get(ctx, types.NamespacedName{Name: clusterName}, cluster); err != nil { | ||
| if client.IgnoreNotFound(err) != nil { | ||
| return nil, err | ||
| } | ||
| // Missing cluster: skip. | ||
| continue | ||
| } | ||
| if apimeta.IsStatusConditionTrue(cluster.Status.Conditions, kueue.MultiKueueClusterActive) { | ||
| active.Insert(clusterName) | ||
| } | ||
| } | ||
| return active, nil | ||
| } |
There was a problem hiding this comment.
This whole filterActiveClusters seems to be mostly an optimization that you've added "by the way"?
Looking at the original logic, IIUC, the set of nominated workers was based just on group.remotes (here), that in turn based on this call. Digging deeper, I didn't see anything like checking MultiKueueClusterActive. I guess it makes sense, true, but I'd vote for separating optimizations from refactors (I mean, into separate PRs).
For this refactoring PR, I'd consider ways to inject wlReconciler into this reconciler and just call its remoteClientsForAc method. (There are precedents, e.g. wlReconciler knows the clustersReconciler, here). Not necessarily strictly this way, but sth like this, to reduce duplication of code.
Then, in a follow-up PR, you're welcome to add this smarter filtering. But maybe not only here? Maybe the other dispatchers could also benefit from that?
WDYT?
There was a problem hiding this comment.
This was not intended as smart filtering.
it was added because the integration test test/integration/multikueue/setup_test.go:L752 was failing on the refactor branch. That test ("Should properly detect insecure kubeconfig of MultiKueueClusters and remove remote client") explicitly asserts that an inactive cluster does not appear in Status.NominatedClusterNames.
The old code passed this test structurally because nominations came from group.remotes (in-process map maintained by clustersReconciler), which never contains a disconnected cluster. The new dispatcher pulls from admissioncheck.GetRemoteClusters() which returns configured clusters regardless of activity, so without the filter that test fails.
There was a problem hiding this comment.
That's correct, it only proves how embedded AllAtOnce was, good catch @andrewseif
| } | ||
|
|
||
| func (r *AllAtOnceDispatcherReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { | ||
| log := ctrl.LoggerFrom(ctx) |
There was a problem hiding this comment.
AFAICS the first 40 lines of this method are almost identical as in incrementaldispatcher.go.
This raises 2 questions:
-
Can this be unified?
The differences seem not very blocking - they're only aboutr.clearRoundStartTimein the incr dispatcher; this could be passed as a callback "what should the common code do with a troubling error". -
The first real difference is your newly-added special handling of eviction.
Though then - maybe it'd make sense to add it to the incremental dispatcher as well?
(Hence, again, I'd prefer to deal with it in a dedicated issue, and a dedicated fixing PR, separate from refactoring).
|
I apologize @andrewseif , I must declare bankruptcy on this PR, at least until Wednesday.
Well... there still are some comments from me which you haven't responded to? |
|
/test pull-kueue-priority-booster-test-integration-main |
|
[pull-kueue-verify-main] is a small linter fix I need to deploy, for the open comments, I think the investigation answers most of them, except the architectural one, specifically this
I think this might need to be discussed in our wg-batch meeting. I can answer from a software design perspective, but I am not sure I can answer from a kueue architecture direction, which I think @olekzabl is referring to |
… incremental one.
…until job reconciler runs
…id race condition windows
…licate the log error
Co-authored-by: Olek Zabłocki <olekz@google.com>
Co-authored-by: Olek Zabłocki <olekz@google.com>
Co-authored-by: Olek Zabłocki <olekz@google.com>
Co-authored-by: Olek Zabłocki <olekz@google.com>
4b13c9b to
83f6a2b
Compare
My intent basically is to reduce the divergences between AllAtOnce and Incremental, because such divergences - especially if not clear at the first glance - feel like a risk of having some issues on one of the sides. I consider it as a software engineering healthy practice, rather than "Kueue architecture". |
I totally agree there is a lot of duplication between the two dispatchers, and we should commonize the code. Additionally, the new dispatcher seems to do a much better job by tracking changes to MultiKueueConfig, which the previous didn't do - and I think this is a bug in the incrementaldispatcher which just didn't notice. We could get the bug for incremental dispatcher fixed for free if we commonized the code. On "how to" commonize the code I'm wondering architecturally what is better - and I'm not sure:
I'm leaning towards (1.) as this avoids duplication for the event handlers. I thought an argument for (2.) could be load separation, so similarly as @olekzabl suggested, but I just wouldn't say "Incremental with N = 1000000", but GenericDispatcher which supports both modes. Let me know @andrewseif if this makes sense. |
|
@andrewseif I read through your investigation, it appears to be correct. |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@andrewseif: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What type of PR is this?
/kind cleanup
/area multikueue
What this PR does / why we need it:
Move AllAtOnce MultiKueue dispatcher to a dedicated controller
Which issue(s) this PR fixes:
Fixes #6803
Special notes for your reviewer:
Does this PR introduce a user-facing change?