MultiKueue: one slow remote no longer stalls every other cluster#11305
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
Hi @trilamsr. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
466ad52 to
b1e0334
Compare
|
@mimowo took a shot at the lock refactor from #11297 since I was already in the file. TDD path: failing concurrency test on |
b1e0334 to
a6cd964
Compare
| @@ -497,8 +501,6 @@ func (c *clustersReconciler) stopAndRemoveCluster(clusterName string) { | |||
|
|
|||
| func (c *clustersReconciler) setRemoteClientConfig(ctx context.Context, clusterName string, config *clientConfig, origin string) (*time.Duration, error) { | |||
There was a problem hiding this comment.
Let's split this function into two, so that the we can use the c.lock.Lock(); defer c.lock.Unlock() for both of them
There was a problem hiding this comment.
Done in 37cac3f. Factored the map find or insert out into findOrCreateRemoteClient, so both functions use the simple Lock(); defer Unlock() pattern. PTAL.
mimowo
left a comment
There was a problem hiding this comment.
Thank you, the code changes at the test look great. Left rather minor remarks to make the test more readable and future proof.
|
|
||
| // Regression for #11297. A remote stuck inside Watch must not stop | ||
| // reconciles of other clusters. | ||
| func TestSetRemoteClientConfigDoesNotBlockOtherClusters(t *testing.T) { |
There was a problem hiding this comment.
Nice test, thanks for adding 👍
| slowDone := make(chan struct{}) | ||
| go func() { | ||
| defer close(slowDone) | ||
| _, _ = reconciler.setRemoteClientConfig(ctx, "cluster-slow", slowConfig, defaultOrigin) |
There was a problem hiding this comment.
I'm wondering about using the Reconcile as top-level function for the verification. Maybe in the future we would like to start the watches from another place in code, so using the public interface which is never going to change will be more future proof for incoming refactorings.
There was a problem hiding this comment.
Done in d0303a7. Switched the test to drive through Reconcile so it survives any future refactor of where watches get triggered from.
|
|
||
| select { | ||
| case <-slowReached: | ||
| case <-time.After(2 * time.Second): |
There was a problem hiding this comment.
Let's name the magic consts for 2s to something meaningfull, like slowUserTimeout, or stuckWatchTimeout.
Same for other timeouts.
Let's also use 5s for the timeouts so that we don't risk the tests failing on a loaded CI.
There was a problem hiding this comment.
Done in d0303a7. Named the budget stuckWatchTimeout = 5 * time.Second with a doc note about CI flakiness.
kshalot
left a comment
There was a problem hiding this comment.
/lgtm
Thanks for taking the time to fix this! I don't have any additional comments.
| client.setConfigLock.Lock() | ||
| defer client.setConfigLock.Unlock() |
There was a problem hiding this comment.
IIUC, in theory this lock is redundant, because I think there's only one reconcile running per MultiKueueCluster at any given point, so in theory there's at most one thread accessing c.remoteClients[clusterName].
But this would be very implicit, so having this lock just in case sounds good to me.
There was a problem hiding this comment.
Agreed, the workqueue dedup makes it redundant in production. Kept as a local safety net because that guarantee is invisible from this file, and tests can bypass the workqueue (the one in this PR does).
There was a problem hiding this comment.
Good point, I missed that. Having said that I'm ok with either option, both have some merit: explicit check + a bit more robust tests, vs. less production code.
I'm ok to keep as is.
|
LGTM label has been added. DetailsGit tree hash: 50d53a06495f8d3bb4d583f16f4a4890c2125ea8 |
|
Trying to make the release note a bit more explicit that this is a bugfix: |
|
/lgtm |
|
@mimowo: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
LGTM label has been added. DetailsGit tree hash: 5243cd5d6eb84fed5231150b08ca6e46da424fdd |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mimowo, trilamsr The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
This one requires rebase now |
The single controller wide write lock in setRemoteClientConfig used to be held across the synchronous remote watch establishment in setConfig, which can take minutes against an unresponsive or warming apiserver. While that lock is held, every other MultiKueueCluster reconcile that calls setRemoteClientConfig blocks, so one slow remote stalls admission for every other cluster regardless of GroupKindConcurrency. Narrow the controller wide lock to just the remoteClients map find or insert, then serialize setConfig with a per cluster sync.Mutex on remoteClient. Concurrent reconciles for different clusters now run in parallel under their own locks. Same cluster reconciles still serialize through the workqueue dedup contract; the per cluster lock is belt and suspenders. Adds TestSetRemoteClientConfigDoesNotBlockOtherClusters which pins the property: while one cluster is parked inside its remote Watch call, another cluster's setRemoteClientConfig still completes. Refs 11297.
Per @mimowo's review, factor the map find or insert out of setRemoteClientConfig so each function uses the simple Lock(); defer Unlock() pattern instead of a manual unlock in the middle.
Per @mimowo's review on the test: - use Reconcile as the entry point instead of calling setRemoteClientConfig directly, so the test stays valid if watches ever get triggered from a different path - replace the magic 2s timeouts with a named stuckWatchTimeout = 5s so they survive a loaded CI runner
d0303a7 to
4620ba5
Compare
|
/lgtm |
|
LGTM label has been added. DetailsGit tree hash: cbf3679a2052a4c3fcacc024026bd79443383a7d |
|
@mimowo: new pull request created: #11332 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@mimowo: new pull request created: #11333 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What type of PR is this?
/kind bug
/area multikueue
What this PR does / why we need it:
Fixes #11297. One slow or unresponsive remote used to stall reconciles for every other MultiKueueCluster because
clustersReconciler.lockwas held across the synchronous remote watch establishment insetConfig. BumpingGroupKindConcurrencydid not help, since every worker ended up waiting on that same lock.This narrows
c.lockto just theremoteClientsmap find or insert, and adds a per clustersetConfigLockonremoteClientfor the slow path. Different clusters now reconcile in parallel under their own locks.Which issue(s) this PR fixes:
Fixes #11297
Special notes for your reviewer:
Test driven.
TestSetRemoteClientConfigDoesNotBlockOtherClusterswas written first and failed onmain(hits the 2 second timeout). After the fix it passes in about 40ms. Race detector clean on the full multikueue suite.stopAndRemoveCluster,controllerFor, andgetRemoteClientsalready only holdc.lockbriefly, so they did not need to change. Same key reconciles still serialize through the workqueue dedup contract, sosetRemoteClientConfigandstopAndRemoveClustercannot race for the same cluster name. The per cluster lock is mostly defensive.Pairs with #11304 (exponential watch establish timeout). Same head of line scenario, different angle.
Does this PR introduce a user-facing change?