MultiKueue: one slow remote no longer stalls every other cluster by trilamsr · Pull Request #11305 · kubernetes-sigs/kueue

trilamsr · 2026-05-19T02:14:24Z

What type of PR is this?

/kind bug
/area multikueue

What this PR does / why we need it:

Fixes #11297. One slow or unresponsive remote used to stall reconciles for every other MultiKueueCluster because clustersReconciler.lock was held across the synchronous remote watch establishment in setConfig. Bumping GroupKindConcurrency did not help, since every worker ended up waiting on that same lock.

This narrows c.lock to just the remoteClients map find or insert, and adds a per cluster setConfigLock on remoteClient for the slow path. Different clusters now reconcile in parallel under their own locks.

Which issue(s) this PR fixes:

Fixes #11297

Special notes for your reviewer:

Test driven. TestSetRemoteClientConfigDoesNotBlockOtherClusters was written first and failed on main (hits the 2 second timeout). After the fix it passes in about 40ms. Race detector clean on the full multikueue suite.

stopAndRemoveCluster, controllerFor, and getRemoteClients already only hold c.lock briefly, so they did not need to change. Same key reconciles still serialize through the workqueue dedup contract, so setRemoteClientConfig and stopAndRemoveCluster cannot race for the same cluster name. The per cluster lock is mostly defensive.
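
To make the dedup contract mentioned above concrete, here is a simplified model of how such a queue can behave. It is an illustration under assumptions, not the client-go workqueue implementation: `dedupQueue` and its methods are hypothetical, and the sketch is deliberately not goroutine safe.

```go
package main

import "fmt"

// dedupQueue is a simplified, single goroutine model of the contract:
// a key is never handed to two workers at the same time.
type dedupQueue struct {
	queue      []string
	dirty      map[string]bool // keys waiting to be processed
	processing map[string]bool // keys currently handed to a worker
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

// Add enqueues key unless it is already pending. If key is in flight it is
// only marked dirty and gets re-queued when Done is called.
func (q *dedupQueue) Add(key string) {
	if q.dirty[key] {
		return
	}
	q.dirty[key] = true
	if q.processing[key] {
		return
	}
	q.queue = append(q.queue, key)
}

// Get hands out the next key, or ok=false if nothing is available.
func (q *dedupQueue) Get() (key string, ok bool) {
	if len(q.queue) == 0 {
		return "", false
	}
	key, q.queue = q.queue[0], q.queue[1:]
	q.processing[key] = true
	delete(q.dirty, key)
	return key, true
}

// Done marks key as finished and re-queues it if it was re-added meanwhile.
func (q *dedupQueue) Done(key string) {
	delete(q.processing, key)
	if q.dirty[key] {
		q.queue = append(q.queue, key)
	}
}

func main() {
	q := newDedupQueue()
	q.Add("cluster-a")
	k, _ := q.Get()    // worker 1 takes cluster-a
	q.Add("cluster-a") // re-added while still in flight
	_, ok := q.Get()   // a second worker gets nothing
	fmt.Println(k, ok) // prints: cluster-a false
	q.Done("cluster-a")
	k2, ok2 := q.Get()
	fmt.Println(k2, ok2) // prints: cluster-a true
}
```

Under this model two reconciles for the same cluster name never overlap, which is why the per cluster lock is described as mostly defensive. A test that calls the reconciler directly bypasses the queue and loses that guarantee.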

Pairs with #11304 (exponential watch establish timeout). Same head of line scenario, different angle.

Does this PR introduce a user-facing change?

MultiKueue: Fixed a bug where one slow or unresponsive remote cluster could stall
reconciliation for other MultiKueueClusters, even when
`controller.groupKindConcurrency["MultiKueueCluster.kueue.x-k8s.io"]` was set above 1.
This could delay or block admission through other healthy clusters.

netlify · 2026-05-19T02:14:29Z

✅ Deploy Preview for kubernetes-sigs-kueue ready!

Name	Link
🔨 Latest commit	`4620ba5`
🔍 Latest deploy log	https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a0c9c6f560da70008c81a63
😎 Deploy Preview	https://deploy-preview-11305--kubernetes-sigs-kueue.netlify.app

k8s-ci-robot · 2026-05-19T02:14:34Z

Hi @trilamsr. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

trilamsr · 2026-05-19T02:28:42Z

@mimowo took a shot at the lock refactor from #11297 since I was already in the file. TDD path: failing concurrency test on main (2 second head of line block), passes after narrowing the controller wide lock. Pairs with #11304, same head of line scenario from a different angle. Happy for you or anyone else to pick it apart.

mimowo · 2026-05-19T06:47:30Z

/ok-to-test
cc @kshalot @olekzabl ptal

mimowo · 2026-05-19T06:48:55Z

@@ -497,8 +501,6 @@ func (c *clustersReconciler) stopAndRemoveCluster(clusterName string) {

 func (c *clustersReconciler) setRemoteClientConfig(ctx context.Context, clusterName string, config *clientConfig, origin string) (*time.Duration, error) {


Let's split this function into two, so that we can use c.lock.Lock(); defer c.lock.Unlock() in both of them.

Done in 37cac3f. Factored the map find or insert out into findOrCreateRemoteClient, so both functions use the simple Lock(); defer Unlock() pattern. PTAL.

mimowo

Thank you, the code changes and the test look great. I left a few minor remarks to make the test more readable and future-proof.

mimowo · 2026-05-19T15:47:38Z

+
+// Regression for #11297. A remote stuck inside Watch must not stop
+// reconciles of other clusters.
+func TestSetRemoteClientConfigDoesNotBlockOtherClusters(t *testing.T) {


Nice test, thanks for adding 👍

mimowo · 2026-05-19T15:49:52Z

+	slowDone := make(chan struct{})
+	go func() {
+		defer close(slowDone)
+		_, _ = reconciler.setRemoteClientConfig(ctx, "cluster-slow", slowConfig, defaultOrigin)


I'm wondering about using Reconcile as the top-level entry point for the verification. In the future we may want to start the watches from another place in the code, so going through the public interface, which is not going to change, would be more future-proof against upcoming refactorings.

Done in d0303a7. Switched the test to drive through Reconcile so it survives any future refactor of where watches get triggered from.

mimowo · 2026-05-19T15:51:30Z

+
+	select {
+	case <-slowReached:
+	case <-time.After(2 * time.Second):


Let's give the magic 2s constants a meaningful name, like slowUserTimeout or stuckWatchTimeout.

Same for other timeouts.

Let's also use 5s for the timeouts so that we don't risk the tests failing on a loaded CI.

Done in d0303a7. Named the budget stuckWatchTimeout = 5 * time.Second with a doc note about CI flakiness.

kshalot

/lgtm

Thanks for taking the time to fix this! I don't have any additional comments.

kshalot · 2026-05-19T16:03:13Z

+	client.setConfigLock.Lock()
+	defer client.setConfigLock.Unlock()


IIUC, this lock is redundant in theory: I think there's only one reconcile running per MultiKueueCluster at any given point, so at most one goroutine accesses c.remoteClients[clusterName].

But this would be very implicit, so having this lock just in case sounds good to me.

Agreed, the workqueue dedup makes it redundant in production. Kept as a local safety net because that guarantee is invisible from this file, and tests can bypass the workqueue (the one in this PR does).

Good point, I missed that. Having said that I'm ok with either option, both have some merit: explicit check + a bit more robust tests, vs. less production code.

I'm ok to keep as is.

k8s-ci-robot · 2026-05-19T16:12:07Z

LGTM label has been added.

Details

Git tree hash: 50d53a06495f8d3bb4d583f16f4a4890c2125ea8

mimowo · 2026-05-19T16:58:43Z

Trying to make the release note a bit more explicit that this is a bugfix:
/release-note-edit

MultiKueue: Fixed a bug where one slow or unresponsive remote cluster could stall
reconciliation for other MultiKueueClusters, even when
`controller.groupKindConcurrency["MultiKueueCluster.kueue.x-k8s.io"]` was set above 1.
This could delay or block admission through other healthy clusters.

mimowo · 2026-05-19T16:59:49Z

/lgtm
/approve
Thank you for the quick follow up on this issue 👍 Blocking all reconciles on the lock wasn't good for sure.
/cherrypick release-0.17
/cherrypick release-0.16

k8s-infra-cherrypick-robot · 2026-05-19T16:59:52Z

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.16, release-0.17 in new PRs and assign them to you.

k8s-ci-robot · 2026-05-19T16:59:57Z

LGTM label has been added.

Details

Git tree hash: 5243cd5d6eb84fed5231150b08ca6e46da424fdd

k8s-ci-robot · 2026-05-19T16:59:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, trilamsr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [mimowo]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mimowo · 2026-05-19T17:22:01Z

This one requires rebase now

The single controller wide write lock in setRemoteClientConfig used to be held across the synchronous remote watch establishment in setConfig, which can take minutes against an unresponsive or warming apiserver. While that lock is held, every other MultiKueueCluster reconcile that calls setRemoteClientConfig blocks, so one slow remote stalls admission for every other cluster regardless of GroupKindConcurrency.

Narrow the controller wide lock to just the remoteClients map find or insert, then serialize setConfig with a per cluster sync.Mutex on remoteClient. Concurrent reconciles for different clusters now run in parallel under their own locks. Same cluster reconciles still serialize through the workqueue dedup contract; the per cluster lock is belt and suspenders.

Adds TestSetRemoteClientConfigDoesNotBlockOtherClusters which pins the property: while one cluster is parked inside its remote Watch call, another cluster's setRemoteClientConfig still completes.

Refs 11297.

Per @mimowo's review, factor the map find or insert out of setRemoteClientConfig so each function uses the simple Lock(); defer Unlock() pattern instead of a manual unlock in the middle.

Per @mimowo's review on the test:
- use Reconcile as the entry point instead of calling setRemoteClientConfig directly, so the test stays valid if watches ever get triggered from a different path
- replace the magic 2s timeouts with a named stuckWatchTimeout = 5s so they survive a loaded CI runner

mimowo · 2026-05-19T17:25:39Z

/lgtm
assuming the tests will pass

k8s-ci-robot · 2026-05-19T17:25:54Z

LGTM label has been added.

Details

Git tree hash: cbf3679a2052a4c3fcacc024026bd79443383a7d

k8s-infra-cherrypick-robot · 2026-05-19T17:50:29Z

@mimowo: new pull request created: #11332

k8s-infra-cherrypick-robot · 2026-05-19T17:51:06Z

@mimowo: new pull request created: #11333

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 19, 2026

k8s-ci-robot requested a review from mbobrovskyi May 19, 2026 02:14

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 19, 2026

k8s-ci-robot requested a review from mimowo May 19, 2026 02:14

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 19, 2026

trilamsr force-pushed the fix/multikueue-per-cluster-lock branch 2 times, most recently from 466ad52 to b1e0334 Compare May 19, 2026 02:18

trilamsr changed the title ~~MultiKueue: narrow controller lock and serialize setConfig per cluster~~ MultiKueue: one slow remote no longer stalls every other cluster May 19, 2026

trilamsr force-pushed the fix/multikueue-per-cluster-lock branch from b1e0334 to a6cd964 Compare May 19, 2026 04:04

k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label May 19, 2026

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 19, 2026

mimowo reviewed May 19, 2026

kshalot reviewed May 19, 2026

k8s-ci-robot assigned kshalot May 19, 2026

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026

k8s-ci-robot requested a review from kshalot May 19, 2026 16:47

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 19, 2026

k8s-ci-robot assigned mimowo May 19, 2026

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 19, 2026

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 19, 2026

trilamsr added 3 commits May 19, 2026 10:22

Split out findOrCreateRemoteClient per review

1a0607f

Per @mimowo's review, factor the map find or insert out of setRemoteClientConfig so each function uses the simple Lock(); defer Unlock() pattern instead of a manual unlock in the middle.

trilamsr force-pushed the fix/multikueue-per-cluster-lock branch from d0303a7 to 4620ba5 Compare May 19, 2026 17:22

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026

k8s-ci-robot requested a review from mimowo May 19, 2026 17:22

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 19, 2026

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026

k8s-ci-robot merged commit 22d936a into kubernetes-sigs:main May 19, 2026
37 checks passed

k8s-ci-robot added this to the v0.18 milestone May 19, 2026

k8s-infra-cherrypick-robot mentioned this pull request May 19, 2026

[release-0.16] MultiKueue: one slow remote no longer stalls every other cluster #11332

Merged

k8s-infra-cherrypick-robot mentioned this pull request May 19, 2026

[release-0.17] MultiKueue: one slow remote no longer stalls every other cluster #11333

Merged
