MultiKueue: one slow remote no longer stalls every other cluster#11305

Merged
k8s-ci-robot merged 3 commits into
kubernetes-sigs:mainfrom
trilamsr:fix/multikueue-per-cluster-lock
May 19, 2026

@trilamsr
Contributor

@trilamsr trilamsr commented May 19, 2026

What type of PR is this?

/kind bug
/area multikueue

What this PR does / why we need it:

Fixes #11297. One slow or unresponsive remote used to stall reconciles for every other MultiKueueCluster because clustersReconciler.lock was held across the synchronous remote watch establishment in setConfig. Bumping GroupKindConcurrency did not help, since every worker ended up waiting on that same lock.

This narrows c.lock to just the find-or-insert on the remoteClients map, and adds a per-cluster setConfigLock on remoteClient for the slow path. Different clusters now reconcile in parallel, each under its own lock.

Which issue(s) this PR fixes:

Fixes #11297

Special notes for your reviewer:

Test-driven. TestSetRemoteClientConfigDoesNotBlockOtherClusters was written first and failed on main (it hit the 2-second timeout). After the fix it passes in about 40ms. The race detector is clean on the full multikueue suite.

stopAndRemoveCluster, controllerFor, and getRemoteClients already hold c.lock only briefly, so they did not need to change. Same-key reconciles still serialize through the workqueue dedup contract, so setRemoteClientConfig and stopAndRemoveCluster cannot race for the same cluster name. The per-cluster lock is mostly defensive.

Pairs with #11304 (exponential watch establish timeout). Same head of line scenario, different angle.

Does this PR introduce a user-facing change?

MultiKueue: Fixed a bug where one slow or unresponsive remote cluster could stall
reconciliation for other MultiKueueClusters, even when
`controller.groupKindConcurrency["MultiKueueCluster.kueue.x-k8s.io"]` was set above 1.
This could delay or block admission through other healthy clusters.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. area/multikueue Issues or PRs related to MultiKueue labels May 19, 2026
@netlify

netlify Bot commented May 19, 2026

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit 4620ba5
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a0c9c6f560da70008c81a63
😎 Deploy Preview https://deploy-preview-11305--kubernetes-sigs-kueue.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 19, 2026
@k8s-ci-robot
Contributor

Hi @trilamsr. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot requested a review from mbobrovskyi May 19, 2026 02:14
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 19, 2026
@k8s-ci-robot k8s-ci-robot requested a review from mimowo May 19, 2026 02:14
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 19, 2026
@trilamsr trilamsr force-pushed the fix/multikueue-per-cluster-lock branch 2 times, most recently from 466ad52 to b1e0334 Compare May 19, 2026 02:18
@trilamsr trilamsr changed the title MultiKueue: narrow controller lock and serialize setConfig per cluster MultiKueue: one slow remote no longer stalls every other cluster May 19, 2026
@trilamsr
Contributor Author

@mimowo took a shot at the lock refactor from #11297 since I was already in the file. TDD path: failing concurrency test on main (2 second head of line block), passes after narrowing the controller wide lock. Pairs with #11304, same head of line scenario from a different angle. Happy for you or anyone else to pick it apart.

@trilamsr trilamsr force-pushed the fix/multikueue-per-cluster-lock branch from b1e0334 to a6cd964 Compare May 19, 2026 04:04
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label May 19, 2026
@mimowo
Contributor

mimowo commented May 19, 2026

/ok-to-test
cc @kshalot @olekzabl ptal

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 19, 2026
@@ -497,8 +501,6 @@ func (c *clustersReconciler) stopAndRemoveCluster(clusterName string) {

func (c *clustersReconciler) setRemoteClientConfig(ctx context.Context, clusterName string, config *clientConfig, origin string) (*time.Duration, error) {
Contributor

Let's split this function into two, so that we can use the c.lock.Lock(); defer c.lock.Unlock() pattern for both of them.

Contributor Author

Done in 37cac3f. Factored the map find or insert out into findOrCreateRemoteClient, so both functions use the simple Lock(); defer Unlock() pattern. PTAL.

Contributor

lgtm

Contributor

@mimowo mimowo left a comment

Thank you, the code changes and the test look great. I left some rather minor remarks to make the test more readable and future-proof.


// Regression for #11297. A remote stuck inside Watch must not stop
// reconciles of other clusters.
func TestSetRemoteClientConfigDoesNotBlockOtherClusters(t *testing.T) {
Contributor

Nice test, thanks for adding 👍

slowDone := make(chan struct{})
go func() {
	defer close(slowDone)
	_, _ = reconciler.setRemoteClientConfig(ctx, "cluster-slow", slowConfig, defaultOrigin)
Contributor

I'm wondering about using Reconcile as the top-level function for the verification. Maybe in the future we will want to start the watches from another place in the code, so using the public interface, which is never going to change, will be more future-proof for upcoming refactorings.

Contributor Author

Done in d0303a7. Switched the test to drive through Reconcile so it survives any future refactor of where watches get triggered from.


select {
case <-slowReached:
case <-time.After(2 * time.Second):
Contributor

Let's give the magic 2s constants a meaningful name, like slowUserTimeout or stuckWatchTimeout.

Same for other timeouts.

Let's also use 5s for the timeouts so that we don't risk the tests failing on a loaded CI.

Contributor Author

Done in d0303a7. Named the budget stuckWatchTimeout = 5 * time.Second with a doc note about CI flakiness.

Contributor

@kshalot kshalot left a comment

/lgtm

Thanks for taking the time to fix this! I don't have any additional comments.

Comment on lines +522 to +523
client.setConfigLock.Lock()
defer client.setConfigLock.Unlock()
Contributor

IIUC, in theory this lock is redundant, because I think there's only one reconcile running per MultiKueueCluster at any given point, so in theory there's at most one thread accessing c.remoteClients[clusterName].

But this would be very implicit, so having this lock just in case sounds good to me.

Contributor Author

Agreed, the workqueue dedup makes it redundant in production. Kept as a local safety net because that guarantee is invisible from this file, and tests can bypass the workqueue (the one in this PR does).

Contributor

Good point, I missed that. Having said that I'm ok with either option, both have some merit: explicit check + a bit more robust tests, vs. less production code.

I'm ok to keep as is.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 50d53a06495f8d3bb4d583f16f4a4890c2125ea8

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026
@k8s-ci-robot k8s-ci-robot requested a review from kshalot May 19, 2026 16:47
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 19, 2026
@mimowo
Contributor

mimowo commented May 19, 2026

Trying to make the release note a bit more explicit that this is a bugfix:
/release-note-edit

MultiKueue: Fixed a bug where one slow or unresponsive remote cluster could stall
reconciliation for other MultiKueueClusters, even when
`controller.groupKindConcurrency["MultiKueueCluster.kueue.x-k8s.io"]` was set above 1.
This could delay or block admission through other healthy clusters.

@mimowo
Contributor

mimowo commented May 19, 2026

/lgtm
/approve
Thank you for the quick follow up on this issue 👍 Blocking all reconciles on the lock wasn't good for sure.
/cherrypick release-0.17
/cherrypick release-0.16

@k8s-infra-cherrypick-robot
Contributor

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.16, release-0.17 in new PRs and assign them to you.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 5243cd5d6eb84fed5231150b08ca6e46da424fdd

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, trilamsr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 19, 2026
@mimowo
Contributor

mimowo commented May 19, 2026

This one requires rebase now

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 19, 2026
trilamsr added 3 commits May 19, 2026 10:22
The single controller wide write lock in setRemoteClientConfig used to be
held across the synchronous remote watch establishment in setConfig, which
can take minutes against an unresponsive or warming apiserver. While that
lock is held, every other MultiKueueCluster reconcile that calls
setRemoteClientConfig blocks, so one slow remote stalls admission for every
other cluster regardless of GroupKindConcurrency.

Narrow the controller wide lock to just the remoteClients map find or
insert, then serialize setConfig with a per cluster sync.Mutex on
remoteClient. Concurrent reconciles for different clusters now run in
parallel under their own locks. Same cluster reconciles still serialize
through the workqueue dedup contract; the per cluster lock is belt and
suspenders.

Adds TestSetRemoteClientConfigDoesNotBlockOtherClusters which pins the
property: while one cluster is parked inside its remote Watch call,
another cluster's setRemoteClientConfig still completes.

Refs 11297.

Per @mimowo's review, factor the map find or insert out of
setRemoteClientConfig so each function uses the simple
Lock(); defer Unlock() pattern instead of a manual unlock in the middle.

Per @mimowo's review on the test:
- use Reconcile as the entry point instead of calling
  setRemoteClientConfig directly, so the test stays valid if watches
  ever get triggered from a different path
- replace the magic 2s timeouts with a named stuckWatchTimeout = 5s
  so they survive a loaded CI runner
@trilamsr trilamsr force-pushed the fix/multikueue-per-cluster-lock branch from d0303a7 to 4620ba5 Compare May 19, 2026 17:22
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026
@k8s-ci-robot k8s-ci-robot requested a review from mimowo May 19, 2026 17:22
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 19, 2026
@mimowo
Contributor

mimowo commented May 19, 2026

/lgtm
assuming the tests will pass

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: cbf3679a2052a4c3fcacc024026bd79443383a7d

@k8s-ci-robot k8s-ci-robot merged commit 22d936a into kubernetes-sigs:main May 19, 2026
37 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.18 milestone May 19, 2026
@k8s-infra-cherrypick-robot
Contributor

@mimowo: new pull request created: #11332

@k8s-infra-cherrypick-robot
Contributor

@mimowo: new pull request created: #11333

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/multikueue Issues or PRs related to MultiKueue cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

MultiKueue: GroupKindConcurrency has no effect because of a controller wide lock

5 participants