[release-0.17] MultiKueue: one slow remote no longer stalls every other cluster #11333

Merged
k8s-ci-robot merged 3 commits into kubernetes-sigs:release-0.17 from k8s-infra-cherrypick-robot:cherry-pick-11305-to-release-0.17
May 19, 2026

@k8s-infra-cherrypick-robot
Contributor

This is an automated cherry-pick of #11305

/assign mimowo

MultiKueue: Fixed a bug where one slow or unresponsive remote cluster could stall
reconciliation for other MultiKueueClusters, even when
`controller.groupKindConcurrency["MultiKueueCluster.kueue.x-k8s.io"]` was set above 1.
This could delay or block admission through other healthy clusters.

trilamsr added 3 commits May 19, 2026 17:51
The single controller-wide write lock in setRemoteClientConfig used to be
held across the synchronous remote watch establishment in setConfig, which
can take minutes against an unresponsive or warming apiserver. While that
lock is held, every other MultiKueueCluster reconcile that calls
setRemoteClientConfig blocks, so one slow remote stalls admission for every
other cluster regardless of GroupKindConcurrency.

Narrow the controller-wide lock to just the remoteClients map
find-or-insert, then serialize setConfig with a per-cluster sync.Mutex on
remoteClient. Concurrent reconciles for different clusters now run in
parallel under their own locks. Same-cluster reconciles still serialize
through the workqueue dedup contract; the per-cluster lock is a
belt-and-suspenders safeguard.
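
The locking change described above can be illustrated with a minimal, self-contained sketch. It is not the Kueue implementation: `clusterRegistry`, `getOrCreate`, and the `connect` hook are hypothetical names introduced here, and only `setRemoteClientConfig`, `setConfig`, `remoteClient`, and `remoteClients` echo identifiers from the commit message. The sketch assumes the shared lock is a `sync.RWMutex` taken in write mode.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// remoteClient is a simplified stand-in for the per-cluster client state.
type remoteClient struct {
	mu     sync.Mutex // serializes setConfig for this cluster only
	config string
}

// setConfig holds only this cluster's lock while connect runs. connect is a
// hypothetical hook standing in for the synchronous remote watch setup.
func (rc *remoteClient) setConfig(config string, connect func()) {
	rc.mu.Lock()
	defer rc.mu.Unlock()
	connect() // may block for a long time against an unresponsive apiserver
	rc.config = config
}

// clusterRegistry is a hypothetical stand-in for the controller-side state.
type clusterRegistry struct {
	mu            sync.RWMutex // guards the remoteClients map, nothing else
	remoteClients map[string]*remoteClient
}

func newClusterRegistry() *clusterRegistry {
	return &clusterRegistry{remoteClients: map[string]*remoteClient{}}
}

// getOrCreate takes the registry-wide lock just long enough to find or insert.
func (r *clusterRegistry) getOrCreate(name string) *remoteClient {
	r.mu.Lock()
	defer r.mu.Unlock()
	rc, found := r.remoteClients[name]
	if !found {
		rc = &remoteClient{}
		r.remoteClients[name] = rc
	}
	return rc
}

// setRemoteClientConfig does not hold the registry-wide lock across connect.
func (r *clusterRegistry) setRemoteClientConfig(name, config string, connect func()) {
	r.getOrCreate(name).setConfig(config, connect)
}

func main() {
	r := newClusterRegistry()
	var wg sync.WaitGroup
	start := time.Now()
	for name, delay := range map[string]time.Duration{
		"slow":    300 * time.Millisecond,
		"healthy": 0,
	} {
		wg.Add(1)
		go func(name string, delay time.Duration) {
			defer wg.Done()
			r.setRemoteClientConfig(name, "kubeconfig-"+name, func() { time.Sleep(delay) })
			fmt.Printf("%s configured after %v\n", name, time.Since(start).Round(time.Millisecond))
		}(name, delay)
	}
	wg.Wait()
}
```

Because the registry-wide lock is released before `connect` runs, the `healthy` goroutine never waits for the `slow` one, whichever of the two starts first.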

Adds TestSetRemoteClientConfigDoesNotBlockOtherClusters which pins the
property: while one cluster is parked inside its remote Watch call,
another cluster's setRemoteClientConfig still completes.

Refs 11297.

Per @mimowo's review, factor the map find-or-insert out of
setRemoteClientConfig so each function uses the simple
Lock(); defer Unlock() pattern instead of a manual unlock in the middle.

Per @mimowo's review on the test:
- use Reconcile as the entry point instead of calling
  setRemoteClientConfig directly, so the test stays valid if watches
  ever get triggered from a different path
- replace the magic 2s timeouts with a named stuckWatchTimeout = 5s
  so they survive a loaded CI runner
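
The property the test pins can be sketched as a small check: park one cluster inside its connect step, then see whether a second cluster still completes within a named timeout. This is only an illustration under stated assumptions. The real test drives Reconcile against a fake remote Watch, whereas `configureFunc`, `globalLock`, `perClusterLock`, and `otherClusterCompletes` are hypothetical helpers invented here, and `stuckWatchTimeout` is shortened from the commit's 5s to 1s so the failing case returns sooner.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// stuckWatchTimeout bounds how long the check waits for the healthy cluster.
// The commit uses 5s; this sketch uses 1s (an assumption made for brevity).
const stuckWatchTimeout = 1 * time.Second

// configureFunc is a hypothetical entry point: configure one cluster, calling
// connect where the real code would establish the remote watch.
type configureFunc func(cluster string, connect func())

// globalLock models the old behavior: one lock held across connect.
func globalLock() configureFunc {
	var mu sync.Mutex
	return func(_ string, connect func()) {
		mu.Lock()
		defer mu.Unlock()
		connect()
	}
}

// perClusterLock models the fix: the shared lock covers only the map lookup.
func perClusterLock() configureFunc {
	var mu sync.Mutex
	locks := map[string]*sync.Mutex{}
	lockFor := func(cluster string) *sync.Mutex {
		mu.Lock()
		defer mu.Unlock()
		if locks[cluster] == nil {
			locks[cluster] = &sync.Mutex{}
		}
		return locks[cluster]
	}
	return func(cluster string, connect func()) {
		l := lockFor(cluster)
		l.Lock()
		defer l.Unlock()
		connect()
	}
}

// otherClusterCompletes parks cluster "slow" inside connect, then reports
// whether cluster "healthy" can still be configured within timeout.
func otherClusterCompletes(configure configureFunc, timeout time.Duration) bool {
	entered := make(chan struct{})
	release := make(chan struct{})
	slowDone := make(chan struct{})
	go func() {
		defer close(slowDone)
		configure("slow", func() { close(entered); <-release })
	}()
	<-entered // the slow cluster is now parked inside connect

	healthyDone := make(chan struct{})
	go func() {
		defer close(healthyDone)
		configure("healthy", func() {})
	}()

	completed := false
	select {
	case <-healthyDone:
		completed = true
	case <-time.After(timeout):
	}
	close(release) // unpark the slow cluster so both goroutines can exit
	<-slowDone
	<-healthyDone
	return completed
}

func main() {
	fmt.Println("global lock:", otherClusterCompletes(globalLock(), stuckWatchTimeout))
	fmt.Println("per-cluster lock:", otherClusterCompletes(perClusterLock(), stuckWatchTimeout))
}
```

With the global lock the healthy cluster cannot proceed until the slow one is released, so the check reports `false`; with per-cluster locks it reports `true`. A generous named timeout matters only for the passing case, where a loaded runner might otherwise time out before the healthy cluster finishes.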

k8s-ci-robot added the release-note label May 19, 2026
k8s-ci-robot added this to the v0.17 milestone May 19, 2026
k8s-ci-robot added the cncf-cla: yes label May 19, 2026
k8s-ci-robot requested review from PBundyra and tenzen-y May 19, 2026 17:51
k8s-ci-robot added the size/L label May 19, 2026
@mimowo
Contributor

mimowo commented May 19, 2026

/lgtm
/approve

k8s-ci-robot added the lgtm label May 19, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: ff7d3ef76938ff7659f8880b645922c2bf040912

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: k8s-infra-cherrypick-robot, mimowo

k8s-ci-robot added the approved label May 19, 2026
@mimowo
Contributor

mimowo commented May 19, 2026

/kind bug

k8s-ci-robot added the kind/bug label May 19, 2026
@netlify

netlify Bot commented May 19, 2026

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit 03a16b9
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a0ca30ca0d4cd000813632d
😎 Deploy Preview https://deploy-preview-11333--kubernetes-sigs-kueue.netlify.app

k8s-ci-robot merged commit 2889aa3 into kubernetes-sigs:release-0.17 May 19, 2026
36 checks passed

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
kind/bug: Categorizes issue or PR as related to a bug.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
release-note: Denotes a PR that will be considered when it comes time to generate release notes.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

4 participants