[release-0.16] MultiKueue: detect a hung remote in 1 minute, not 10 #11329

Merged
k8s-ci-robot merged 2 commits into kubernetes-sigs:release-0.16 from k8s-infra-cherrypick-robot:cherry-pick-11304-to-release-0.16 on May 19, 2026

@k8s-infra-cherrypick-robot (Contributor)

This is an automated cherry-pick of #11304

/assign mimowo

MultiKueue: Fixed a bug where a hung watch connection to one remote cluster could block
reconciliation of other MultiKueueClusters, leaving them inactive and preventing workload
admission. Kueue now applies a circuit-breaking timeout while establishing remote-cluster
watches: the timeout starts at 1 minute and backs off exponentially on consecutive failures,
up to 10 minutes.

trilamsr added 2 commits May 19, 2026 17:11
Replace the static 10 min watchEstablishTimeout with a schedule that starts
at 1 min and doubles on each consecutive failed connection attempt
(failedConnAttempts), up to the existing 10 min cap. The first connection
attempt now detects a hung remote in 1 min instead of waiting the full 10.
A slow-but-recovering remote still gets the time it needs, because
failedConnAttempts already drives the retryAfter backoff, so the two scales
grow together.

Schedule: 1m, 2m, 4m, 8m, 10m, 10m, ...

Refs kubernetes-sigs#11303.

Per review from @mimowo and @kshalot, swap the local establishTimeoutFor
function for pkg/util/wait.NewBackoff, which is the kueue-wide helper for
exponential backoffs. Same 1m/2m/4m/8m/10m schedule, less custom code.
k8s-ci-robot added the release-note label May 19, 2026
k8s-ci-robot added this to the v0.16 milestone May 19, 2026

netlify Bot commented May 19, 2026

Deploy Preview for kubernetes-sigs-kueue ready!

🔨 Latest commit: 59f98d9
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a0c99af4e70660007d00560
😎 Deploy Preview: https://deploy-preview-11329--kubernetes-sigs-kueue.netlify.app

k8s-ci-robot added the cncf-cla: yes label May 19, 2026
k8s-ci-robot requested review from mimowo and pajakd May 19, 2026 17:11
k8s-ci-robot added the size/M label May 19, 2026

mimowo commented May 19, 2026

/kind bug
/lgtm
/approve

k8s-ci-robot added the kind/bug and lgtm labels May 19, 2026
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: e590a05973842a84604c285ce2a485e59f8b29d0

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: k8s-infra-cherrypick-robot, mimowo

k8s-ci-robot added the approved label May 19, 2026
k8s-ci-robot merged commit 3ca9964 into kubernetes-sigs:release-0.16 May 19, 2026
35 checks passed
Labels

approved, cncf-cla: yes, kind/bug, lgtm, release-note, size/M

4 participants