
MultiKueue: detect a hung remote in 1 minute, not 10 #11304

Merged
k8s-ci-robot merged 2 commits into kubernetes-sigs:main from trilamsr:fix/multikueue-exponential-watch-timeout
May 19, 2026

Conversation

@trilamsr
Contributor

@trilamsr trilamsr commented May 19, 2026

What type of PR is this?

/kind feature
/area multikueue

What this PR does / why we need it:

Follow-up to #11207. Replaces the static 10 min watchEstablishTimeout with a schedule that starts at 1 min and doubles with each consecutive failed connection attempt (tracked in failedConnAttempts), capped at the existing 10 min.

Schedule: 1m, 2m, 4m, 8m, 10m, 10m, ...

failedConnAttempts already resets to zero on a successful establishment and on config change, so there is no new state. The new schedule also stacks naturally with the retryAfter(rc.failedConnAttempts) reconnect backoff from #10990: both grow together when the same remote keeps failing.

Why bother: under the static 10 min, a truly hung remote keeps a reconciler worker parked for 10 full minutes before the timeout fires. With the new schedule, the first attempt fails fast (1 min), so hung remotes get caught quickly while slow but recovering remotes still get the time they need.

Today Kueue's served and storage versions match (v1beta2), so the apiserver cold cache path that motivated the generous 10 min cap is not exercised. The cap stays as a guard against future version skew and unknown apiserver behavior (kubernetes/kubernetes#136950).

Which issue(s) this PR fixes:

Fixes #11303

Special notes for your reviewer:

  • Refactors establishWatch to take an explicit timeout parameter instead of reading from a package var. Tests now pass testTimeout directly, which is cleaner than the previous var mutation pattern.
  • The schedule comes from pkg/util/wait.NewBackoff (the kueue-wide helper) rather than a hand-rolled function, per review feedback.
  • New TestEstablishBackoffSchedule pins the 1m/2m/4m/8m/10m schedule produced by our configured establishBackoff. The helper itself is tested upstream.
  • No behavior change for the success path or for already healthy clusters. Only the worst case (hung remote, no prior failures) changes: was 10 min, now 1 min.

Does this PR introduce a user-facing change?

MultiKueue: Fixed a bug where a hung watch connection to one remote cluster could block
reconciliation of other MultiKueueClusters, leaving them inactive and preventing workload
admission. Kueue now applies a circuit-breaking timeout while establishing remote-cluster
watches: the timeout starts at 1 minute and backs off exponentially on consecutive failures,
up to 10 minutes.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. area/multikueue Issues or PRs related to MultiKueue labels May 19, 2026
@netlify

netlify Bot commented May 19, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit efccdfe
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a0c933c2238590008909a34

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 19, 2026
@k8s-ci-robot
Contributor

Hi @trilamsr. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 19, 2026
@k8s-ci-robot k8s-ci-robot requested review from kshalot and olekzabl May 19, 2026 01:54
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 19, 2026
@trilamsr trilamsr force-pushed the fix/multikueue-exponential-watch-timeout branch 2 times, most recently from f3b1bd2 to 7966361 Compare May 19, 2026 02:02
@trilamsr trilamsr changed the title from "MultiKueue: exponential watch establish timeout based on failedConnAttempts" to "MultiKueue: detect a hung remote in 1 minute, not 10" May 19, 2026
@trilamsr
Contributor Author

@mimowo here's the exponential timeout follow up I promised on #11207. Drops first attempt to 1 min so a hung remote gets caught fast, keeps the 10 min cap on retries via failedConnAttempts. WDYT?

@mimowo
Contributor

mimowo commented May 19, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 19, 2026
@mimowo
Contributor

mimowo commented May 19, 2026

cc @kshalot @olekzabl ptal

Replace the static 10 min watchEstablishTimeout with a schedule that starts
at 1 min and doubles on each consecutive failed connection attempt
(failedConnAttempts), up to the existing 10 min cap. The first connect
attempt now catches a hung remote in 1 min instead of waiting the full 10.
A slow-but-recovering remote still gets the time it needs because
failedConnAttempts already drives the retryAfter backoff, so the two
schedules grow together.

Schedule: 1m, 2m, 4m, 8m, 10m, 10m, ...

Refs kubernetes-sigs#11303.
@trilamsr trilamsr force-pushed the fix/multikueue-exponential-watch-timeout branch from 7966361 to a62fc92 Compare May 19, 2026 09:55
Comment on lines +101 to +107
func establishTimeoutFor(failedAttempts uint) time.Duration {
	t := initialEstablishTimeout << min(failedAttempts, establishTimeoutMaxSteps)
	if t > maxEstablishTimeout {
		return maxEstablishTimeout
	}
	return t
}
Contributor

This looks ok-ish, but could we instead use our helper for exponential backoffs inside: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/util/wait/backoff.go#L63

I'm not saying the code is complex, but the helpers have already been tested pretty well.

Contributor

Also, I would like to keep a consistent picture of the codebase so that we don't re-implement the same thing over and over. Re-implementation may be simple in this case, but it might be more challenging elsewhere.

Contributor

+1, even if not UntilWithBackoff, there's already a Backoff struct that calculates the timeout given the attempt:

// NewBackoff creates a Backoff calculator with the given parameters.
// If cap is zero, it defaults to math.MaxInt64 / Factor.
func NewBackoff(initial, cap time.Duration, factor, jitter float64) Backoff {
	return Backoff{
		backoff: wait.Backoff{
			Duration: initial,
			Factor:   factor,
			Jitter:   jitter,
			Steps:    math.MaxInt,
			Cap:      cmp.Or(cap, time.Duration(math.MaxInt64/math.Ceil(factor))),
		},
	}
}

// WaitTime returns the backoff duration for the given iteration.
func (b Backoff) WaitTime(iteration int) time.Duration {
	var duration time.Duration
	for range iteration {
		duration = b.backoff.Step()
		if duration == b.backoff.Cap { // wait.Backoff caps at limit, no need to continue iterating.
			break
		}
	}
	return duration
}

In our case we'd have sth like:

b := NewBackoff(time.Minute, 10*time.Minute, 2, 0)
timeout := b.WaitTime(int(rc.failedConnAttempts))

Contributor Author

@trilamsr trilamsr May 19, 2026

Done. Swapped the establishTimeoutFor for wait.NewBackoff(initialEstablishTimeout, maxEstablishTimeout, 2, 0) and call establishBackoff.WaitTime(int(rc.failedConnAttempts)+1) at the call site. PTAL

Per review from @mimowo and @kshalot, swap the local establishTimeoutFor
function for pkg/util/wait.NewBackoff, which is the kueue-wide helper for
exponential backoffs. Same 1m/2m/4m/8m/10m schedule, less custom code.
@mimowo
Contributor

mimowo commented May 19, 2026

I'm tempted to merge this as a cherrypick as part of the fix for #11207, as this feels closely connected, wdyt @trilamsr ?

My reason is that this PR isn't really a feature on its own, it is just an improved version of the bugfix we have.

The previous PR hasn't been released yet, so I would just turn its release note to NONE and describe the change here.

@trilamsr
Contributor Author

I'm tempted to merge this as a cherrypick as part of the fix for #11207, as this feels closely connected, wdyt @trilamsr ?

My reason is that this PR isn't really a feature on its own, it is just an improved version of the bugfix we have.

The previous PR hasn't been released yet, so I would just turn its release note to NONE and describe the change here.

Yeah, agreed. They're really one fix in two commits. Combined release note:

MultiKueue: a hung watch connection to one remote cluster no longer blocks reconciliation of other MultiKueueClusters. The controller bounds remote-watch establishment with an exponential timeout starting at 1 min and growing up to 10 min on consecutive failures, so a stuck remote is detected quickly while a slow but recovering one still gets the time it needs.

I'll flip #11207's note to NONE. /cherrypick release-0.17 release-0.16 here when you're ready?

@mimowo
Contributor

mimowo commented May 19, 2026

/lgtm
/approve
Thank you for hardening the MultiKueue implementation 👍
/cherrypick release-0.17
/cherrypick release-0.16
/remove-kind feature
/kind bug

@k8s-infra-cherrypick-robot
Contributor

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.16, release-0.17 in new PRs and assign them to you.

Details

In response to this:

/lgtm
/approve
Thank you for hardening the MultiKueue implementation 👍
/cherrypick release-0.17
/cherrypick release-0.16
/remove-kind feature
/kind bug

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed kind/feature Categorizes issue or PR as related to a new feature. labels May 19, 2026
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Details

Git tree hash: 6b49a94b537757842411bcd06eb60d1f641b7fa3

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, trilamsr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 19, 2026
@mimowo
Contributor

mimowo commented May 19, 2026

Yeah, agreed. They're really one fix in two commits.

Ok, cool.

Combined release note:

That sounds ok, but let me propose something that indicates a bit more explicitly that it is a bugfix:

/release-note-edit

MultiKueue: Fixed a bug where a hung watch connection to one remote cluster could block
reconciliation of other MultiKueueClusters, leaving them inactive and preventing workload
admission. Kueue now applies a circuit-breaking timeout while establishing remote-cluster
watches: the timeout starts at 1 minute and backs off exponentially on consecutive failures,
up to 10 minutes.

@mimowo
Contributor

mimowo commented May 19, 2026

I'll flip #11207 note to NONE. /cherrypick release-0.17 release-0.16 here when you're ready?

Let me do that now.

@k8s-infra-cherrypick-robot
Contributor

@mimowo: new pull request created: #11328

@k8s-infra-cherrypick-robot
Contributor

@mimowo: new pull request created: #11329


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.
area/multikueue Issues or PRs related to MultiKueue
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/bug Categorizes issue or PR as related to a bug.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
release-note Denotes a PR that will be considered when it comes time to generate release notes.
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MultiKueue: hung remote takes 10 minutes to detect, should be 1

5 participants