MultiKueue: detect a hung remote in 1 minute, not 10#11304
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
|
Hi @trilamsr. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
f3b1bd2 to
7966361
Compare
|
/ok-to-test |
Replace the static 10 min watchEstablishTimeout with a schedule that starts at 1 min and doubles on each consecutive failedConnAttempt up to the existing 10 min cap. A healthy first connect catches a hung remote in 1 min instead of waiting the full 10. A slow-but-recovering remote still gets the time it needs because failedConnAttempts already drives the retryAfter backoff, so the two scales grow together. Schedule: 1m, 2m, 4m, 8m, 10m, 10m, ... Refs kubernetes-sigs#11303.
7966361 to
a62fc92
Compare
| func establishTimeoutFor(failedAttempts uint) time.Duration { | ||
| t := initialEstablishTimeout << min(failedAttempts, establishTimeoutMaxSteps) | ||
| if t > maxEstablishTimeout { | ||
| return maxEstablishTimeout | ||
| } | ||
| return t | ||
| } |
There was a problem hiding this comment.
This looks ok-ish, but could we instead use our helper for exponential backoffs inside: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/util/wait/backoff.go#L63
I'm not saying the code is complex, but the helpers have already been tested pretty well.
There was a problem hiding this comment.
Also, I would like to give the consistent picture of the codebase so that we don't re-implement the same thing over again. Re-implementation may be simple in this case, but might be more challenging in another place.
There was a problem hiding this comment.
+1, even if not UntilWithBackoff, there's already a Backoff struct that calculates the timeout given the attempt:
kueue/pkg/util/wait/backoff.go
Lines 33 to 58 in 91387fe
In our case we'd have sth like:
b := NewBackoff(1m, 10m, 2, 0)
timeout := b.WaitTime(rc.failedConnAttempts)There was a problem hiding this comment.
Done. Swapped the establishTimeoutFor for wait.NewBackoff(initialEstablishTimeout, maxEstablishTimeout, 2, 0) and call establishBackoff.WaitTime(int(rc.failedConnAttempts)+1) at the call site. PTAL
|
I'm tempted to merge this as a cherrypick as part of the fix for #11207, as this feels closely connected, wdyt @trilamsr ? My reason is that this PR isn't really a feature on its own, it is just an improved version of the bugfix we have. The previous PR hasn't been yet released so I would just turn its release-note to NONE, and describe the change here. |
Yeah, agreed. They're really one fix in two commits. Combined release note: I'll flip #11207's note to NONE. /cherrypick release-0.17 release-0.16 here when you're ready? |
|
/lgtm |
|
@mimowo: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
LGTM label has been added. DetailsGit tree hash: 6b49a94b537757842411bcd06eb60d1f641b7fa3 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mimowo, trilamsr The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Ok, cool.
That sounds ok, but let me propose something a bit more explictily indicating it is a bugfix: /release-note-edit |
Let me do that now. |
|
@mimowo: new pull request created: #11328 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@mimowo: new pull request created: #11329 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What type of PR is this?
/kind feature
/area multikueue
What this PR does / why we need it:
Follow up to #11207. Replaces the static 10 min
watchEstablishTimeoutwith a schedule that starts at 1 min and doubles on each consecutivefailedConnAttempts, capped at the existing 10 min.Schedule: 1m, 2m, 4m, 8m, 10m, 10m, ...
failedConnAttemptsalready resets to zero on a successful establishment and on config change, so there is no new state. The new schedule also stacks naturally with theretryAfter(rc.failedConnAttempts)reconnect backoff from #10990: both grow together when the same remote keeps failing.Why bother: under the static 10 min, a truly hung remote keeps a reconciler worker parked for 10 full minutes before the timeout fires. With the new schedule, the first attempt fails fast (1 min), so hung remotes get caught quickly while slow but recovering remotes still get the time they need.
Today Kueue's served and storage versions match (v1beta2), so the apiserver cold cache path that motivated the generous 10 min cap is not exercised. The cap stays as a guard against future version skew and unknown apiserver behavior (kubernetes/kubernetes#136950).
Which issue(s) this PR fixes:
Fixes #11303
Special notes for your reviewer:
establishWatchto take an explicittimeoutparameter instead of reading from a package var. Tests now passtestTimeoutdirectly, which is cleaner than the previous var mutation pattern.pkg/util/wait.NewBackoff(the kueue-wide helper) rather than a hand rolled function, per review feedback.TestEstablishBackoffSchedulepins the 1m/2m/4m/8m/10m schedule produced by our configuredestablishBackoff. The helper itself is tested upstream.Does this PR introduce a user-facing change?