MultiKueue: prevent a hung remote watch from stopping all-cluster admission#11207
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
|
Hi @trilamsr. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
10565ed to
89b9ab3
Compare
|
/ok-to-test |
kshalot
left a comment
There was a problem hiding this comment.
/lgtm
I'm thinking about e2e testing this, although it might be tricky to simulate in an easy way (I was thinking of maybe using DROP in iptables).
|
LGTM label has been added. DetailsGit tree hash: 4f3aea30d663ca56ae7340e9771ba034777feaca |
|
Thanks @kshalot!
|
| // Bounds how long startWatcher waits for client.Watch() to return, | ||
| // so a hung remote cannot head-of-line block the reconciler. See #11206. |
There was a problem hiding this comment.
Exploratory question to confirm my understanding: could this issue be also mitigated by increasing the GroupKindConcurrency level for MultiKueueCluster for the controller, wdyt? Code pointer to the configuration: https://github.com/kubernetes-sigs/kueue/blob/main/apis/config/v1beta2/configuration_types.go#L264C2-L264C22
In that case when one Reconcile is stuck the other may continue
There was a problem hiding this comment.
could this issue be also mitigated by increasing the
GroupKindConcurrencylevel…
I don't think it actually buys us anything in this code path. The knob is wired through correctly (the controller doesn't override MaxConcurrentReconciles), so in principle you'd get N parallel workers. But Reconcile calls setRemoteClientConfig, which grabs clustersReconciler.lock (a single controller wide write mutex at multikueuecluster.go:394) and holds it through the entire setConfig → establishWatch → c.Watch() chain. So when one worker is parked inside c.Watch(), every other Reconcile that wants to do anything with a different cluster sits waiting on that same lock.
In practice concurrency=N collapses back to 1 whenever any remote is hung. Bumping the knob would only help after we also split that lock per cluster, or released it before establishing the watch. Probably a separate PR. Happy to file a follow up issue for it if you want.
There was a problem hiding this comment.
I see, thank you for the summary, I now understand why this would not work currently, and why fixing the problem is not straighforward.
Having said that, I think this renders GroupKindConcurrency useless for MultiKueueCluster due to very technical reasons which can be considered a bug by a user.
So, please file an issue for that. Feel free to work on it, or just keep it posted for some contributors.
There was a problem hiding this comment.
Filed #11297 to track the lock refactor. Leaving it unassigned for now so it's open for any contributor to pick up.
|
|
||
| // Bounds how long startWatcher waits for client.Watch() to return, | ||
| // so a hung remote cannot head-of-line block the reconciler. See #11206. | ||
| defaultWatchEstablishTimeout = 60 * time.Second |
There was a problem hiding this comment.
I'm wondering if the timeout at 60s is safe in practice. IIUC this watch is used for all types, that includes Workloads, which might be numerous. And when Watch is opened IIUC all workloads are Listed, which may take a lot of time at the api-server side to prepare the response, in particular if the Workload objects need to go via conversion webhooks.
Recently we had an issue in Kueue (single cluster) that at scale of 50k workloads when conversion webhooks are running it may take around 8min, whilst the etcd compaction timeout is only around 2.5min. The test we are working on here is to prevent that in the future: #9145
Also, it should not be that long when conversion webhooks are not necessary.
If my suspection is correct then maybe we should:
- use increased timeout for Workload objects (a bit hacky but maybe works well in practice (say 10min)
- use exponential timeout which is increased on retries (we already have the retry mechanism recently improved here: Add reconnect backoff guardrail to suppress redundant cluster reconciles #10990
There was a problem hiding this comment.
OTOH, in this particular case, will it actually list all the workloads? I think here it will just make na HTTP call, get the headers and then the events will start streaming (IIUC it will stream an ADDED event for every resource because we are not passing any resource version in the options).
AFAIR the 50k workloads issue happened because the Informer first does a List and then a Watch.
There was a problem hiding this comment.
OTOH, in this particular case, will it actually list all the workloads? I think here it will just make na HTTP call, get the headers and then the events will start streaming (IIUC it will stream an ADDED event for every resource because we are not passing any resource version in the options).
Possible, I'm not sure about the exact semantics, if this is the case then we are good.
Let's research that more and support the claim (that opening the watch does not entail blocking on listing the workloads) with references or experiments.
There was a problem hiding this comment.
Let's research that more…
Did some digging. TLDR: @kshalot is right on the client side, @mimowo is right on the server side, and the server side is the one that bites.
Client side: traced controller-runtime → client-go → newStreamWatcher. c.Watch() returns as soon as HTTP 200 headers arrive; there's no client-side List, no full-body buffering. So the 50k-workloads Informer scenario doesn't transfer directly. (Happy to drop in file:line refs if useful.)
Server side: c.Watch() still blocks until the apiserver sends those headers — and that's exactly the case @mimowo flagged. From kubernetes/kubernetes#136950:
If the API server does not have a "warm" cache […] before returning to Kueue it tries to convert them sequentially to v1beta2 […] estimated to take around 8min.
With 60s we'd cancel + retry on every cycle and never get past warmup, arguably worse than the pre-PR "wait forever, eventually succeed" for that path.
Proposal: bump defaultWatchEstablishTimeout to 10 min in this PR (matching @mimowo's "say 10min"). Tiny diff, preserves the hung-remote fix, avoids the cold-start regression.
For the more refined version — exponential timeout on retries that stacks with the failedConnAttempts / retryAfter machinery from #10990. I'll open a follow-up so this PR stays focused. WDYT?
There was a problem hiding this comment.
@trilamsr thank you for all the digging and super precise descriptions. We need to capture that with comments.
Proposal: bump defaultWatchEstablishTimeout to 10 min in this PR (matching @mimowo's "say 10min"). Tiny diff, preserves the hung-remote fix, avoids the cold-start regression.
I think 10min should be ok, espectially since we don't have now the discrepancy between the storage and serving versions, which bite us in 0.15. We may need to introduce the discrepancy when transitioning to v1beta3 one day, but at that point we will probably need to fix larger problems with kubernetes/kubernetes#136950
There was a problem hiding this comment.
Done in 9bd2d79. Bumped defaultWatchEstablishTimeout to 10 * time.Minute and expanded the doc block above it to capture the cold-cache + conversion-webhook reasoning and the link to kubernetes/kubernetes#136950. PTAL.
A hung client.Watch() against one remote MultiKueueCluster previously blocked the single multikueuecluster reconciler worker indefinitely, preventing every other cluster behind it from being reconciled. Those clusters keep remoteClient.connecting=true, the dispatcher then excludes them as inactive, and admission stops cluster-wide. Wrap the Watch establishment in a timeout-bounded helper. On timeout the in-flight Watch is canceled and an error is returned, so the existing failedConnAttempts / retryAfter backoff runs. Stream lifetime on the success path is unchanged: the returned watcher continues to use a context derived from the caller's ctx, and its cancel is owned by the watcher Stop method (no leak). Signed-off-by: Tri Lam <tree@lumalabs.ai>
Address review feedback: refactor the three subtests into a map-keyed table following the codebase's prevailing test style. Behaviour is unchanged; the per-case interceptor, expected error (matched via errors.Is), and elapsed-time ceiling are uniform. Signed-off-by: Tri Lam <trilamsr@gmail.com>
If c.Watch returns a non-nil watcher in the narrow window between time.After firing and the result-channel drain, the previous code discarded the watcher without calling Stop(). In production the watcher's HTTP stream is bound to establishCtx so cancel() tears it down indirectly, but fake clients used in tests ignore ctx and the watcher would leak. Drain the channel into a local and Stop() any returned watcher. Add a regression test using a sleeping interceptor and watch.NewFake() to assert Stop() was called. Signed-off-by: Tri Lam <trilamsr@gmail.com>
5d80a1c to
98a634b
Compare
60s would false-trip during apiserver watch-cache cold-start when the served version differs from the storage version and a conversion webhook is in play (kubernetes/kubernetes#136950, observed ~8 min at ~50k Workloads in Kueue 0.15). Expand the constant's doc comment to capture the rationale so future readers don't tighten the bound without understanding the cold-start path.
|
/release-note-edit |
|
@trilamsr release note proposal: /release-note-edit |
|
Thank you for fixing the issue upstream (in Kueue), nice contribution 👍 |
|
@mimowo: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
LGTM label has been added. DetailsGit tree hash: 4736e99c91362f408845898446646b17f4eda435 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mimowo, trilamsr The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Please also work on CPs in case the robot fails due to conflicts |
|
@mimowo: new pull request created: #11298 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@mimowo: new pull request created: #11299 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What type of PR is this?
/kind bug
/area multikueue
What this PR does / why we need it:
A hung
client.Watch()against one remote MultiKueueCluster could block the singlemultikueueclusterreconciler worker indefinitely, preventing every other cluster behind it from being reconciled. Those clusters keepremoteClient.connecting=true, the dispatcher excludes them, and admission stops cluster-wide.This bounds the Watch establishment phase with
watchEstablishTimeout(default 60s, package-level var overridable in tests). On timeout the in-flight Watch is canceled anderrWatchEstablishTimeoutis returned, falling back to the existingfailedConnAttempts/retryAfterbackoff. The successful-watch stream lifetime is unchanged — the returned watcher continues to use a context derived from the caller'sctx, and its cancel is owned by the watcher'sStop()method via a small wrapper (no goroutine leak, no lostcancel warning).Follow-up to #9968, which fixed the most common trigger (Cloudflare Tunnel buffering empty chunked watch responses) by enabling
AllowWatchBookmarks. This change defends against any future condition that hangsclient.Watch()similarly.Which issue(s) this PR fixes:
Fixes #11206
Special notes for your reviewer:
watchWithEstablishTimeoutfor direct unit testing.TestWatchWithEstablishTimeoutcovers three paths: hung Watch (timeout), Watch error (propagated immediately), Watch success (no timeout wait).go test ./pkg/controller/admissionchecks/multikueue/— all package tests pass.Does this PR introduce a user-facing change?