MultiKueue: prevent a hung remote watch from stopping all-cluster admission by trilamsr · Pull Request #11207 · kubernetes-sigs/kueue

trilamsr · 2026-05-15T00:54:39Z

What type of PR is this?

/kind bug
/area multikueue

What this PR does / why we need it:

A hung client.Watch() against one remote MultiKueueCluster could block the single multikueuecluster reconciler worker indefinitely, preventing every other cluster behind it from being reconciled. Those clusters keep remoteClient.connecting=true, the dispatcher excludes them, and admission stops cluster-wide.

This bounds the Watch establishment phase with watchEstablishTimeout (default 60s, package-level var overridable in tests). On timeout the in-flight Watch is canceled and errWatchEstablishTimeout is returned, falling back to the existing failedConnAttempts / retryAfter backoff. The successful-watch stream lifetime is unchanged — the returned watcher continues to use a context derived from the caller's ctx, and its cancel is owned by the watcher's Stop() method via a small wrapper (no goroutine leak, no lostcancel warning).

Follow-up to #9968, which fixed the most common trigger (Cloudflare Tunnel buffering empty chunked watch responses) by enabling AllowWatchBookmarks. This change defends against any future condition that hangs client.Watch() similarly.

Which issue(s) this PR fixes:

Fixes #11206

Special notes for your reviewer:

Helper extracted as watchWithEstablishTimeout for direct unit testing.
New unit test TestWatchWithEstablishTimeout covers three paths: hung Watch (timeout), Watch error (propagated immediately), Watch success (no timeout wait).
Tested locally with go test ./pkg/controller/admissionchecks/multikueue/ — all package tests pass.

Does this PR introduce a user-facing change?

MultiKueue: Fixed a bug where a hung watch connection to one remote cluster could block
reconciliation of other MultiKueueClusters, leaving them inactive and preventing workload
admission. Kueue now applies a 10-minute circuit-breaking timeout while establishing
remote-cluster watches, allowing reconciliation to recover instead of blocking indefinitely.

netlify · 2026-05-15T00:54:46Z

✅ Deploy Preview for kubernetes-sigs-kueue canceled.

Name	Link
🔨 Latest commit	`9bd2d79`
🔍 Latest deploy log	https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a0b4bbb4d66990008278ef5

k8s-ci-robot · 2026-05-15T00:54:50Z

Hi @trilamsr. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

tenzen-y · 2026-05-15T03:14:04Z

/ok-to-test

kshalot

/lgtm

I'm thinking about e2e testing this, although it might be tricky to simulate in an easy way (I was thinking of maybe using DROP in iptables).

k8s-ci-robot · 2026-05-15T13:15:43Z

LGTM label has been added.

Details

Git tree hash: 4f3aea30d663ca56ae7340e9771ba034777feaca

trilamsr · 2026-05-15T13:25:35Z

Thanks @kshalot!

Inline nit: TestEstablishWatch is now table-driven.
e2e via iptables -j DROP: interesting idea, that does seem like the cleanest way to simulate an unresponsive remote apiserver

mimowo · 2026-05-15T14:08:50Z

+	// Bounds how long startWatcher waits for client.Watch() to return,
+	// so a hung remote cannot head-of-line block the reconciler. See #11206.


Exploratory question to confirm my understanding: could this issue be also mitigated by increasing the GroupKindConcurrency level for MultiKueueCluster for the controller, wdyt? Code pointer to the configuration: https://github.com/kubernetes-sigs/kueue/blob/main/apis/config/v1beta2/configuration_types.go#L264C2-L264C22
In that case when one Reconcile is stuck the other may continue

could this issue be also mitigated by increasing the GroupKindConcurrency level…

I don't think it actually buys us anything in this code path. The knob is wired through correctly (the controller doesn't override MaxConcurrentReconciles), so in principle you'd get N parallel workers. But Reconcile calls setRemoteClientConfig, which grabs clustersReconciler.lock (a single controller wide write mutex at multikueuecluster.go:394) and holds it through the entire setConfig → establishWatch → c.Watch() chain. So when one worker is parked inside c.Watch(), every other Reconcile that wants to do anything with a different cluster sits waiting on that same lock.

In practice concurrency=N collapses back to 1 whenever any remote is hung. Bumping the knob would only help after we also split that lock per cluster, or released it before establishing the watch. Probably a separate PR. Happy to file a follow up issue for it if you want.

I see, thank you for the summary, I now understand why this would not work currently, and why fixing the problem is not straighforward.

Having said that, I think this renders GroupKindConcurrency useless for MultiKueueCluster due to very technical reasons which can be considered a bug by a user.

So, please file an issue for that. Feel free to work on it, or just keep it posted for some contributors.

Filed #11297 to track the lock refactor. Leaving it unassigned for now so it's open for any contributor to pick up.

mimowo · 2026-05-15T14:36:57Z

+
+	// Bounds how long startWatcher waits for client.Watch() to return,
+	// so a hung remote cannot head-of-line block the reconciler. See #11206.
+	defaultWatchEstablishTimeout = 60 * time.Second


I'm wondering if the timeout at 60s is safe in practice. IIUC this watch is used for all types, that includes Workloads, which might be numerous. And when Watch is opened IIUC all workloads are Listed, which may take a lot of time at the api-server side to prepare the response, in particular if the Workload objects need to go via conversion webhooks.

Recently we had an issue in Kueue (single cluster) that at scale of 50k workloads when conversion webhooks are running it may take around 8min, whilst the etcd compaction timeout is only around 2.5min. The test we are working on here is to prevent that in the future: #9145

Also, it should not be that long when conversion webhooks are not necessary.

If my suspection is correct then maybe we should:

use increased timeout for Workload objects (a bit hacky but maybe works well in practice (say 10min)

use exponential timeout which is increased on retries (we already have the retry mechanism recently improved here: Add reconnect backoff guardrail to suppress redundant cluster reconciles #10990

OTOH, in this particular case, will it actually list all the workloads? I think here it will just make na HTTP call, get the headers and then the events will start streaming (IIUC it will stream an ADDED event for every resource because we are not passing any resource version in the options).

AFAIR the 50k workloads issue happened because the Informer first does a List and then a Watch.

OTOH, in this particular case, will it actually list all the workloads? I think here it will just make na HTTP call, get the headers and then the events will start streaming (IIUC it will stream an ADDED event for every resource because we are not passing any resource version in the options).

Possible, I'm not sure about the exact semantics, if this is the case then we are good.

Let's research that more and support the claim (that opening the watch does not entail blocking on listing the workloads) with references or experiments.

Let's research that more…

Did some digging. TLDR: @kshalot is right on the client side, @mimowo is right on the server side, and the server side is the one that bites.

Client side: traced controller-runtime → client-go → newStreamWatcher. c.Watch() returns as soon as HTTP 200 headers arrive; there's no client-side List, no full-body buffering. So the 50k-workloads Informer scenario doesn't transfer directly. (Happy to drop in file:line refs if useful.)

Server side: c.Watch() still blocks until the apiserver sends those headers — and that's exactly the case @mimowo flagged. From kubernetes/kubernetes#136950:

If the API server does not have a "warm" cache […] before returning to Kueue it tries to convert them sequentially to v1beta2 […] estimated to take around 8min.

With 60s we'd cancel + retry on every cycle and never get past warmup, arguably worse than the pre-PR "wait forever, eventually succeed" for that path.

Proposal: bump defaultWatchEstablishTimeout to 10 min in this PR (matching @mimowo's "say 10min"). Tiny diff, preserves the hung-remote fix, avoids the cold-start regression.

For the more refined version — exponential timeout on retries that stacks with the failedConnAttempts / retryAfter machinery from #10990. I'll open a follow-up so this PR stays focused. WDYT?

@trilamsr thank you for all the digging and super precise descriptions. We need to capture that with comments.

Proposal: bump defaultWatchEstablishTimeout to 10 min in this PR (matching @mimowo's "say 10min"). Tiny diff, preserves the hung-remote fix, avoids the cold-start regression.

I think 10min should be ok, espectially since we don't have now the discrepancy between the storage and serving versions, which bite us in 0.15. We may need to introduce the discrepancy when transitioning to v1beta3 one day, but at that point we will probably need to fix larger problems with kubernetes/kubernetes#136950

Done in 9bd2d79. Bumped defaultWatchEstablishTimeout to 10 * time.Minute and expanded the doc block above it to capture the cold-cache + conversion-webhook reasoning and the link to kubernetes/kubernetes#136950. PTAL.

A hung client.Watch() against one remote MultiKueueCluster previously blocked the single multikueuecluster reconciler worker indefinitely, preventing every other cluster behind it from being reconciled. Those clusters keep remoteClient.connecting=true, the dispatcher then excludes them as inactive, and admission stops cluster-wide. Wrap the Watch establishment in a timeout-bounded helper. On timeout the in-flight Watch is canceled and an error is returned, so the existing failedConnAttempts / retryAfter backoff runs. Stream lifetime on the success path is unchanged: the returned watcher continues to use a context derived from the caller's ctx, and its cancel is owned by the watcher Stop method (no leak). Signed-off-by: Tri Lam <tree@lumalabs.ai>

Address review feedback: refactor the three subtests into a map-keyed table following the codebase's prevailing test style. Behaviour is unchanged; the per-case interceptor, expected error (matched via errors.Is), and elapsed-time ceiling are uniform. Signed-off-by: Tri Lam <trilamsr@gmail.com>

If c.Watch returns a non-nil watcher in the narrow window between time.After firing and the result-channel drain, the previous code discarded the watcher without calling Stop(). In production the watcher's HTTP stream is bound to establishCtx so cancel() tears it down indirectly, but fake clients used in tests ignore ctx and the watcher would leak. Drain the channel into a local and Stop() any returned watcher. Add a regression test using a sleeping interceptor and watch.NewFake() to assert Stop() was called. Signed-off-by: Tri Lam <trilamsr@gmail.com>

60s would false-trip during apiserver watch-cache cold-start when the served version differs from the storage version and a conversion webhook is in play (kubernetes/kubernetes#136950, observed ~8 min at ~50k Workloads in Kueue 0.15). Expand the constant's doc comment to capture the rationale so future readers don't tighten the bound without understanding the cold-start path.

mimowo · 2026-05-18T17:35:28Z

/release-note-edit

MultiKueue: bound the establishment phase of remote watches with a 10min timeout so a hung `client.Watch()` against one remote cannot head-of-line block the multikueuecluster reconciler worker and stop admission cluster-wide.

mimowo · 2026-05-18T17:38:47Z

@trilamsr release note proposal:

/release-note-edit

MultiKueue: Fixed a bug where a hung watch connection to one remote cluster could block
reconciliation of other MultiKueueClusters, leaving them inactive and preventing workload
admission. Kueue now applies a 10-minute circuit-breaking timeout while establishing
remote-cluster watches, allowing reconciliation to recover instead of blocking indefinitely.

mimowo · 2026-05-18T17:40:08Z

Thank you for fixing the issue upstream (in Kueue), nice contribution 👍
/lgtm
/approve
/cherrypick release-0.17
/cherrypick release-0.16

k8s-infra-cherrypick-robot · 2026-05-18T17:40:11Z

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.16, release-0.17 in new PRs and assign them to you.

Details

In response to this:

Thank you for fixing the issue upstream (in Kueue), nice contribution 👍
/lgtm
/approve
/cherrypick release-0.17
/cherrypick release-0.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2026-05-18T17:40:17Z

LGTM label has been added.

Details

Git tree hash: 4736e99c91362f408845898446646b17f4eda435

k8s-ci-robot · 2026-05-18T17:40:19Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, trilamsr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [mimowo]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mimowo · 2026-05-18T17:40:25Z

Please also work on CPs in case the robot fails due to conflicts

k8s-infra-cherrypick-robot · 2026-05-18T18:59:59Z

@mimowo: new pull request created: #11298

Details

In response to this:

Thank you for fixing the issue upstream (in Kueue), nice contribution 👍
/lgtm
/approve
/cherrypick release-0.17
/cherrypick release-0.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-infra-cherrypick-robot · 2026-05-18T19:00:36Z

@mimowo: new pull request created: #11299

Details

In response to this:

Thank you for fixing the issue upstream (in Kueue), nice contribution 👍
/lgtm
/approve
/cherrypick release-0.17
/cherrypick release-0.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot requested review from PBundyra and kshalot May 15, 2026 00:54

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 15, 2026

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 15, 2026

trilamsr mentioned this pull request May 15, 2026

[MultiKueue] One hung remote watch blocks dispatch to every other cluster #11206

Closed

trilamsr force-pushed the fix/multikueue-watch-establish-timeout branch 2 times, most recently from 10565ed to 89b9ab3 Compare May 15, 2026 01:04

k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label May 15, 2026

trilamsr changed the title ~~Bound establishment phase of MultiKueue remote watches~~ MultiKueue: prevent a hung remote watch from stopping all-cluster admission May 15, 2026

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 15, 2026

kshalot reviewed May 15, 2026

View reviewed changes

Comment thread pkg/controller/admissionchecks/multikueue/multikueuecluster_test.go

k8s-ci-robot assigned kshalot May 15, 2026

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2026

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2026

k8s-ci-robot requested a review from kshalot May 15, 2026 13:25

mimowo reviewed May 15, 2026

View reviewed changes

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 15, 2026

trilamsr added 2 commits May 18, 2026 10:11

trilamsr force-pushed the fix/multikueue-watch-establish-timeout branch from 5d80a1c to 98a634b Compare May 18, 2026 17:12

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 18, 2026

k8s-ci-robot assigned mimowo May 18, 2026

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 18, 2026

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2026

trilamsr mentioned this pull request May 18, 2026

MultiKueue: GroupKindConcurrency has no effect because of a controller wide lock #11297

Open

k8s-ci-robot merged commit 91bdd46 into kubernetes-sigs:main May 18, 2026
37 checks passed

k8s-ci-robot added this to the v0.18 milestone May 18, 2026

k8s-infra-cherrypick-robot mentioned this pull request May 18, 2026

[release-0.17] MultiKueue: prevent a hung remote watch from stopping all-cluster admission #11298

Merged

k8s-infra-cherrypick-robot mentioned this pull request May 18, 2026

[release-0.16] MultiKueue: prevent a hung remote watch from stopping all-cluster admission #11299

Merged

This was referenced May 19, 2026

MultiKueue: hung remote takes 10 minutes to detect, should be 1 #11303

Open

MultiKueue: detect a hung remote in 1 minute, not 10 #11304

Open

		// Bounds how long startWatcher waits for client.Watch() to return,
		// so a hung remote cannot head-of-line block the reconciler. See #11206.

Conversation

trilamsr commented May 15, 2026 • edited by k8s-ci-robot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Uh oh!

netlify Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for kubernetes-sigs-kueue canceled.

Uh oh!

k8s-ci-robot commented May 15, 2026

Uh oh!

tenzen-y commented May 15, 2026

Uh oh!

kshalot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

k8s-ci-robot commented May 15, 2026

Uh oh!

trilamsr commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mimowo May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

trilamsr May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mimowo commented May 18, 2026

Uh oh!

mimowo commented May 18, 2026

Uh oh!

mimowo commented May 18, 2026

Uh oh!

k8s-infra-cherrypick-robot commented May 18, 2026

Uh oh!

k8s-ci-robot commented May 18, 2026

Uh oh!

k8s-ci-robot commented May 18, 2026

Uh oh!

mimowo commented May 18, 2026

Uh oh!

Uh oh!

k8s-infra-cherrypick-robot commented May 18, 2026

Uh oh!

k8s-infra-cherrypick-robot commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

trilamsr commented May 15, 2026 •

edited by k8s-ci-robot

Loading

netlify Bot commented May 15, 2026 •

edited

Loading

trilamsr commented May 15, 2026 •

edited

Loading

mimowo May 18, 2026 •

edited

Loading

trilamsr May 18, 2026 •

edited

Loading