MultiKueue: prevent a hung remote watch from stopping all-cluster admission #11207

Merged: k8s-ci-robot merged 4 commits into kubernetes-sigs:main from trilamsr:fix/multikueue-watch-establish-timeout on May 18, 2026

@trilamsr
Contributor

@trilamsr trilamsr commented May 15, 2026

What type of PR is this?

/kind bug
/area multikueue

What this PR does / why we need it:

A hung client.Watch() against one remote MultiKueueCluster could block the single multikueuecluster reconciler worker indefinitely, preventing every other cluster behind it from being reconciled. Those clusters keep remoteClient.connecting=true, the dispatcher excludes them, and admission stops cluster-wide.

This bounds the Watch establishment phase with watchEstablishTimeout (a package-level var, overridable in tests; 60s as first proposed, raised to 10 minutes during review in 9bd2d79). On timeout the in-flight Watch is canceled and errWatchEstablishTimeout is returned, falling back to the existing failedConnAttempts / retryAfter backoff. The stream lifetime of a successful watch is unchanged: the returned watcher continues to use a context derived from the caller's ctx, and its cancel is owned by the watcher's Stop() method via a small wrapper (no goroutine leak, no lostcancel warning).

Follow-up to #9968, which fixed the most common trigger (Cloudflare Tunnel buffering empty chunked watch responses) by enabling AllowWatchBookmarks. This change defends against any future condition that hangs client.Watch() similarly.

Which issue(s) this PR fixes:

Fixes #11206

Special notes for your reviewer:

  • Helper extracted as watchWithEstablishTimeout for direct unit testing.
  • New unit test TestWatchWithEstablishTimeout covers three paths: hung Watch (timeout), Watch error (propagated immediately), Watch success (no timeout wait).
  • Tested locally with go test ./pkg/controller/admissionchecks/multikueue/ — all package tests pass.

Does this PR introduce a user-facing change?

MultiKueue: Fixed a bug where a hung watch connection to one remote cluster could block
reconciliation of other MultiKueueClusters, leaving them inactive and preventing workload
admission. Kueue now applies a 10-minute circuit-breaking timeout while establishing
remote-cluster watches, allowing reconciliation to recover instead of blocking indefinitely.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. kind/bug Categorizes issue or PR as related to a bug. area/multikueue Issues or PRs related to MultiKueue labels May 15, 2026
@netlify

netlify Bot commented May 15, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 9bd2d79
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a0b4bbb4d66990008278ef5

@k8s-ci-robot k8s-ci-robot requested review from PBundyra and kshalot May 15, 2026 00:54
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 15, 2026
@k8s-ci-robot
Contributor

Hi @trilamsr. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 15, 2026
@trilamsr trilamsr force-pushed the fix/multikueue-watch-establish-timeout branch 2 times, most recently from 10565ed to 89b9ab3 Compare May 15, 2026 01:04
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label May 15, 2026
@trilamsr trilamsr changed the title from "Bound establishment phase of MultiKueue remote watches" to "MultiKueue: prevent a hung remote watch from stopping all-cluster admission" May 15, 2026
@tenzen-y
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 15, 2026
Contributor

@kshalot kshalot left a comment

/lgtm

I'm thinking about e2e testing this, although it might be tricky to simulate in an easy way (I was thinking of maybe using DROP in iptables).

Comment thread pkg/controller/admissionchecks/multikueue/multikueuecluster_test.go
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Details

Git tree hash: 4f3aea30d663ca56ae7340e9771ba034777feaca

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2026
@k8s-ci-robot k8s-ci-robot requested a review from kshalot May 15, 2026 13:25
@trilamsr
Contributor Author

trilamsr commented May 15, 2026

Thanks @kshalot!

  • Inline nit: TestEstablishWatch is now table-driven.
  • e2e via iptables -j DROP: interesting idea; that does seem like the cleanest way to simulate an unresponsive remote apiserver.

Comment on lines +74 to +75
// Bounds how long startWatcher waits for client.Watch() to return,
// so a hung remote cannot head-of-line block the reconciler. See #11206.
Contributor

Exploratory question to confirm my understanding: could this issue be also mitigated by increasing the GroupKindConcurrency level for MultiKueueCluster for the controller, wdyt? Code pointer to the configuration: https://github.com/kubernetes-sigs/kueue/blob/main/apis/config/v1beta2/configuration_types.go#L264C2-L264C22
In that case, when one Reconcile is stuck, the others may continue.

Contributor Author

could this issue be also mitigated by increasing the GroupKindConcurrency level…

I don't think it actually buys us anything in this code path. The knob is wired through correctly (the controller doesn't override MaxConcurrentReconciles), so in principle you'd get N parallel workers. But Reconcile calls setRemoteClientConfig, which grabs clustersReconciler.lock (a single controller-wide write mutex at multikueuecluster.go:394) and holds it through the entire setConfig → establishWatch → c.Watch() chain. So when one worker is parked inside c.Watch(), every other Reconcile that wants to do anything with a different cluster sits waiting on that same lock.

In practice concurrency=N collapses back to 1 whenever any remote is hung. Bumping the knob would only help after we also split that lock per cluster, or released it before establishing the watch. Probably a separate PR. Happy to file a follow-up issue for it if you want.

Contributor

I see, thank you for the summary. I now understand why this would not work currently, and why fixing the problem is not straightforward.

Having said that, I think this renders GroupKindConcurrency useless for MultiKueueCluster due to very technical reasons which can be considered a bug by a user.

So, please file an issue for that. Feel free to work on it, or just keep it posted for some contributors.

Contributor Author

Filed #11297 to track the lock refactor. Leaving it unassigned for now so it's open for any contributor to pick up.


// Bounds how long startWatcher waits for client.Watch() to return,
// so a hung remote cannot head-of-line block the reconciler. See #11206.
defaultWatchEstablishTimeout = 60 * time.Second
Contributor

I'm wondering if the timeout at 60s is safe in practice. IIUC this watch is used for all types, that includes Workloads, which might be numerous. And when Watch is opened IIUC all workloads are Listed, which may take a lot of time at the api-server side to prepare the response, in particular if the Workload objects need to go via conversion webhooks.

Recently we had an issue in Kueue (single cluster) that at scale of 50k workloads when conversion webhooks are running it may take around 8min, whilst the etcd compaction timeout is only around 2.5min. The test we are working on here is to prevent that in the future: #9145

Also, it should not be that long when conversion webhooks are not necessary.

If my suspicion is correct then maybe we should:

Contributor

OTOH, in this particular case, will it actually list all the workloads? I think here it will just make an HTTP call, get the headers and then the events will start streaming (IIUC it will stream an ADDED event for every resource because we are not passing any resource version in the options).

AFAIR the 50k workloads issue happened because the Informer first does a List and then a Watch.

Contributor

@mimowo mimowo May 18, 2026

OTOH, in this particular case, will it actually list all the workloads? I think here it will just make an HTTP call, get the headers and then the events will start streaming (IIUC it will stream an ADDED event for every resource because we are not passing any resource version in the options).

Possible, I'm not sure about the exact semantics, if this is the case then we are good.

Let's research that more and support the claim (that opening the watch does not entail blocking on listing the workloads) with references or experiments.

Contributor Author

@trilamsr trilamsr May 18, 2026

Let's research that more…

Did some digging. TLDR: @kshalot is right on the client side, @mimowo is right on the server side, and the server side is the one that bites.

Client side: traced controller-runtime → client-go → newStreamWatcher. c.Watch() returns as soon as HTTP 200 headers arrive; there's no client-side List, no full-body buffering. So the 50k-workloads Informer scenario doesn't transfer directly. (Happy to drop in file:line refs if useful.)

Server side: c.Watch() still blocks until the apiserver sends those headers — and that's exactly the case @mimowo flagged. From kubernetes/kubernetes#136950:

If the API server does not have a "warm" cache […] before returning to Kueue it tries to convert them sequentially to v1beta2 […] estimated to take around 8min.

With 60s we'd cancel + retry on every cycle and never get past warmup, arguably worse than the pre-PR "wait forever, eventually succeed" for that path.

Proposal: bump defaultWatchEstablishTimeout to 10 min in this PR (matching @mimowo's "say 10min"). Tiny diff, preserves the hung-remote fix, avoids the cold-start regression.

For the more refined version (an exponential timeout on retries that stacks with the failedConnAttempts / retryAfter machinery from #10990), I'll open a follow-up so this PR stays focused. WDYT?

Contributor

@trilamsr thank you for all the digging and super precise descriptions. We need to capture that with comments.

Proposal: bump defaultWatchEstablishTimeout to 10 min in this PR (matching @mimowo's "say 10min"). Tiny diff, preserves the hung-remote fix, avoids the cold-start regression.

I think 10min should be ok, especially since we no longer have the discrepancy between the storage and serving versions, which bit us in 0.15. We may need to introduce the discrepancy when transitioning to v1beta3 one day, but at that point we will probably need to fix larger problems with kubernetes/kubernetes#136950

Contributor Author

Done in 9bd2d79. Bumped defaultWatchEstablishTimeout to 10 * time.Minute and expanded the doc block above it to capture the cold-cache + conversion-webhook reasoning and the link to kubernetes/kubernetes#136950. PTAL.

Contributor

sgtm

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 15, 2026
A hung client.Watch() against one remote MultiKueueCluster previously
blocked the single multikueuecluster reconciler worker indefinitely,
preventing every other cluster behind it from being reconciled. Those
clusters keep remoteClient.connecting=true, the dispatcher then excludes
them as inactive, and admission stops cluster-wide.

Wrap the Watch establishment in a timeout-bounded helper. On timeout
the in-flight Watch is canceled and an error is returned, so the
existing failedConnAttempts / retryAfter backoff runs. Stream lifetime
on the success path is unchanged: the returned watcher continues to use
a context derived from the caller's ctx, and its cancel is owned by
the watcher Stop method (no leak).

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added 2 commits May 18, 2026 10:11
Address review feedback: refactor the three subtests into a
map-keyed table following the codebase's prevailing test style.
Behaviour is unchanged; the per-case interceptor, expected error
(matched via errors.Is), and elapsed-time ceiling are uniform.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
If c.Watch returns a non-nil watcher in the narrow window between
time.After firing and the result-channel drain, the previous code
discarded the watcher without calling Stop(). In production the
watcher's HTTP stream is bound to establishCtx so cancel() tears it
down indirectly, but fake clients used in tests ignore ctx and the
watcher would leak.

Drain the channel into a local and Stop() any returned watcher.
Add a regression test using a sleeping interceptor and watch.NewFake()
to assert Stop() was called.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
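
The late-result handling described in the commit message above can be sketched as a non-blocking drain of the result channel. This is an illustration only: `Watcher`, `watchResult`, `fakeWatcher`, and `stopLateWatcher` are hypothetical names, and the sketch assumes the result channel is buffered with capacity 1 so the producing goroutine never blocks.

```go
package main

import "fmt"

// Watcher stands in for the real watch interface; only Stop matters here.
type Watcher interface{ Stop() }

type watchResult struct {
	w   Watcher
	err error
}

// fakeWatcher records whether Stop was called. Like the fake clients
// mentioned above, it does not react to context cancellation.
type fakeWatcher struct{ stopped bool }

func (f *fakeWatcher) Stop() { f.stopped = true }

// stopLateWatcher is meant to run on the timeout path. It checks, without
// blocking, whether a result slipped in after the timer fired, and stops any
// watcher it finds. It reports whether a watcher was stopped.
func stopLateWatcher(resCh <-chan watchResult) bool {
	select {
	case r := <-resCh:
		if r.w != nil {
			r.w.Stop()
			return true
		}
	default:
		// Nothing has arrived; canceling the context is all that is left to do.
	}
	return false
}

func main() {
	resCh := make(chan watchResult, 1)
	late := &fakeWatcher{}
	resCh <- watchResult{w: late} // simulate a watcher arriving just after the timer fired

	fmt.Println(stopLateWatcher(resCh), late.stopped) // prints "true true"
}
```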
@trilamsr trilamsr force-pushed the fix/multikueue-watch-establish-timeout branch from 5d80a1c to 98a634b Compare May 18, 2026 17:12
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 18, 2026
60s would false-trip during apiserver watch-cache cold-start when the
served version differs from the storage version and a conversion webhook
is in play (kubernetes/kubernetes#136950, observed ~8 min at ~50k
Workloads in Kueue 0.15). Expand the constant's doc comment to capture
the rationale so future readers don't tighten the bound without
understanding the cold-start path.
@mimowo
Contributor

mimowo commented May 18, 2026

/release-note-edit

MultiKueue: bound the establishment phase of remote watches with a 10min timeout so a hung `client.Watch()` against one remote cannot head-of-line block the multikueuecluster reconciler worker and stop admission cluster-wide.

@mimowo
Contributor

mimowo commented May 18, 2026

@trilamsr release note proposal:

/release-note-edit

MultiKueue: Fixed a bug where a hung watch connection to one remote cluster could block
reconciliation of other MultiKueueClusters, leaving them inactive and preventing workload
admission. Kueue now applies a 10-minute circuit-breaking timeout while establishing
remote-cluster watches, allowing reconciliation to recover instead of blocking indefinitely.

@mimowo
Contributor

mimowo commented May 18, 2026

Thank you for fixing the issue upstream (in Kueue), nice contribution 👍
/lgtm
/approve
/cherrypick release-0.17
/cherrypick release-0.16

@k8s-infra-cherrypick-robot
Contributor

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.16, release-0.17 in new PRs and assign them to you.

Details

In response to this:

Thank you for fixing the issue upstream (in Kueue), nice contribution 👍
/lgtm
/approve
/cherrypick release-0.17
/cherrypick release-0.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 18, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Details

Git tree hash: 4736e99c91362f408845898446646b17f4eda435

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, trilamsr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2026
@mimowo
Contributor

mimowo commented May 18, 2026

Please also work on CPs in case the robot fails due to conflicts

@k8s-ci-robot k8s-ci-robot merged commit 91bdd46 into kubernetes-sigs:main May 18, 2026
37 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.18 milestone May 18, 2026
@k8s-infra-cherrypick-robot
Contributor

@mimowo: new pull request created: #11298


@k8s-infra-cherrypick-robot
Contributor

@mimowo: new pull request created: #11299


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/multikueue Issues or PRs related to MultiKueue cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[MultiKueue] One hung remote watch blocks dispatch to every other cluster

6 participants