
[release-0.17] MultiKueue: prevent a hung remote watch from stopping all-cluster admission #11298

Merged
k8s-ci-robot merged 4 commits into kubernetes-sigs:release-0.17 from k8s-infra-cherrypick-robot:cherry-pick-11207-to-release-0.17
May 18, 2026

Conversation


@k8s-infra-cherrypick-robot k8s-infra-cherrypick-robot commented May 18, 2026

This is an automated cherry-pick of #11207

/assign mimowo

NONE

trilamsr added 4 commits May 18, 2026 18:59
A hung client.Watch() against one remote MultiKueueCluster previously
blocked the single multikueuecluster reconciler worker indefinitely,
preventing every other cluster behind it from being reconciled. Those
clusters keep remoteClient.connecting=true, the dispatcher then excludes
them as inactive, and admission stops cluster-wide.

Wrap the Watch establishment in a timeout-bounded helper. On timeout
the in-flight Watch is canceled and an error is returned, so the
existing failedConnAttempts / retryAfter backoff runs. Stream lifetime
on the success path is unchanged: the returned watcher continues to use
a context derived from the caller's ctx, and its cancel is owned by
the watcher Stop method (no leak).

Signed-off-by: Tri Lam <tree@lumalabs.ai>

Address review feedback: refactor the three subtests into a
map-keyed table following the codebase's prevailing test style.
Behaviour is unchanged; the per-case interceptor, expected error
(matched via errors.Is), and elapsed-time ceiling are uniform.

Signed-off-by: Tri Lam <trilamsr@gmail.com>

If c.Watch returns a non-nil watcher in the narrow window between
time.After firing and the result-channel drain, the previous code
discarded the watcher without calling Stop(). In production the
watcher's HTTP stream is bound to establishCtx so cancel() tears it
down indirectly, but fake clients used in tests ignore ctx and the
watcher would leak.

Drain the channel into a local and Stop() any returned watcher.
Add a regression test using a sleeping interceptor and watch.NewFake()
to assert Stop() was called.
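
The regression scenario can be sketched without Kubernetes dependencies. This is an illustration under assumptions: `recordingWatcher` plays the role that `watch.NewFake()` plays in the real test, the goroutine with `time.Sleep` plays the role of the sleeping interceptor, and `drainAndStop` and `lateWatcherIsStopped` are hypothetical names. The drain is synchronous here only so that the check is deterministic.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Watcher is a hypothetical stand-in for a watch handle.
type Watcher interface{ Stop() }

type watchResult struct {
	w   Watcher
	err error
}

// recordingWatcher remembers whether Stop was called.
type recordingWatcher struct{ stopped atomic.Bool }

func (rw *recordingWatcher) Stop() { rw.stopped.Store(true) }

// drainAndStop receives the pending result into a local and stops any
// watcher it carries, so a late watcher is not silently discarded.
func drainAndStop(resCh <-chan watchResult) {
	if r := <-resCh; r.w != nil {
		r.w.Stop()
	}
}

// lateWatcherIsStopped simulates a Watch call that ignores cancellation and
// returns a watcher only after the timeout has already fired.
func lateWatcherIsStopped() bool {
	fake := &recordingWatcher{}
	resCh := make(chan watchResult, 1)
	go func() {
		time.Sleep(200 * time.Millisecond) // slower than the timeout below
		resCh <- watchResult{w: fake}
	}()

	select {
	case <-resCh:
		return false // not the scenario under test: the result beat the timeout
	case <-time.After(10 * time.Millisecond):
		drainAndStop(resCh)
	}
	return fake.stopped.Load()
}

func main() {
	fmt.Println("late watcher stopped:", lateWatcherIsStopped())
}
```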

Signed-off-by: Tri Lam <trilamsr@gmail.com>

A 60s bound would false-trip during apiserver watch-cache cold-start when the
served version differs from the storage version and a conversion webhook
is in play (kubernetes/kubernetes#136950, observed ~8 min at ~50k
Workloads in Kueue 0.15). Expand the constant's doc comment to capture
the rationale so future readers don't tighten the bound without
understanding the cold-start path.

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label May 18, 2026
@k8s-ci-robot k8s-ci-robot added this to the v0.17 milestone May 18, 2026
netlify Bot commented May 18, 2026

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit 531f16f
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a0b61b17f576a00088463df
😎 Deploy Preview https://deploy-preview-11298--kubernetes-sigs-kueue.netlify.app

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 18, 2026

mimowo commented May 18, 2026

/kind bug
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 18, 2026
@k8s-ci-robot

LGTM label has been added.

Git tree hash: c1ff1ece2b6c2d88b9be8b7c0825650d7b09195e

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: k8s-infra-cherrypick-robot, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2026
@k8s-ci-robot k8s-ci-robot merged commit 6535386 into kubernetes-sigs:release-0.17 May 18, 2026
36 checks passed

mimowo commented May 19, 2026

Clearing the release note, as we will have a combined release note in the follow-up PR: #11304
/release-note-edit

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 19, 2026

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
kind/bug: Categorizes issue or PR as related to a bug.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
release-note-none: Denotes a PR that doesn't merit a release note.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

4 participants