fix(agent): remove reconnection delay on DuplicateServerError by IvanHunters · Pull Request #817 · kubernetes-sigs/apiserver-network-proxy

IvanHunters · 2026-03-26T15:22:27Z

Description

This PR fixes a regression in v0.33.0 where agents fail to reconnect after connection loss, leaving Kubernetes nodes in a permanent NotReady state.

Problem

In v0.33.0, when DuplicateServerError occurs during reconnection with clientsCount < serverCount, the code adds a ~1s delay:

} else {
    backoff = cs.resetBackoff()
    duration = wait.Jitter(backoff.Duration, backoff.Jitter)  // ~1s delay
}

Since newAgentClient() uses DNS load balancing, each call connects to a random proxy pod. When a client disconnects, a reconnection attempt has only a ~20% chance (1 in 5 pods) of selecting the one pod the agent is no longer connected to. Every wrong selection triggers DuplicateServerError, and the ~1s delay added after each one significantly slows reconnection.

Impact: Average reconnection time increases from ~500ms to ~5s per client. With multiple simultaneous disconnects (e.g., VM restart), cascading failures prevent full recovery.
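The figures above are roughly what a simple retry model predicts. The following is a minimal sketch, assuming each attempt lands on one of 5 pods uniformly at random and that a single attempt costs about 100ms. Both the 100ms figure and the function name expectedReconnect are illustrative assumptions, not values or identifiers from the agent code.

```go
package main

import (
	"fmt"
	"time"
)

// expectedReconnect returns the mean time needed to reach the one missing
// proxy pod when every attempt picks one of `pods` backends uniformly at random.
// attemptCost and retryDelay are illustrative parameters, not agent settings.
func expectedReconnect(pods int, attemptCost, retryDelay time.Duration) time.Duration {
	attempts := time.Duration(pods)     // mean of a geometric distribution with p = 1/pods
	failures := time.Duration(pods - 1) // on average, all attempts but the last one fail
	return attempts*attemptCost + failures*retryDelay
}

func main() {
	const pods = 5
	attempt := 100 * time.Millisecond // assumed dial + handshake cost per attempt

	fmt.Println(expectedReconnect(pods, attempt, 0))           // prints 500ms
	fmt.Println(expectedReconnect(pods, attempt, time.Second)) // prints 4.5s
}
```

Under these assumptions the delay, not the dialing, dominates the reconnection time, which is consistent with the ~500ms versus ~5s averages quoted above.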

Solution

Remove the else block that adds delay. Let duration remain 0 (default value), enabling immediate retry through time.Sleep(0). This allows fast reconnection via DNS load balancing without creating a tight loop (the loop already includes gRPC dial time and network round trips).
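To illustrate the retry behavior described above, here is a simplified sketch of a loop that retries immediately when the delay is 0. It is not the agent's actual code: errDuplicateServer, connectUntilNew, and the fake dial function are hypothetical stand-ins, and the real loop also pays for gRPC dialing and network round trips on every iteration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDuplicateServer is a hypothetical stand-in for DuplicateServerError.
var errDuplicateServer = errors.New("duplicate server")

// connectUntilNew calls dial until it returns something other than
// errDuplicateServer. After each duplicate it sleeps for retryDelay;
// a zero retryDelay means the next attempt starts immediately.
func connectUntilNew(dial func() error, retryDelay time.Duration) (attempts int, err error) {
	for {
		attempts++
		err = dial()
		if !errors.Is(err, errDuplicateServer) {
			return attempts, err
		}
		time.Sleep(retryDelay) // time.Sleep returns immediately for a zero duration
	}
}

func main() {
	calls := 0
	// Fake dial: the first four attempts hit already-connected pods.
	dial := func() error {
		calls++
		if calls < 5 {
			return errDuplicateServer
		}
		return nil
	}

	n, err := connectUntilNew(dial, 0)
	fmt.Println(n, err) // prints 5 <nil>
}
```

In this sketch the only pacing between attempts is whatever dial itself costs, which is the property the PR relies on to avoid a tight loop.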

Context

The release-0.33 branch already contains one fix from v0.34.0 (moving lastReceivedServerCount update to after successful AddClient()). This PR adds the second critical fix - removing the delay on reconnection attempts.

Together, these changes restore the fast reconnection behavior from v0.28.6 and align with the complete fix in v0.34.0.

Testing

Verified in a production cluster:

Before fix: Nodes remain NotReady permanently after VM restart (v0.33.0)
After fix: Expected behavior matches v0.34.0 - agents reconnect within seconds

Related: #816

linux-foundation-easycla · 2026-03-26T15:22:36Z

❌ - login: @IvanHunters / name: ohotnikov.ivan . The commit (1ecf9aa) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please submit a support request ticket.

k8s-ci-robot · 2026-03-26T15:22:37Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: IvanHunters
Once this PR has been reviewed and has the lgtm label, please assign ipochi for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2026-03-26T15:22:38Z

Welcome @IvanHunters!

It looks like this is your first PR to kubernetes-sigs/apiserver-network-proxy 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/apiserver-network-proxy has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2026-03-26T15:22:40Z

Hi @IvanHunters. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

When clientsCount < serverCount after a client disconnects, the agent needs to establish a new connection quickly. The previous code added a ~1s delay (resetBackoff + jitter) on each DuplicateServerError, which occurs frequently due to DNS load balancing connecting to already-connected proxy pods.

This delay causes slow reconnection (average ~5s per client) and cascading failures when multiple clients disconnect simultaneously, leaving nodes in permanent NotReady state.

By removing the else block, duration remains 0, enabling immediate retry and fast reconnection through DNS load balancing.

Signed-off-by: ohotnikov.ivan <49371933+IvanHunters@users.noreply.github.com>

cheftako · 2026-04-01T21:14:57Z

/ok-to-test

cheftako · 2026-04-24T15:35:19Z

-					backoff = cs.resetBackoff()
-					duration = wait.Jitter(backoff.Duration, backoff.Jitter)
 				}
+				// When clientsCount < serverCount, we need a new connection.


This could have a negative impact if something else is going on and we end up generating a ton of connection requests. I'm happy to have this added as an option, but I think it would be safer to put this behavior behind a flag. We should also default to the existing behavior so that we stay backward compatible.
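One way the flag-gated variant suggested here could look is sketched below. This is only an illustration: the flag name immediate-duplicate-retry and the helpers parseImmediateRetry and retryDelay are hypothetical, the sketch uses the standard library flag package rather than whatever flag handling the agent uses, and the default keeps the existing backoff.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// retryDelay picks the wait applied after a DuplicateServerError.
// With immediate set it returns 0; otherwise it keeps the given backoff.
func retryDelay(immediate bool, backoff time.Duration) time.Duration {
	if immediate {
		return 0
	}
	return backoff
}

// parseImmediateRetry reads the hypothetical opt-in flag from args.
// The flag defaults to false, so the existing delay stays in place.
func parseImmediateRetry(args []string) (bool, error) {
	fs := flag.NewFlagSet("agent", flag.ContinueOnError)
	immediate := fs.Bool("immediate-duplicate-retry", false,
		"retry without delay after a duplicate server error (hypothetical flag)")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *immediate, nil
}

func main() {
	on, _ := parseImmediateRetry([]string{"-immediate-duplicate-retry"})
	off, _ := parseImmediateRetry(nil)

	fmt.Println(retryDelay(on, time.Second), retryDelay(off, time.Second)) // prints 0s 1s
}
```

Keeping the default at false means clusters that do not set the flag see no change in behavior, which addresses the backward-compatibility concern.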

cheftako · 2026-04-24T15:35:31Z

/ok-to-test

cheftako · 2026-05-01T22:37:03Z

Can you rebase on the latest?
Also, the PR cannot be merged until you sign the CLA.

k8s-ci-robot added the do-not-merge/invalid-commit-message label (indicates that a PR should not merge because it has an invalid commit message) on Mar 26, 2026

k8s-ci-robot requested review from ipochi and tallclair on March 26, 2026 15:22

k8s-ci-robot added the cncf-cla: no (indicates the PR's author has not signed the CNCF CLA) and needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) labels on Mar 26, 2026

k8s-ci-robot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) on Mar 26, 2026

IvanHunters force-pushed the fix/agent-reconnection-delay-v0.33 branch from bb4ae40 to 1ecf9aa on March 26, 2026 15:24

k8s-ci-robot removed the do-not-merge/invalid-commit-message label (indicates that a PR should not merge because it has an invalid commit message) on Mar 26, 2026

k8s-ci-robot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Apr 1, 2026

cheftako reviewed Apr 24, 2026

IvanHunters wants to merge 1 commit into kubernetes-sigs:release-0.33 from IvanHunters:fix/agent-reconnection-delay-v0.33
