fix(agent): remove reconnection delay on DuplicateServerError #817
IvanHunters wants to merge 1 commit into
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: IvanHunters.
|
Welcome @IvanHunters! |
|
Hi @IvanHunters. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. Regular contributors should join the org to skip this step. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
When clientsCount < serverCount after a client disconnects, the agent needs to establish a new connection quickly. The previous code added a ~1s delay (resetBackoff + jitter) on each DuplicateServerError, which occurs frequently due to DNS load balancing connecting to already-connected proxy pods.

This delay causes slow reconnection (average ~5s per client) and cascading failures when multiple clients disconnect simultaneously, leaving nodes in permanent NotReady state.

By removing the else block, duration remains 0, enabling immediate retry and fast reconnection through DNS load balancing.

Signed-off-by: ohotnikov.ivan <49371933+IvanHunters@users.noreply.github.com>
bb4ae40 to 1ecf9aa
/ok-to-test
    backoff = cs.resetBackoff()
    duration = wait.Jitter(backoff.Duration, backoff.Jitter)
}
// When clientsCount < serverCount, we need a new connection.
This has potential negative impact if something else is going on and now we generate a ton of connection requests. Happy to have this added as an option but I think it would be safer to add a flag for this behavior. We should also default to the existing behavior so we are backward compatible.
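A minimal sketch of what the opt-in suggested above could look like. The flag name `immediate-retry-on-duplicate-server` and the helpers `parseImmediateRetry` and `retryDelay` are hypothetical and are not part of the agent's actual flag set. The default is `false`, so the existing delay is kept unless the flag is passed.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"time"
)

// parseImmediateRetry reads a hypothetical opt-in flag. It defaults to false
// so that existing deployments keep the current delayed-retry behavior.
func parseImmediateRetry(args []string) (bool, error) {
	fs := flag.NewFlagSet("agent", flag.ContinueOnError)
	immediate := fs.Bool(
		"immediate-retry-on-duplicate-server", // hypothetical flag name
		false,
		"retry without delay after DuplicateServerError when more connections are needed",
	)
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *immediate, nil
}

// retryDelay picks the wait before the next dial after a DuplicateServerError.
// With immediate=false it returns the supplied backoff unchanged.
func retryDelay(immediate bool, backoff time.Duration) time.Duration {
	if immediate {
		return 0
	}
	return backoff
}

func main() {
	immediate, err := parseImmediateRetry(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("delay before next dial:", retryDelay(immediate, time.Second))
}
```

Keeping the decision in one small function would let the default path stay byte-for-byte equivalent to the current behavior while the new path is exercised only by clusters that opt in.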
/ok-to-test
Can you rebase on the latest?
Description
This PR fixes a regression in v0.33.0 where agents fail to reconnect after connection loss, leaving Kubernetes nodes in permanent `NotReady` state.

Problem
In v0.33.0, when `DuplicateServerError` occurs during reconnection with `clientsCount < serverCount`, the code adds a ~1s delay (`resetBackoff` plus jitter).

Since `newAgentClient()` uses DNS load balancing, it connects to a random proxy pod. When a client disconnects, the reconnection attempt has only a ~20% chance (1/5 pods) of selecting the correct pod. Wrong selections trigger `DuplicateServerError`, and the 1s delay significantly slows reconnection.

Impact: Average reconnection time increases from ~500ms to ~5s per client. With multiple simultaneous disconnects (e.g., VM restart), cascading failures prevent full recovery.
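The reconnection figures above are roughly what a simple geometric model predicts. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes each attempt independently reaches the missing pod with probability 1/5, and it assumes a dial attempt costs about 100ms (a hypothetical value chosen for illustration). The function name `expectedReconnect` is likewise illustrative.

```go
package main

import "fmt"

// expectedReconnect estimates the mean time (in seconds) to reach a proxy pod
// that is not yet connected, assuming every attempt succeeds independently
// with probability p (a geometric distribution).
func expectedReconnect(p, attemptCost, failureDelay float64) float64 {
	attempts := 1 / p        // mean number of attempts, including the successful one
	failures := attempts - 1 // mean number of DuplicateServerError results
	return attempts*attemptCost + failures*failureDelay
}

func main() {
	p := 1.0 / 5.0 // 5 proxy pods, one of them missing (figure from the description)
	dial := 0.1    // assumed cost of one dial attempt, in seconds (hypothetical)

	fmt.Printf("with ~1s delay per failure: %.1fs\n", expectedReconnect(p, dial, 1.0))
	fmt.Printf("with no delay:              %.1fs\n", expectedReconnect(p, dial, 0))
}
```

Under these assumptions the mean is five attempts, four of which fail, so the per-failure delay dominates the total. The model ignores the growing backoff and the case where several pods are missing at once.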
Solution
Remove the else block that adds the delay. `duration` then keeps its default value of 0, so the loop retries immediately via `time.Sleep(0)`. This allows fast reconnection via DNS load balancing without creating a tight loop (the loop already includes gRPC dial time and network round trips).

Context
The release-0.33 branch already contains one fix from v0.34.0 (moving the `lastReceivedServerCount` update to after a successful `AddClient()`). This PR adds the second critical fix: removing the delay on reconnection attempts.

Together, these changes restore the fast reconnection behavior from v0.28.6 and align with the complete fix in v0.34.0.
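The retry behavior described under Solution can be illustrated with a simplified, self-contained loop. This is a sketch and not the agent's actual code: `errDuplicateServer`, `dialer`, and `connectUntilNew` are hypothetical stand-ins, and the `maxAttempts` bound is added here only so the example terminates.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDuplicateServer is a stand-in for the agent's DuplicateServerError.
var errDuplicateServer = errors.New("duplicate server")

// dialer is a hypothetical stand-in for one connection attempt through
// DNS load balancing.
type dialer func() error

// connectUntilNew retries dial until it succeeds, fails with a different
// error, or maxAttempts is reached. It returns the number of attempts made.
func connectUntilNew(dial dialer, delayOnDuplicate time.Duration, maxAttempts int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := dial()
		if err == nil {
			return attempt, nil
		}
		if !errors.Is(err, errDuplicateServer) {
			return attempt, err
		}
		// With a zero duration, time.Sleep returns immediately, so the
		// next dial starts right away.
		time.Sleep(delayOnDuplicate)
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, errDuplicateServer)
}

func main() {
	// Deterministic fake: the first three dials land on already-connected pods.
	calls := 0
	dial := func() error {
		calls++
		if calls <= 3 {
			return errDuplicateServer
		}
		return nil
	}

	start := time.Now()
	attempts, err := connectUntilNew(dial, 0, 10)
	fmt.Println("attempts:", attempts, "err:", err, "elapsed:", time.Since(start))
}
```

Passing `time.Second` instead of `0` as `delayOnDuplicate` approximates the previous behavior, where each wrong pod selection costs about a second before the next attempt.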
Testing
Verified in production cluster:
Related: #816