docs: add local service backend pool updater design#10253
Conversation
✅ Deploy Preview for kubernetes-sigs-cloud-provide-azure ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: nilo19 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
|
||
| The `removeOperation(serviceName)` method cannot remove operations already in a processing snapshot. To avoid stale retry behavior, the processing path must re-check service relevance before requeueing and before sending retry/failure events. | ||
|
|
||
| Parked operations waiting for `nextEligibleAt` live in `updater.operations`, so `removeOperation(serviceName)` can remove them while they are parked. If removal races with the short snapshot-to-requeue window, the next tick's relevance check drops the stale operation quietly. |
There was a problem hiding this comment.
Does the window where a removed service's operations are in-flight (snapshotted, ARM call in progress) could result in a successful CreateOrUpdate adding IPs for a service that was just deleted?
There was a problem hiding this comment.
will the removeOpeaation() acquire the lock during backendpoolupdater updating backend pool?
|
|
||
| Updater-level retry uses the existing `LoadBalancerBackendPoolUpdateIntervalInSeconds` tick; this design does not add a separate sleep loop inside `process()`. Without `Retry-After`, default retry count `3` and default interval `30s` means a continuously failing retriable condition emits retrying events on the first three failed attempts and emits the final failed event on the fourth failed attempt, roughly 90 seconds after the first observed failure plus ARM call latency. Depending on where the first failure lands relative to the updater tick and how long each ARM call takes, the wall-clock time from the original EndpointSlice change to final failure can be close to or above two minutes. | ||
|
|
||
| For 429 throttling, `Retry-After` overrides the next normal updater tick by setting `nextEligibleAt`. Ticks before `nextEligibleAt` only preserve the operation in the queue after re-checking Service/LB relevance; they do not call ARM, emit retrying events, or consume retry budget. `LoadBalancerBackendPoolUpdateRetryCount` bounds failed processing attempts, not elapsed wall-clock time, so a long `Retry-After` can delay final success or failure beyond the normal interval-based timing. |
There was a problem hiding this comment.
If long Retry-After values park operations while EndpointSlice events keep arriving, the in-memory queue grows unboundedly. Is it likely for the queue to grow dangerously? Does the queue size needs to be bounded?
| ```go | ||
| // LoadBalancerBackendPoolUpdateRetryCount is the number of retries for retriable | ||
| // local-service backend-pool update failures. Defaults to 3. | ||
| LoadBalancerBackendPoolUpdateRetryCount *int `json:"loadBalancerBackendPoolUpdateRetryCount,omitempty" yaml:"loadBalancerBackendPoolUpdateRetryCount,omitempty"` |
There was a problem hiding this comment.
Consider LoadBalancerBackendPoolUpdateMaxRetries to make "max retries after first failure" clearer
|
|
||
| ## Metrics | ||
|
|
||
| The updater metric should describe terminal outcomes, not intermediate retry attempts. |
There was a problem hiding this comment.
Would a separate retry counter (e.g., backend_pool_update_retries_total) be useful for monitoring Azure API instability, separate from the terminal outcome metric?
| 1. ARM wire `429` from `Get` with a parseable `Retry-After` in `azcore.ResponseError.RawResponse` sets `nextEligibleAt`; ticks before that time requeue quietly without ARM calls, retry events, retry-count increments, or metrics. | ||
| 2. ARM wire `429` from `CreateOrUpdate` follows the same `Retry-After` and requeue behavior. |
There was a problem hiding this comment.
Should 429 with Retry-After also emit LoadBalancerBackendPoolUpdateRetrying?
There was a problem hiding this comment.
I think so, as we requeue this. What is the concern here?
| 5. ARM `409` or `412` from `CreateOrUpdate` requeues, emits `LoadBalancerBackendPoolUpdateRetrying`, then succeeds on the next tick after a fresh `Get`. | ||
| 6. If any operation in a `lbName/backendPoolName` group is waiting for `nextEligibleAt`, the whole group is preserved and no same-group operation is processed early. | ||
| 7. A fresh operation merged with a requeued operation keeps its own retry counter; on group failure, all operations in the group consume one retry. | ||
| 8. Retry budget exhaustion emits `LoadBalancerBackendPoolUpdateFailed` and leaves the queue empty. |
There was a problem hiding this comment.
Worth adding test case covering mixed-budget groups.
| - Queue-preservation ticks while waiting for `nextEligibleAt` do not record a metric. | ||
| - Success records one successful observation when `LoadBalancerBackendPoolUpdated` is emitted. | ||
| - Terminal failure records one failed observation when `LoadBalancerBackendPoolUpdateFailed` is emitted. | ||
| - Stale resource-not-found and stale Service/LB drops record no observation, matching the existing quiet-skip behavior. |
There was a problem hiding this comment.
Currently, a 404 from Get or CreateOrUpdate leaves isOperationSucceeded = false and the deferred ObserveOperationWithResult(false) still fires, recording a failure metric. This would be a behavior change.
|
|
||
| ## Retry Timing | ||
|
|
||
| Updater-level retry uses the existing `LoadBalancerBackendPoolUpdateIntervalInSeconds` tick; this design does not add a separate sleep loop inside `process()`. Without `Retry-After`, default retry count `3` and default interval `30s` means a continuously failing retriable condition emits retrying events on the first three failed attempts and emits the final failed event on the fourth failed attempt, roughly 90 seconds after the first observed failure plus ARM call latency. Depending on where the first failure lands relative to the updater tick and how long each ARM call takes, the wall-clock time from the original EndpointSlice change to final failure can be close to or above two minutes. |
There was a problem hiding this comment.
The main reconciliation path uses exponential backoff. The updater reuses the tick loop. Was using exponential backoff considered for the updater retry?
/kind design
What this PR does / why we need it:
Adds a design document for refining the local service backend pool updater retry and error handling behavior.
The design covers bounded updater-level retries, 429 throttling handling, 409/412 conflict retries, SDK retry boundaries, event semantics, queue merge behavior, shutdown behavior, metrics, and the expected unit test coverage.
Which issue(s) this PR fixes:
NONE
Special notes for your reviewer:
This is a design-only PR. It intentionally treats
retryrepectthrottled.GetRetriableStatusCode()statuses as terminal at the updater layer because the Azure SDK retry policy already handles those statuses before the updater sees the error.Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: