docs: add local service backend pool updater design by nilo19 · Pull Request #10253 · kubernetes-sigs/cloud-provider-azure

nilo19 · 2026-05-03T12:34:26Z

/kind design

What this PR does / why we need it:

Adds a design document for refining the local service backend pool updater retry and error handling behavior.

The design covers bounded updater-level retries, 429 throttling handling, 409/412 conflict retries, SDK retry boundaries, event semantics, queue merge behavior, shutdown behavior, metrics, and the expected unit test coverage.

Which issue(s) this PR fixes:

NONE

Special notes for your reviewer:

This is a design-only PR. It intentionally treats retryrepectthrottled.GetRetriableStatusCode() statuses as terminal at the updater layer because the Azure SDK retry policy already handles those statuses before the updater sees the error.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Adds a design doc under content/en/development/design-docs/.

netlify · 2026-05-03T12:34:32Z

✅ Deploy Preview for kubernetes-sigs-cloud-provide-azure ready!

Name	Link
🔨 Latest commit	`e91308a`
🔍 Latest deploy log	https://app.netlify.com/projects/kubernetes-sigs-cloud-provide-azure/deploys/69f740d6d1bb1c0008c1a292
😎 Deploy Preview	https://deploy-preview-10253--kubernetes-sigs-cloud-provide-azure.netlify.app

To edit notification comments on pull requests, go to your Netlify project configuration.

k8s-ci-robot · 2026-05-03T12:34:36Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nilo19

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [nilo19]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Liunardy · 2026-05-13T07:13:16Z

+
+The `removeOperation(serviceName)` method cannot remove operations already in a processing snapshot. To avoid stale retry behavior, the processing path must re-check service relevance before requeueing and before sending retry/failure events.
+
+Parked operations waiting for `nextEligibleAt` live in `updater.operations`, so `removeOperation(serviceName)` can remove them while they are parked. If removal races with the short snapshot-to-requeue window, the next tick's relevance check drops the stale operation quietly.


Could the window in which a removed service's operations are in flight (snapshotted, ARM call in progress) result in a successful CreateOrUpdate adding IPs for a service that was just deleted?

Will removeOperation() acquire the lock while the backend pool updater is updating the backend pool?
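To make the snapshot window in the quoted text concrete, here is a minimal sketch of the behavior it describes. It is not the updater's actual code: the `updater` struct, `snapshot`, `requeueIfRelevant`, and the `relevant` callback are hypothetical names, and the only behavior taken from the design text is that `removeOperation(serviceName)` cannot reach operations already in a processing snapshot, so relevance must be re-checked before requeueing.

```go
package main

import (
	"fmt"
	"sync"
)

// operation is a hypothetical stand-in for a queued backend pool operation.
type operation struct{ serviceName string }

// updater is a hypothetical, reduced model of the queue described in the design.
type updater struct {
	mu         sync.Mutex
	operations []operation
}

// snapshot takes all pending operations under the lock and empties the queue.
func (u *updater) snapshot() []operation {
	u.mu.Lock()
	defer u.mu.Unlock()
	ops := u.operations
	u.operations = nil
	return ops
}

// removeOperation only sees operations still in u.operations; it cannot
// touch a snapshot that a processing pass already holds.
func (u *updater) removeOperation(serviceName string) {
	u.mu.Lock()
	defer u.mu.Unlock()
	kept := u.operations[:0]
	for _, o := range u.operations {
		if o.serviceName != serviceName {
			kept = append(kept, o)
		}
	}
	u.operations = kept
}

// requeueIfRelevant puts failed operations back only when the service is
// still relevant, so a removal that raced with the snapshot is dropped quietly.
func (u *updater) requeueIfRelevant(failed []operation, relevant func(string) bool) {
	u.mu.Lock()
	defer u.mu.Unlock()
	for _, o := range failed {
		if relevant(o.serviceName) {
			u.operations = append(u.operations, o)
		}
	}
}

func main() {
	u := &updater{operations: []operation{{"svc-a"}, {"svc-b"}}}
	snap := u.snapshot()       // processing pass starts
	u.removeOperation("svc-a") // races with the pass; queue is already empty
	u.requeueIfRelevant(snap, func(s string) bool { return s != "svc-a" })
	fmt.Println(len(snap), len(u.operations))
}
```

The sketch only covers the requeue path. It does not address the question of an ARM call that is already in progress when the service is removed.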

Liunardy · 2026-05-13T07:18:16Z

+
+Updater-level retry uses the existing `LoadBalancerBackendPoolUpdateIntervalInSeconds` tick; this design does not add a separate sleep loop inside `process()`. Without `Retry-After`, default retry count `3` and default interval `30s` means a continuously failing retriable condition emits retrying events on the first three failed attempts and emits the final failed event on the fourth failed attempt, roughly 90 seconds after the first observed failure plus ARM call latency. Depending on where the first failure lands relative to the updater tick and how long each ARM call takes, the wall-clock time from the original EndpointSlice change to final failure can be close to or above two minutes.
+
+For 429 throttling, `Retry-After` overrides the next normal updater tick by setting `nextEligibleAt`. Ticks before `nextEligibleAt` only preserve the operation in the queue after re-checking Service/LB relevance; they do not call ARM, emit retrying events, or consume retry budget. `LoadBalancerBackendPoolUpdateRetryCount` bounds failed processing attempts, not elapsed wall-clock time, so a long `Retry-After` can delay final success or failure beyond the normal interval-based timing.


If long Retry-After values park operations while EndpointSlice events keep arriving, the in-memory queue can grow without bound. Is the queue likely to grow dangerously? Does the queue size need to be bounded?

Liunardy · 2026-05-13T07:18:57Z

+```go
+// LoadBalancerBackendPoolUpdateRetryCount is the number of retries for retriable
+// local-service backend-pool update failures. Defaults to 3.
+LoadBalancerBackendPoolUpdateRetryCount *int `json:"loadBalancerBackendPoolUpdateRetryCount,omitempty" yaml:"loadBalancerBackendPoolUpdateRetryCount,omitempty"`


Consider naming this LoadBalancerBackendPoolUpdateMaxRetries to make the "max retries after first failure" meaning clearer.

Liunardy · 2026-05-13T07:21:45Z

+
+## Metrics
+
+The updater metric should describe terminal outcomes, not intermediate retry attempts.


Would a separate retry counter (e.g., backend_pool_update_retries_total) be useful for monitoring Azure API instability, separate from the terminal outcome metric?

Liunardy · 2026-05-13T07:22:20Z

+1. ARM wire `429` from `Get` with a parseable `Retry-After` in `azcore.ResponseError.RawResponse` sets `nextEligibleAt`; ticks before that time requeue quietly without ARM calls, retry events, retry-count increments, or metrics.
+2. ARM wire `429` from `CreateOrUpdate` follows the same `Retry-After` and requeue behavior.


Should 429 with Retry-After also emit LoadBalancerBackendPoolUpdateRetrying?

I think so, as we requeue this. What is the concern here?

Liunardy · 2026-05-13T07:23:14Z

+5. ARM `409` or `412` from `CreateOrUpdate` requeues, emits `LoadBalancerBackendPoolUpdateRetrying`, then succeeds on the next tick after a fresh `Get`.
+6. If any operation in a `lbName/backendPoolName` group is waiting for `nextEligibleAt`, the whole group is preserved and no same-group operation is processed early.
+7. A fresh operation merged with a requeued operation keeps its own retry counter; on group failure, all operations in the group consume one retry.
+8. Retry budget exhaustion emits `LoadBalancerBackendPoolUpdateFailed` and leaves the queue empty.


It is worth adding a test case covering mixed-budget groups.
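A mixed-budget case could be exercised along the lines of the sketch below. The `operation` type and `onGroupFailure` are hypothetical. The only rules taken from the quoted items are that each operation keeps its own retry counter and that a group failure consumes one retry from every operation in the group. Splitting the group into a requeued part and an exhausted part is an assumption about how exhaustion would be handled.

```go
package main

import "fmt"

// operation is a hypothetical queued operation with its own retry counter.
type operation struct {
	service        string
	failedAttempts int
}

// onGroupFailure charges one failed attempt to every operation in the group,
// then separates operations that still have budget from exhausted ones.
func onGroupFailure(group []operation, retryCount int) (requeue, exhausted []operation) {
	for _, o := range group {
		o.failedAttempts++
		if o.failedAttempts > retryCount {
			exhausted = append(exhausted, o)
		} else {
			requeue = append(requeue, o)
		}
	}
	return requeue, exhausted
}

func main() {
	group := []operation{
		{service: "default/old", failedAttempts: 3}, // requeued three times already
		{service: "default/new", failedAttempts: 0}, // freshly merged into the group
	}
	requeue, exhausted := onGroupFailure(group, 3)
	fmt.Println(len(requeue), len(exhausted)) // 1 1
}
```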

Liunardy · 2026-05-13T07:24:27Z

+- Queue-preservation ticks while waiting for `nextEligibleAt` do not record a metric.
+- Success records one successful observation when `LoadBalancerBackendPoolUpdated` is emitted.
+- Terminal failure records one failed observation when `LoadBalancerBackendPoolUpdateFailed` is emitted.
+- Stale resource-not-found and stale Service/LB drops record no observation, matching the existing quiet-skip behavior.


Currently, a 404 from Get or CreateOrUpdate leaves isOperationSucceeded = false and the deferred ObserveOperationWithResult(false) still fires, recording a failure metric. This would be a behavior change.

Liunardy · 2026-05-13T07:25:09Z

+
+## Retry Timing
+
+Updater-level retry uses the existing `LoadBalancerBackendPoolUpdateIntervalInSeconds` tick; this design does not add a separate sleep loop inside `process()`. Without `Retry-After`, default retry count `3` and default interval `30s` means a continuously failing retriable condition emits retrying events on the first three failed attempts and emits the final failed event on the fourth failed attempt, roughly 90 seconds after the first observed failure plus ARM call latency. Depending on where the first failure lands relative to the updater tick and how long each ARM call takes, the wall-clock time from the original EndpointSlice change to final failure can be close to or above two minutes.


The main reconciliation path uses exponential backoff. The updater reuses the tick loop. Was using exponential backoff considered for the updater retry?

docs: add local service backend pool updater design

e91308a

k8s-ci-robot added the release-note-none (denotes a PR that doesn't merit a release note) and kind/design (categorizes issue or PR as related to design) labels on May 3, 2026

k8s-ci-robot requested review from MartinForReal and feiskyer on May 3, 2026 12:34

k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on May 3, 2026

github-actions Bot added the tide/merge-method-squash label (denotes a PR that should be squashed by tide when it merges) on May 3, 2026

k8s-ci-robot added the size/L (denotes a PR that changes 100-499 lines, ignoring generated files) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels on May 3, 2026

This was referenced May 6, 2026

loadBalancerBackendPoolUpdater silently discards backend pool update when EndpointSlice informer cache is stale at startup — no retry, backend pool stays empty permanently #10252

Open

Refine loadBalancerBackendPoolUpdater #10270

Open

Liunardy reviewed May 13, 2026


Liunardy mentioned this pull request May 21, 2026

doc: add local service backend pool updater retry design #10417

Open

nilo19 wants to merge 1 commit into kubernetes-sigs:documentation from nilo19:doc/local-service-backend-pool-updater-design


