docs: add local service backend pool updater design #10253

Open
nilo19 wants to merge 1 commit into kubernetes-sigs:documentation from nilo19:doc/local-service-backend-pool-updater-design

Conversation

nilo19 (Contributor) commented May 3, 2026

/kind design

What this PR does / why we need it:

Adds a design document for refining the local service backend pool updater retry and error handling behavior.

The design covers bounded updater-level retries, 429 throttling handling, 409/412 conflict retries, SDK retry boundaries, event semantics, queue merge behavior, shutdown behavior, metrics, and the expected unit test coverage.

Which issue(s) this PR fixes:

NONE

Special notes for your reviewer:

This is a design-only PR. It intentionally treats retryrepectthrottled.GetRetriableStatusCode() statuses as terminal at the updater layer because the Azure SDK retry policy already handles those statuses before the updater sees the error.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Adds a design doc under content/en/development/design-docs/.

k8s-ci-robot added the release-note-none and kind/design labels May 3, 2026
netlify Bot commented May 3, 2026

Deploy Preview for kubernetes-sigs-cloud-provide-azure ready!

Name Link
🔨 Latest commit e91308a
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-cloud-provide-azure/deploys/69f740d6d1bb1c0008c1a292
😎 Deploy Preview https://deploy-preview-10253--kubernetes-sigs-cloud-provide-azure.netlify.app

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nilo19

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the approved label May 3, 2026
github-actions Bot added the tide/merge-method-squash label May 3, 2026
k8s-ci-robot added the size/L and cncf-cla: yes labels May 3, 2026

The `removeOperation(serviceName)` method cannot remove operations already in a processing snapshot. To avoid stale retry behavior, the processing path must re-check service relevance before requeueing and before sending retry/failure events.

Parked operations waiting for `nextEligibleAt` live in `updater.operations`, so `removeOperation(serviceName)` can remove them while they are parked. If removal races with the short snapshot-to-requeue window, the next tick's relevance check drops the stale operation quietly.

Contributor

Could the window in which a removed service's operations are in flight (snapshotted, ARM call in progress) result in a successful `CreateOrUpdate` adding IPs for a service that was just deleted?

Contributor Author

Will `removeOperation()` acquire the lock while the backend pool updater is updating the backend pool?
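
As a reference point for this thread, here is a minimal sketch of the relevance re-check that the quoted design text describes. Every name except `removeOperation` is hypothetical (`updater`, `snapshot`, `requeueIfRelevant`, the `relevant` map), and the sketch assumes the lock is held only around queue edits, never across an ARM call. It shows only the requeue guard; it does not answer whether an in-flight `CreateOrUpdate` can still succeed for a service that was just deleted.

```go
package main

import (
	"fmt"
	"sync"
)

// operation is a hypothetical stand-in for a queued backend pool update.
type operation struct {
	serviceName string
	retries     int
}

// updater is a simplified, hypothetical model of the updater queue.
type updater struct {
	mu         sync.Mutex
	operations []operation
	relevant   map[string]bool // services that are still present and local
}

// removeOperation drops queued operations for a service and marks it stale.
// It cannot reach operations that were already handed out by snapshot.
func (u *updater) removeOperation(serviceName string) {
	u.mu.Lock()
	defer u.mu.Unlock()
	delete(u.relevant, serviceName)
	kept := u.operations[:0]
	for _, op := range u.operations {
		if op.serviceName != serviceName {
			kept = append(kept, op)
		}
	}
	u.operations = kept
}

// snapshot hands the current queue to the processing path and empties it.
func (u *updater) snapshot() []operation {
	u.mu.Lock()
	defer u.mu.Unlock()
	ops := u.operations
	u.operations = nil
	return ops
}

// requeueIfRelevant re-checks relevance under the lock before requeueing a
// failed operation. It returns false when the operation is dropped quietly.
func (u *updater) requeueIfRelevant(op operation) bool {
	u.mu.Lock()
	defer u.mu.Unlock()
	if !u.relevant[op.serviceName] {
		return false
	}
	op.retries++
	u.operations = append(u.operations, op)
	return true
}

func main() {
	u := &updater{
		operations: []operation{{serviceName: "default/a"}, {serviceName: "default/b"}},
		relevant:   map[string]bool{"default/a": true, "default/b": true},
	}
	inFlight := u.snapshot()
	u.removeOperation("default/a") // races with the in-flight snapshot
	for _, op := range inFlight {
		fmt.Println(op.serviceName, "requeued:", u.requeueIfRelevant(op))
	}
}
```

In this model the removal never blocks on an ARM call, so a stale write remains possible between the snapshot and the re-check; closing that gap would need a different mechanism than the one sketched here.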


Updater-level retry uses the existing `LoadBalancerBackendPoolUpdateIntervalInSeconds` tick; this design does not add a separate sleep loop inside `process()`. Without `Retry-After`, default retry count `3` and default interval `30s` mean a continuously failing retriable condition emits retrying events on the first three failed attempts and emits the final failed event on the fourth failed attempt, roughly 90 seconds after the first observed failure plus ARM call latency. Depending on where the first failure lands relative to the updater tick and how long each ARM call takes, the wall-clock time from the original EndpointSlice change to final failure can be close to or above two minutes.

For 429 throttling, `Retry-After` overrides the next normal updater tick by setting `nextEligibleAt`. Ticks before `nextEligibleAt` only preserve the operation in the queue after re-checking Service/LB relevance; they do not call ARM, emit retrying events, or consume retry budget. `LoadBalancerBackendPoolUpdateRetryCount` bounds failed processing attempts, not elapsed wall-clock time, so a long `Retry-After` can delay final success or failure beyond the normal interval-based timing.

Contributor

If long `Retry-After` values park operations while EndpointSlice events keep arriving, the in-memory queue could grow without bound. Is the queue likely to grow dangerously? Does the queue size need to be bounded?

```go
// LoadBalancerBackendPoolUpdateRetryCount is the number of retries for retriable
// local-service backend-pool update failures. Defaults to 3.
LoadBalancerBackendPoolUpdateRetryCount *int `json:"loadBalancerBackendPoolUpdateRetryCount,omitempty" yaml:"loadBalancerBackendPoolUpdateRetryCount,omitempty"`
```

Contributor

Consider `LoadBalancerBackendPoolUpdateMaxRetries` to make "max retries after first failure" clearer.


## Metrics

The updater metric should describe terminal outcomes, not intermediate retry attempts.

Contributor

Would a separate retry counter (e.g., backend_pool_update_retries_total) be useful for monitoring Azure API instability, separate from the terminal outcome metric?
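
A rough sketch of what the split could look like, using plain integer counters in place of a real metrics library. The `metrics` struct, the `outcome` type, and `record` are hypothetical; only the idea of a terminal-outcome metric plus a separate retry counter comes from this thread.

```go
package main

import "fmt"

// metrics is a hypothetical stand-in for the updater's metric set.
type metrics struct {
	succeeded int // terminal successes
	failed    int // terminal failures
	retries   int // intermediate retries (the proposed separate counter)
}

// outcome classifies the result of one processing attempt.
type outcome int

const (
	outcomeSuccess outcome = iota
	outcomeRetry
	outcomeTerminalFailure
)

// record keeps terminal outcomes and intermediate retries in separate
// counters, so retry noise never changes the terminal success rate.
func (m *metrics) record(o outcome) {
	switch o {
	case outcomeSuccess:
		m.succeeded++
	case outcomeTerminalFailure:
		m.failed++
	case outcomeRetry:
		m.retries++
	}
}

func main() {
	m := &metrics{}
	// Two retried attempts followed by a success on the third attempt.
	for _, o := range []outcome{outcomeRetry, outcomeRetry, outcomeSuccess} {
		m.record(o)
	}
	fmt.Printf("%+v\n", *m) // prints "{succeeded:1 failed:0 retries:2}"
}
```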

Comment on lines +212 to +213
1. ARM wire `429` from `Get` with a parseable `Retry-After` in `azcore.ResponseError.RawResponse` sets `nextEligibleAt`; ticks before that time requeue quietly without ARM calls, retry events, retry-count increments, or metrics.
2. ARM wire `429` from `CreateOrUpdate` follows the same `Retry-After` and requeue behavior.

Contributor

Should 429 with Retry-After also emit LoadBalancerBackendPoolUpdateRetrying?

Contributor Author

I think so, as we requeue this. What is the concern here?

5. ARM `409` or `412` from `CreateOrUpdate` requeues, emits `LoadBalancerBackendPoolUpdateRetrying`, then succeeds on the next tick after a fresh `Get`.
6. If any operation in a `lbName/backendPoolName` group is waiting for `nextEligibleAt`, the whole group is preserved and no same-group operation is processed early.
7. A fresh operation merged with a requeued operation keeps its own retry counter; on group failure, all operations in the group consume one retry.
8. Retry budget exhaustion emits `LoadBalancerBackendPoolUpdateFailed` and leaves the queue empty.

Contributor

Worth adding a test case covering mixed-budget groups.
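
One possible shape for such a test case, sketched under the assumption that each operation carries its own count of failed attempts and that a group failure charges every operation once. `groupOp` and `chargeGroupFailure` are hypothetical names, and the design may resolve mixed budgets differently.

```go
package main

import "fmt"

// groupOp is a hypothetical operation inside one lbName/backendPoolName group.
type groupOp struct {
	service string
	retries int // retries already consumed by this operation
}

// chargeGroupFailure applies one failed attempt to every operation in the
// group. Operations with budget left consume one retry and are requeued;
// operations whose budget is already spent fail terminally.
func chargeGroupFailure(ops []groupOp, maxRetries int) (requeued, failed []groupOp) {
	for _, op := range ops {
		if op.retries >= maxRetries {
			failed = append(failed, op)
			continue
		}
		op.retries++
		requeued = append(requeued, op)
	}
	return requeued, failed
}

func main() {
	// "default/old" has used its whole budget; "default/fresh" was just merged in.
	group := []groupOp{{"default/old", 3}, {"default/fresh", 0}}
	requeued, failed := chargeGroupFailure(group, 3)
	fmt.Println(requeued, failed) // prints "[{default/fresh 1}] [{default/old 3}]"
}
```

Under these assumptions the queue is not empty after the exhausted operation fails, which is the interaction with case 8 that a mixed-budget test would pin down.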

- Queue-preservation ticks while waiting for `nextEligibleAt` do not record a metric.
- Success records one successful observation when `LoadBalancerBackendPoolUpdated` is emitted.
- Terminal failure records one failed observation when `LoadBalancerBackendPoolUpdateFailed` is emitted.
- Stale resource-not-found and stale Service/LB drops record no observation, matching the existing quiet-skip behavior.

Contributor

Currently, a 404 from Get or CreateOrUpdate leaves isOperationSucceeded = false and the deferred ObserveOperationWithResult(false) still fires, recording a failure metric. This would be a behavior change.


## Retry Timing

Updater-level retry uses the existing `LoadBalancerBackendPoolUpdateIntervalInSeconds` tick; this design does not add a separate sleep loop inside `process()`. Without `Retry-After`, default retry count `3` and default interval `30s` mean a continuously failing retriable condition emits retrying events on the first three failed attempts and emits the final failed event on the fourth failed attempt, roughly 90 seconds after the first observed failure plus ARM call latency. Depending on where the first failure lands relative to the updater tick and how long each ARM call takes, the wall-clock time from the original EndpointSlice change to final failure can be close to or above two minutes.

Contributor

The main reconciliation path uses exponential backoff. The updater reuses the tick loop. Was using exponential backoff considered for the updater retry?

Labels

- `approved`: Indicates a PR has been approved by an approver from all required OWNERS files.
- `cncf-cla: yes`: Indicates the PR's author has signed the CNCF CLA.
- `kind/design`: Categorizes issue or PR as related to design.
- `release-note-none`: Denotes a PR that doesn't merit a release note.
- `size/L`: Denotes a PR that changes 100-499 lines, ignoring generated files.
- `tide/merge-method-squash`: Denotes a PR that should be squashed by tide when it merges.
