fix(multi-slb): serialize backendPoolUpdater with service reconcile loop by Liunardy · Pull Request #10328 · kubernetes-sigs/cloud-provider-azure

Liunardy · 2026-05-12T08:09:27Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

backendPoolUpdater.process() performs backend pool updates without holding serviceReconcileLock or azureResourceLocker, which allows concurrent writes to the same backend pool from the updater and the main service reconciliation path.

This PR serializes the updater with the main reconciliation loop by acquiring serviceReconcileLock and azureResourceLocker in process() before ARM calls. The updater.lock is released before acquiring serviceReconcileLock and azureResourceLocker to avoid deadlock with the main path, which holds serviceReconcileLock and calls addOperation() (which acquires the updater.lock).

If the distributed lease lock cannot be acquired, operations are re-queued for retry on the next tick.
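
The lock ordering described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `updater` type, its fields, and `reconcileService` are hypothetical stand-ins, and a plain `sync.Mutex` stands in for serviceReconcileLock. The point it shows is that process() releases the updater's own lock before taking the reconcile lock, so it never waits for the reconcile lock while holding the lock that addOperation() needs.

```go
package main

import (
	"fmt"
	"sync"
)

// updater is a hypothetical stand-in for the backend pool updater.
type updater struct {
	lock      sync.Mutex  // guards ops (plays the role of updater.lock)
	ops       []string    // queued operations
	reconcile *sync.Mutex // plays the role of serviceReconcileLock
	applied   []string    // operations "sent to ARM"; guarded by reconcile
}

// addOperation is called by the main path while it holds the reconcile lock.
func (u *updater) addOperation(op string) {
	u.lock.Lock()
	defer u.lock.Unlock()
	u.ops = append(u.ops, op)
}

// process drains the queue under u.lock, releases u.lock, and only then
// takes the reconcile lock. It never holds u.lock while waiting for reconcile.
func (u *updater) process() {
	u.lock.Lock()
	pending := u.ops
	u.ops = nil
	u.lock.Unlock()

	u.reconcile.Lock()
	defer u.reconcile.Unlock()
	u.applied = append(u.applied, pending...)
}

// reconcileService simulates the main path: reconcile lock first, then u.lock.
func reconcileService(u *updater, op string) {
	u.reconcile.Lock()
	defer u.reconcile.Unlock()
	u.addOperation(op)
}

// runDemo runs both paths concurrently and returns every applied operation.
func runDemo() []string {
	u := &updater{reconcile: &sync.Mutex{}}
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(2)
		go func(i int) {
			defer wg.Done()
			reconcileService(u, fmt.Sprintf("op-%d", i))
		}(i)
		go func() {
			defer wg.Done()
			u.process()
		}()
	}
	wg.Wait()
	u.process() // flush anything queued after the last concurrent drain
	return u.applied
}

func main() {
	fmt.Println(len(runDemo())) // prints 50: every queued operation is applied exactly once
}
```

If process() instead held `u.lock` while waiting for the reconcile lock, a concurrent `reconcileService` call could block in addOperation() while holding the reconcile lock, and neither side would make progress.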

Which issue(s) this PR fixes:

Fixes #9839

Special notes for your reviewer:

Does this PR introduce a user-facing change?

fix: serialize backendPoolUpdater with service reconciliation to prevent concurrent backend pool writes in multi-standard-load-balancer configurations with externalTrafficPolicy: Local

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2026-05-12T08:09:37Z

Hi @Liunardy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

nilo19 · 2026-05-14T01:07:50Z

+	// Serialize across multiple CCM replicas in HA deployments.
+	if updater.az.azureResourceLocker != nil {
+		if err := updater.az.azureResourceLocker.Lock(ctx); err != nil {
+			logger.Error(fmt.Errorf(


No need to re-log the error, just use a V(2) message to log the requeue behavior.

Removed redundant error log. Operations are now preserved (not requeued) when lock fails, so no log needed. Also removed the unlock error log.

nilo19 · 2026-05-14T01:23:26Z

+	// The reverse order would deadlock because the main reconcile path
+	// holds serviceReconcileLock and calls addOperation, which acquires
+	// updater.lock.
+	groups := updater.drainOperations()


This design seems wrong.

removeOperation calls from the main loop may be useless after drainOperations.

While the backend pool updater is waiting for the lock, the service can change, e.g., move from one LB to another.

Good catch! Restructured process() to acquire serviceReconcileLock before draining. This fixes both issues:

removeOperation now works because the main reconcile loop holds serviceReconcileLock, which blocks process() before it can drain. Operations are still in the queue when removeOperation is called.

groupOperations reads localServiceNameToServiceInfoMap under serviceReconcileLock, so it sees the latest LB assignment and filters operations targeting the old LB.

Added tests for both cases: TestLoadBalancerBackendPoolUpdaterFiltersOperationsWhenLBChangedDuringProcess and TestLoadBalancerBackendPoolUpdaterRemoveOperationCancelsOperationsBeforeDrain.
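
A rough sketch of the restructured flow, under stated assumptions: the `operation` and `updater` types, the `serviceToLB` map (standing in for localServiceNameToServiceInfoMap), and `demo` are all hypothetical and much simpler than the real code. It illustrates the two properties described above: an operation removed by the main path before the drain is never processed, and an operation queued against an LB the service no longer uses is filtered out.

```go
package main

import (
	"fmt"
	"sync"
)

// operation is a hypothetical queued backend pool change.
type operation struct {
	service string
	lb      string // the LB the operation was queued against
}

// updater is a hypothetical stand-in for the backend pool updater.
type updater struct {
	reconcileLock sync.Mutex        // plays the role of serviceReconcileLock
	lock          sync.Mutex        // plays the role of updater.lock
	ops           []operation       // guarded by lock
	serviceToLB   map[string]string // current LB per service; guarded by reconcileLock
}

func (u *updater) addOperation(op operation) {
	u.lock.Lock()
	defer u.lock.Unlock()
	u.ops = append(u.ops, op)
}

func (u *updater) removeOperation(service string) {
	u.lock.Lock()
	defer u.lock.Unlock()
	kept := u.ops[:0]
	for _, op := range u.ops {
		if op.service != service {
			kept = append(kept, op)
		}
	}
	u.ops = kept
}

// process takes the reconcile lock BEFORE draining, then keeps only the
// operations whose target LB still matches the service's current LB.
func (u *updater) process() []operation {
	u.reconcileLock.Lock()
	defer u.reconcileLock.Unlock()

	u.lock.Lock()
	pending := u.ops
	u.ops = nil
	u.lock.Unlock()

	var valid []operation
	for _, op := range pending {
		if u.serviceToLB[op.service] == op.lb {
			valid = append(valid, op)
		}
	}
	return valid
}

// demo queues three operations, lets the "main path" change state, then drains.
func demo() []operation {
	u := &updater{serviceToLB: map[string]string{
		"svc-a": "lb1", "svc-b": "lb1", "svc-c": "lb1",
	}}
	u.addOperation(operation{service: "svc-a", lb: "lb1"})
	u.addOperation(operation{service: "svc-b", lb: "lb1"})
	u.addOperation(operation{service: "svc-c", lb: "lb1"})

	// Main reconcile path: holds the reconcile lock, so process() cannot drain yet.
	u.reconcileLock.Lock()
	u.serviceToLB["svc-a"] = "lb2" // svc-a moves to another LB
	u.removeOperation("svc-b")     // svc-b's operation is cancelled before the drain
	u.reconcileLock.Unlock()

	return u.process()
}

func main() {
	fmt.Println(demo()) // prints [{svc-c lb1}]
}
```

Both paths take the reconcile lock first and the updater lock second, so the ordering stays consistent between them.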

nilo19 · 2026-05-14T01:26:34Z

+				"loadBalancerBackendPoolUpdater.process",
+				err,
+			), "Re-queuing operations for retry")
+			updater.requeueOperations(groups)


This can probably be fixed in a later refinement PR, but I want to call these out:

Unbounded retry. If another component is holding the lock, we will requeue at each tick and never stop doing that. Do we need a retry policy?

For the next tick where the requeued operation is processed again, is it still a valid operation? Do we need to guard before processing?

Operations are now preserved in the queue when lock fails (they were never drained). On the next tick, groupOperations validates each operation against the current localServiceNameToServiceInfoMap under serviceReconcileLock, so stale operations should be filtered out.
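
The "preserve instead of requeue" behavior can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `leaseLocker` interface (only a `Lock(ctx)` call returning an error appears in the diff above; the `Unlock` signature is guessed), the `flakyLocker` fake, and the simplified `updater`. The idea is that when the lease lock cannot be acquired, process() returns before draining, so the queue is untouched and the next tick sees the same operations.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// leaseLocker is a hypothetical interface for a distributed lease lock.
type leaseLocker interface {
	Lock(ctx context.Context) error
	Unlock(ctx context.Context) error
}

// flakyLocker fails the first `failures` Lock calls, then succeeds.
type flakyLocker struct{ failures int }

func (l *flakyLocker) Lock(context.Context) error {
	if l.failures > 0 {
		l.failures--
		return errors.New("lease held by another replica")
	}
	return nil
}

func (l *flakyLocker) Unlock(context.Context) error { return nil }

// updater is a hypothetical, single-goroutine stand-in for the real updater.
type updater struct {
	locker leaseLocker
	queue  []string
}

// process returns the operations applied on this tick. If the lease lock
// cannot be acquired it returns early WITHOUT draining the queue.
func (u *updater) process(ctx context.Context) []string {
	if u.locker != nil {
		if err := u.locker.Lock(ctx); err != nil {
			return nil // queue left as-is for the next tick
		}
		defer func() { _ = u.locker.Unlock(ctx) }()
	}
	applied := u.queue
	u.queue = nil
	return applied
}

// runTicks runs two ticks and reports (applied, still queued) for each.
func runTicks() (int, int, int, int) {
	u := &updater{
		locker: &flakyLocker{failures: 1},
		queue:  []string{"add-node-1", "remove-node-2"},
	}
	ctx := context.Background()
	first := u.process(ctx)
	queuedAfterFirst := len(u.queue)
	second := u.process(ctx)
	return len(first), queuedAfterFirst, len(second), len(u.queue)
}

func main() {
	fmt.Println(runTicks()) // prints 0 2 2 0
}
```

This sketch does not bound the number of retries; the retry-policy question raised in the review is left open here as well.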

nilo19 · 2026-05-14T01:30:43Z

+				"lb1:pool1": {addPool1},
+			},
+		},
+	}


Please also cover negative cases, such as failing to acquire the Azure resource lock.

Added the azureResourceLocker unlock-failure case, TestLoadBalancerBackendPoolUpdaterCompletesOnUnlockFailure. The azureResourceLocker lock-failure case is now TestLoadBalancerBackendPoolUpdaterPreservesOperationsOnLeaseLockFailure.

anndono · 2026-05-14T03:21:50Z

Should ObserveOperationWithResult be called at the end of each iteration rather than deferred? Since defer executes at process function exit, the recorded latency for earlier iterations will include the processing time of all subsequent operations.

Good catch! I was only checking the isOperationSucceeded value and didn't consider when the defer executes. Removed the defer; ObserveOperationWithResult is now called directly at the end of each iteration.
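
The effect of deferring inside a loop can be seen in a small, self-contained sketch. The names `processDeferred` and `processDirect` are hypothetical, and an integer counter stands in for the wall clock so the result is deterministic; this is not the metrics code from the PR. Deferred calls run only when the enclosing function returns, so every deferred observation sees the final clock value.

```go
package main

import "fmt"

// processDeferred records each iteration's latency in a deferred call.
// The defers run (last in, first out) only when the enclosing function
// returns, so earlier iterations are charged for all later work too.
func processDeferred(durations []int) []int {
	clock := 0 // fake clock, advanced by each iteration's duration
	var recorded []int
	func() {
		for _, d := range durations {
			start := clock
			defer func() { recorded = append(recorded, clock-start) }()
			clock += d // the "work" of this iteration
		}
	}()
	return recorded
}

// processDirect records the latency at the end of each iteration.
func processDirect(durations []int) []int {
	clock := 0
	var recorded []int
	for _, d := range durations {
		start := clock
		clock += d
		recorded = append(recorded, clock-start)
	}
	return recorded
}

func main() {
	work := []int{10, 10, 10}
	fmt.Println(processDeferred(work)) // prints [10 20 30]
	fmt.Println(processDirect(work))   // prints [10 10 10]
}
```

In the deferred version only the last iteration gets its true latency; the first one is reported as the total for the whole batch.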

k8s-ci-robot · 2026-05-14T11:31:57Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Liunardy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [Liunardy]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

nilo19 · 2026-05-15T13:47:00Z

+// Must be called under serviceReconcileLock so that
+// localServiceNameToServiceInfoMap reads are consistent.
+func (updater *loadBalancerBackendPoolUpdater) groupOperations(ops []batchOperation) map[string][]batchOperation {
+	logger := log.Background().WithName("loadBalancerBackendPoolUpdater.groupOperations")


This should be log.FromContextOrBackground(ctx).

nilo19 · 2026-05-15T13:51:22Z

/retest

k8s-ci-robot · 2026-05-18T04:23:54Z

@Liunardy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cloud-provider-azure-e2e-capz | `4da465e` | link | true | `/test pull-cloud-provider-azure-e2e-capz` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

nilo19 · 2026-05-19T01:46:27Z

/lgtm

Liunardy · 2026-05-19T01:56:30Z

/cherrypick release-1.33
/cherrypick release-1.34
/cherrypick release-1.35
/cherrypick release-1.36

k8s-infra-cherrypick-robot · 2026-05-19T01:57:22Z

@Liunardy: new pull request created: #10389

Details

In response to this:

/cherrypick release-1.33
/cherrypick release-1.34
/cherrypick release-1.35
/cherrypick release-1.36

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-infra-cherrypick-robot · 2026-05-19T01:57:59Z

@Liunardy: new pull request created: #10390

k8s-infra-cherrypick-robot · 2026-05-19T01:58:36Z

@Liunardy: new pull request created: #10391

k8s-infra-cherrypick-robot · 2026-05-19T01:59:12Z

@Liunardy: new pull request created: #10392

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels May 12, 2026

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 12, 2026

github-actions Bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label May 12, 2026

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 12, 2026

k8s-ci-robot requested review from andyzhangx and bridgetkromhout May 12, 2026 08:09

nilo19 reviewed May 14, 2026

View reviewed changes

anndono reviewed May 14, 2026

View reviewed changes

fix(multi-slb): serialize backendPoolUpdater with service reconciliat…

4921809

…ion loop

Liunardy force-pushed the liunardy/multi-slb-bp-update-race branch from c04ca7c to a3c0216 Compare May 14, 2026 11:31

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 14, 2026

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 14, 2026

nilo19 reviewed May 15, 2026

View reviewed changes

fix(multi-slb): address comments

4da465e

Liunardy force-pushed the liunardy/multi-slb-bp-update-race branch from a3c0216 to 4da465e Compare May 18, 2026 03:16

nilo19 removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 19, 2026

k8s-ci-robot assigned nilo19 May 19, 2026

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026

nilo19 merged commit a3c548d into kubernetes-sigs:master May 19, 2026
22 of 24 checks passed

k8s-infra-cherrypick-robot mentioned this pull request May 19, 2026

[release-1.33] fix(multi-slb): serialize backendPoolUpdater with service reconcile loop #10389

Merged

k8s-infra-cherrypick-robot mentioned this pull request May 19, 2026

[release-1.34] fix(multi-slb): serialize backendPoolUpdater with service reconcile loop #10390

Merged

k8s-infra-cherrypick-robot mentioned this pull request May 19, 2026

[release-1.35] fix(multi-slb): serialize backendPoolUpdater with service reconcile loop #10391

Merged

k8s-infra-cherrypick-robot mentioned this pull request May 19, 2026

[release-1.36] fix(multi-slb): serialize backendPoolUpdater with service reconcile loop #10392

Merged

5 participants