fix(multi-slb): serialize backendPoolUpdater with service reconcile loop #10328

Merged
nilo19 merged 2 commits into kubernetes-sigs:master from Liunardy:liunardy/multi-slb-bp-update-race on May 19, 2026

@Liunardy
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

The backendPoolUpdater.process() performs backend pool updates without holding serviceReconcileLock or azureResourceLocker, allowing concurrent writes to the same backend pool from the updater and the main service reconciliation path.

This PR serializes the updater with the main reconciliation loop by acquiring serviceReconcileLock and azureResourceLocker in process() before ARM calls. The updater.lock is released before acquiring serviceReconcileLock and azureResourceLocker to avoid deadlock with the main path, which holds serviceReconcileLock and calls addOperation() (which acquires the updater.lock).

If the distributed lease lock cannot be acquired, operations are re-queued for retry on the next tick.
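The lock ordering in this description can be sketched as follows. This is a minimal illustration, not the provider's code: the `updater` type, its fields, `newUpdater`, `drain`, and `reconcile` are hypothetical stand-ins, and a plain `sync.Mutex` plays the role of serviceReconcileLock. The point it shows is that process() never holds updater.lock while waiting for the reconcile lock, so the main path (reconcile lock first, then addOperation) cannot deadlock against it.

```go
package main

import (
	"fmt"
	"sync"
)

// updater is a hypothetical stand-in for the backend pool updater.
type updater struct {
	reconcileLock sync.Mutex // stand-in for serviceReconcileLock
	lock          sync.Mutex // stand-in for updater.lock; guards operations
	operations    []string
	applied       []string // written only while reconcileLock is held
}

func newUpdater() *updater { return &updater{} }

// addOperation is called by the main path while it holds reconcileLock.
func (u *updater) addOperation(op string) {
	u.lock.Lock()
	defer u.lock.Unlock()
	u.operations = append(u.operations, op)
}

// drain takes the queued operations and releases u.lock before returning.
func (u *updater) drain() []string {
	u.lock.Lock()
	defer u.lock.Unlock()
	ops := u.operations
	u.operations = nil
	return ops
}

// process drains first, then takes reconcileLock. It never holds u.lock
// while blocked on reconcileLock, which is what avoids the deadlock.
func (u *updater) process() {
	ops := u.drain()
	u.reconcileLock.Lock()
	defer u.reconcileLock.Unlock()
	u.applied = append(u.applied, ops...) // stands in for the ARM calls
}

// reconcile models the main path: reconcileLock, then addOperation.
func reconcile(u *updater, op string) {
	u.reconcileLock.Lock()
	defer u.reconcileLock.Unlock()
	u.addOperation(op)
}

func main() {
	u := newUpdater()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func(i int) { defer wg.Done(); reconcile(u, fmt.Sprintf("op-%d", i)) }(i)
		go func() { defer wg.Done(); u.process() }()
	}
	wg.Wait()
	u.process() // flush anything queued after the last concurrent process()
	fmt.Println("applied operations:", len(u.applied))
}
```

If process() instead held `u.lock` while waiting for `reconcileLock`, a concurrent `reconcile` call could block in `addOperation` while holding `reconcileLock`, and neither side would make progress.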

Which issue(s) this PR fixes:

Fixes #9839

Special notes for your reviewer:

Does this PR introduce a user-facing change?

fix: serialize backendPoolUpdater with service reconciliation to prevent concurrent backend pool writes in multi-standard-load-balancer configurations with externalTrafficPolicy: Local

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels May 12, 2026
@k8s-ci-robot
Contributor

Hi @Liunardy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 12, 2026
@github-actions github-actions Bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label May 12, 2026
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 12, 2026
Comment thread pkg/provider/azure_local_services.go Outdated
// Serialize across multiple CCM replicas in HA deployments.
if updater.az.azureResourceLocker != nil {
	if err := updater.az.azureResourceLocker.Lock(ctx); err != nil {
		logger.Error(fmt.Errorf(
Contributor

No need to re-log the error, just use a V(2) message to log the requeue behavior.

Contributor Author

Removed the redundant error log. Operations are now preserved (not requeued) when the lock fails, so no log is needed. Also removed the unlock error log.

Comment thread pkg/provider/azure_local_services.go Outdated
// The reverse order would deadlock because the main reconcile path
// holds serviceReconcileLock and calls addOperation, which acquires
// updater.lock.
groups := updater.drainOperations()
Contributor

This design seems wrong.

  1. removeOperation calls from the main loop may be useless after drainOperations.
  2. While the backend pool updater is waiting for the lock, the service can change, e.g., move from one LB to another.

Contributor Author

Good catch! Restructured process() to acquire serviceReconcileLock before draining. This fixes both issues:

  1. removeOperation now works because the main reconcile loop holds serviceReconcileLock, which blocks process() before it can drain. Operations are still in the queue when removeOperation is called.
  2. groupOperations reads localServiceNameToServiceInfoMap under serviceReconcileLock, so it sees the latest LB assignment and filters operations targeting the old LB.

Added tests for both cases: TestLoadBalancerBackendPoolUpdaterFiltersOperationsWhenLBChangedDuringProcess and TestLoadBalancerBackendPoolUpdaterRemoveOperationCancelsOperationsBeforeDrain.
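A rough sketch of the restructured flow, under stated assumptions: the types below are hypothetical, `serviceToLB` stands in for localServiceNameToServiceInfoMap, and the filtering loop stands in for groupOperations. It only illustrates the two properties described above: operations stay in the queue until process() holds the reconcile lock, and operations that target a load balancer the service no longer uses are dropped.

```go
package main

import (
	"fmt"
	"sync"
)

// operation is a hypothetical queued backend pool update.
type operation struct {
	service string
	lb      string // load balancer the operation was queued for
}

type updater struct {
	reconcileLock sync.Mutex        // stand-in for serviceReconcileLock
	lock          sync.Mutex        // guards operations
	operations    []operation
	serviceToLB   map[string]string // accessed only under reconcileLock
}

func newUpdater(serviceToLB map[string]string) *updater {
	return &updater{serviceToLB: serviceToLB}
}

func (u *updater) addOperation(op operation) {
	u.lock.Lock()
	defer u.lock.Unlock()
	u.operations = append(u.operations, op)
}

// removeOperation drops every queued operation for the given service.
func (u *updater) removeOperation(service string) {
	u.lock.Lock()
	defer u.lock.Unlock()
	kept := u.operations[:0]
	for _, op := range u.operations {
		if op.service != service {
			kept = append(kept, op)
		}
	}
	u.operations = kept
}

// process takes reconcileLock before draining, then keeps only the
// operations whose target still matches the service's current LB.
func (u *updater) process() []operation {
	u.reconcileLock.Lock()
	defer u.reconcileLock.Unlock()

	u.lock.Lock()
	ops := u.operations
	u.operations = nil
	u.lock.Unlock()

	var valid []operation
	for _, op := range ops {
		if u.serviceToLB[op.service] == op.lb {
			valid = append(valid, op)
		}
	}
	return valid
}

func main() {
	u := newUpdater(map[string]string{"svc-a": "lb1", "svc-b": "lb1", "svc-c": "lb1"})
	for _, svc := range []string{"svc-a", "svc-b", "svc-c"} {
		u.addOperation(operation{service: svc, lb: "lb1"})
	}

	// Main reconcile path: while it holds reconcileLock, process() cannot drain.
	u.reconcileLock.Lock()
	u.serviceToLB["svc-a"] = "lb2" // svc-a moved to another LB
	u.removeOperation("svc-b")     // still effective: nothing was drained yet
	u.reconcileLock.Unlock()

	fmt.Println(u.process()) // only the svc-c operation survives
}
```

Both paths take the reconcile lock before the queue lock, so the ordering is consistent and the earlier deadlock concern does not reappear in this sketch.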

Comment thread pkg/provider/azure_local_services.go Outdated
"loadBalancerBackendPoolUpdater.process",
err,
), "Re-queuing operations for retry")
updater.requeueOperations(groups)
Contributor

This can maybe be fixed in a later refinement PR, but I need to call these out:

  1. Unbounded retry. If another component is holding the lock, we will requeue at each tick and never stop. Do we need a retry policy?
  2. On the next tick, when the requeued operation is processed again, is it still a valid operation? Do we need a guard before processing?

Contributor Author

Operations are now preserved in the queue when the lock fails (they are never drained). On the next tick, groupOperations validates each operation against the current localServiceNameToServiceInfoMap under serviceReconcileLock, so stale operations should be filtered out.
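To make the "preserved, not requeued" behavior concrete, here is a small sketch. The `leaseLocker` interface and `fakeLocker` are hypothetical simplifications (the real azureResourceLocker takes a context, as the diff above shows), and the validation step is omitted. It only shows that a failed lock attempt returns before the queue is touched, so the same operations are picked up on the next tick.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// leaseLocker is a hypothetical, simplified stand-in for azureResourceLocker.
type leaseLocker interface {
	Lock() error
	Unlock() error
}

// fakeLocker fails the first `failures` Lock calls, then succeeds.
type fakeLocker struct{ failures int }

func (f *fakeLocker) Lock() error {
	if f.failures > 0 {
		f.failures--
		return errors.New("lease is held by another replica")
	}
	return nil
}

func (f *fakeLocker) Unlock() error { return nil }

type updater struct {
	lock       sync.Mutex // guards operations
	operations []string
	locker     leaseLocker
}

// process returns the operations it handled in this tick. If the lease
// cannot be acquired it returns early and leaves the queue untouched.
func (u *updater) process() []string {
	if u.locker != nil {
		if err := u.locker.Lock(); err != nil {
			return nil // nothing was drained, so nothing needs requeuing
		}
		defer func() { _ = u.locker.Unlock() }()
	}

	u.lock.Lock()
	ops := u.operations
	u.operations = nil
	u.lock.Unlock()
	return ops
}

func main() {
	u := &updater{operations: []string{"add-ip-1"}, locker: &fakeLocker{failures: 1}}

	first := u.process()
	fmt.Println("tick 1 handled:", len(first), "still queued:", len(u.operations))

	second := u.process()
	fmt.Println("tick 2 handled:", len(second), "still queued:", len(u.operations))
}
```

The retry is still unbounded in this sketch: every tick simply tries the lock again.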

"lb1:pool1": {addPool1},
},
},
}
Contributor

Please also cover negative cases, such as failing to acquire the Azure resource lock.

Contributor Author

Added the azureResourceLocker unlock-failure case, TestLoadBalancerBackendPoolUpdaterCompletesOnUnlockFailure. The azureResourceLocker lock-failure case is now TestLoadBalancerBackendPoolUpdaterPreservesOperationsOnLeaseLockFailure.

Comment thread pkg/provider/azure_local_services.go Outdated
Contributor

Should ObserveOperationWithResult be called at the end of each iteration rather than deferred? Since defer executes at process function exit, the recorded latency for earlier iterations will include the processing time of all subsequent operations.

Contributor Author

Good catch! I was only checking isOperationSucceeded value and didn't check when the defer executes. Removed the defer and call ObserveOperationWithResult directly instead.
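The effect the reviewer describes can be reproduced with a deterministic sketch. Everything here is hypothetical: `fakeClock` replaces wall-clock time with abstract units, and the two functions stand in for a loop that records per-iteration latency, first with `defer` and then with a direct call at the end of each iteration.

```go
package main

import "fmt"

// fakeClock is a deterministic stand-in for wall-clock time, in abstract units.
type fakeClock struct{ now int }

// work simulates a call that takes the given number of time units.
func (c *fakeClock) work(units int) { c.now += units }

// observeDeferred records latency with defer inside the loop. Deferred calls
// run only when the enclosing function returns, so earlier iterations also
// include the time spent in all later ones.
func observeDeferred(c *fakeClock, costs []int) []int {
	latencies := make([]int, len(costs))
	func() { // stands in for the function that owns the loop
		for i, cost := range costs {
			start := c.now
			defer func(i, start int) { latencies[i] = c.now - start }(i, start)
			c.work(cost)
		}
	}()
	return latencies
}

// observeDirect records latency at the end of each iteration instead.
func observeDirect(c *fakeClock, costs []int) []int {
	latencies := make([]int, len(costs))
	for i, cost := range costs {
		start := c.now
		c.work(cost)
		latencies[i] = c.now - start
	}
	return latencies
}

func main() {
	costs := []int{1, 2, 3}
	fmt.Println("deferred:", observeDeferred(&fakeClock{}, costs)) // [6 5 3]
	fmt.Println("direct:  ", observeDirect(&fakeClock{}, costs))   // [1 2 3]
}
```

With `defer`, the first iteration is charged all six units even though it only did one unit of work; the direct call charges each iteration its own cost.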

@Liunardy Liunardy force-pushed the liunardy/multi-slb-bp-update-race branch from c04ca7c to a3c0216 Compare May 14, 2026 11:31
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 14, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Liunardy

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 14, 2026
Comment thread pkg/provider/azure_local_services.go Outdated
// Must be called under serviceReconcileLock so that
// localServiceNameToServiceInfoMap reads are consistent.
func (updater *loadBalancerBackendPoolUpdater) groupOperations(ops []batchOperation) map[string][]batchOperation {
	logger := log.Background().WithName("loadBalancerBackendPoolUpdater.groupOperations")
Contributor

should be log.FromContextOrBackground(ctx)

Contributor Author

Updated.

@nilo19
Contributor

nilo19 commented May 15, 2026

/retest

@Liunardy Liunardy force-pushed the liunardy/multi-slb-bp-update-race branch from a3c0216 to 4da465e Compare May 18, 2026 03:16
@k8s-ci-robot
Contributor

@Liunardy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cloud-provider-azure-e2e-capz | 4da465e | link | true | /test pull-cloud-provider-azure-e2e-capz |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@nilo19 nilo19 removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 19, 2026
@nilo19
Contributor

nilo19 commented May 19, 2026

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026
@nilo19 nilo19 merged commit a3c548d into kubernetes-sigs:master May 19, 2026
22 of 24 checks passed
@Liunardy
Contributor Author

/cherrypick release-1.33
/cherrypick release-1.34
/cherrypick release-1.35
/cherrypick release-1.36

@k8s-infra-cherrypick-robot

@Liunardy: new pull request created: #10389

@k8s-infra-cherrypick-robot

@Liunardy: new pull request created: #10390

@k8s-infra-cherrypick-robot

@Liunardy: new pull request created: #10391

@k8s-infra-cherrypick-robot

@Liunardy: new pull request created: #10392

Development

Successfully merging this pull request may close these issues.

Potential race: backendPoolUpdater concurrent with Service LB reconciliation for local services

5 participants