
Add support for zone-redundant load balancers #5944

Open
bryan-cox wants to merge 4 commits into kubernetes-sigs:main from bryan-cox:issue-5709-lb-zone-redundancy

@bryan-cox
Contributor

@bryan-cox bryan-cox commented Oct 24, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR implements support for configuring availability zones on Azure load balancers to enable zone-redundant configurations for high availability.

Azure load balancers can be configured as zone-redundant to ensure high availability across multiple availability zones within a region. This feature allows users to specify availability zones (1, 2, 3) on load balancers, which are then set on the frontend IP configurations.

Key changes:

  • Added AvailabilityZones field to LoadBalancerSpec API
  • Implemented service layer to set zones on frontend IP configurations
  • Added webhook validation to enforce Azure's zone immutability requirement
  • Integrated zone-redundant LB verification into the existing private cluster E2E test
  • Included comprehensive documentation with examples and migration guidance
  • Added unit tests
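
To make the service-layer change concrete, here is a minimal sketch of how zones from a load balancer spec might be copied onto frontend IP configurations. The `LoadBalancerSpec` and `FrontendIPConfig` types below are simplified, hypothetical stand-ins (only the `AvailabilityZones` field name comes from this PR), and `applyZones` is an illustrative helper, not the function added by the PR.

```go
package main

import "fmt"

// LoadBalancerSpec is a trimmed-down, hypothetical stand-in for the API type.
// Only the AvailabilityZones field name is taken from the PR description.
type LoadBalancerSpec struct {
	Name              string
	Type              string // "Internal" or "Public" (illustrative values)
	AvailabilityZones []string
}

// FrontendIPConfig is a hypothetical stand-in for an Azure frontend IP
// configuration.
type FrontendIPConfig struct {
	Name  string
	Zones []string
}

// applyZones returns a copy of frontends with the load balancer's zones set on
// each entry when the load balancer is internal. Public load balancers are
// left untouched here because, per the PR description, their zones belong on
// the associated public IP addresses.
func applyZones(lb LoadBalancerSpec, frontends []FrontendIPConfig) []FrontendIPConfig {
	out := make([]FrontendIPConfig, len(frontends))
	copy(out, frontends)
	if lb.Type != "Internal" || len(lb.AvailabilityZones) == 0 {
		return out
	}
	for i := range out {
		// Copy the slice so later changes to the spec do not alias the result.
		out[i].Zones = append([]string(nil), lb.AvailabilityZones...)
	}
	return out
}

func main() {
	lb := LoadBalancerSpec{
		Name:              "internal-lb",
		Type:              "Internal",
		AvailabilityZones: []string{"1", "2", "3"},
	}
	got := applyZones(lb, []FrontendIPConfig{{Name: "frontend-1"}})
	fmt.Println(got[0].Name, got[0].Zones)
}
```

The sketch returns a new slice rather than mutating its input, which keeps the helper easy to unit test; the real implementation may well structure this differently.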

Which issue(s) this PR fixes:
Fixes #5709

Special notes for your reviewer:

Note

This PR was generated with the assistance of AI tooling (Claude Code).

This implementation follows Azure's zone redundancy model:

  • For internal load balancers: zones are set directly on frontend IP configurations
  • For public load balancers: zones should be set on associated public IP addresses (documented)
  • Zones are immutable after creation per Azure platform requirements
  • Webhook validation prevents invalid zone modifications
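
The immutability rule above can be sketched as a small update check. This is an illustration under stated assumptions, not the webhook code from this PR: `validateZonesUpdate`, `sameZones`, and `errZonesImmutable` are hypothetical names, and treating a pure reordering of zones as "unchanged" is an assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errZonesImmutable is a hypothetical sentinel error; a real webhook would
// more likely return a field-level validation error.
var errZonesImmutable = errors.New("availabilityZones is immutable after creation")

// sameZones reports whether two zone lists hold the same entries, ignoring
// order. Ignoring order is an assumption made for this sketch.
func sameZones(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string(nil), a...)
	y := append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// validateZonesUpdate rejects any update that changes the set of zones.
func validateZonesUpdate(oldZones, newZones []string) error {
	if sameZones(oldZones, newZones) {
		return nil
	}
	return fmt.Errorf("%w: had %v, got %v", errZonesImmutable, oldZones, newZones)
}

func main() {
	// Same zones in a different order: accepted in this sketch.
	fmt.Println(validateZonesUpdate([]string{"1", "2", "3"}, []string{"3", "2", "1"}))
	// Dropping a zone after creation: rejected.
	fmt.Println(validateZonesUpdate([]string{"1", "2", "3"}, []string{"1", "2"}))
}
```

Note that this check also rejects adding zones to a load balancer that was created without any, which matches the "immutable after creation" wording above.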

Zone-redundant LB verification is integrated into the private cluster E2E test per reviewer feedback, rather than as a standalone test.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

Add support for zone-redundant load balancers. Users can now configure availability zones on load balancers (APIServerLB, NodeOutboundLB, ControlPlaneOutboundLB) to enable zone-redundant configurations for high availability across multiple availability zones.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. kind/feature Categorizes issue or PR as related to a new feature. labels Oct 24, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 24, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jont828 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 24, 2025
@bryan-cox bryan-cox force-pushed the issue-5709-lb-zone-redundancy branch from 7572b39 to 67685f0 Compare October 24, 2025 17:04
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 24, 2025
@bryan-cox bryan-cox force-pushed the issue-5709-lb-zone-redundancy branch from 67685f0 to 2e5d373 Compare October 24, 2025 17:10
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 24, 2025
@bryan-cox bryan-cox force-pushed the issue-5709-lb-zone-redundancy branch from 2e5d373 to e29bc2e Compare October 24, 2025 17:13
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Oct 24, 2025
@codecov

codecov Bot commented Oct 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.03%. Comparing base (1360e4a) to head (6a57e82).
⚠️ Report is 62 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5944      +/-   ##
==========================================
+ Coverage   43.85%   44.03%   +0.17%     
==========================================
  Files         289      289              
  Lines       25341    25386      +45     
==========================================
+ Hits        11113    11178      +65     
+ Misses      13450    13431      -19     
+ Partials      778      777       -1     


@jackfrancis
Contributor

@bryan-cox can you add this new functionality to the existing E2E scenario for a private cluster, which ships with an internal LB? E.g.:

$ git diff templates/flavors/private/patches/private-lb.yaml
diff --git a/templates/flavors/private/patches/private-lb.yaml b/templates/flavors/private/patches/private-lb.yaml
index 76e1539df..a2933e299 100644
--- a/templates/flavors/private/patches/private-lb.yaml
+++ b/templates/flavors/private/patches/private-lb.yaml
@@ -7,6 +7,10 @@ spec:
     apiServerLB:
       name: ${CLUSTER_NAME}-internal-lb
       type: Internal
+      availabilityZones:
+        - "1"
+        - "2"
+        - "3"
     nodeOutboundLB:
       frontendIPsCount: 1
     controlPlaneOutboundLB:

After you apply the changes above to the template partial, render the updated templates with kustomize by running make generate-flavors from the git root directory.

cc @nojnhuh @mboersma

@jackfrancis
Contributor

/test pull-cluster-api-provider-azure-e2e-optional

@bryan-cox bryan-cox force-pushed the issue-5709-lb-zone-redundancy branch from e29bc2e to 3b77777 Compare October 27, 2025 20:20
@bryan-cox
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@bryan-cox
Contributor Author

/retest

@bryan-cox bryan-cox force-pushed the issue-5709-lb-zone-redundancy branch from 3b77777 to 6fa7de7 Compare October 28, 2025 10:34
@bryan-cox
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@bryan-cox bryan-cox force-pushed the issue-5709-lb-zone-redundancy branch from 36f5c8f to e69a17e Compare October 31, 2025 12:55
@bryan-cox
Contributor Author

Attempting to get the PR back to its stable state before attempting to address #5944 (comment) again.

@bryan-cox
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 15, 2026
@bryan-cox
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@bryan-cox bryan-cox force-pushed the issue-5709-lb-zone-redundancy branch from eae35df to acc0ae9 Compare April 30, 2026 18:23
@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 30, 2026
@bryan-cox bryan-cox force-pushed the issue-5709-lb-zone-redundancy branch 5 times, most recently from 44ea9ba to b2749c7 Compare April 30, 2026 20:08
Add AvailabilityZones to LoadBalancerSpec for zone-redundant load
balancer support. For internal LBs, zones are applied to frontend IP
configurations. For public LBs, zones are applied to the associated
public IP addresses.

Includes webhook immutability validation for APIServerLB and
NodeOutboundLB zones, and CRD regeneration.

Add documentation page covering zone-redundant load balancer
configuration for both internal and public LBs, including examples,
migration guidance, and troubleshooting.

Configure availability zones ["1", "2", "3"] on the internal API server
load balancer in the private cluster flavor template and CI overlay.

Add unit tests for webhook immutability validation, PublicIPSpecs with
LB availability zones, and load balancer frontend IP zone configuration.
Add E2E verification of zone-redundant internal LB into the private
cluster test.

@bryan-cox bryan-cox force-pushed the issue-5709-lb-zone-redundancy branch from b2749c7 to 6a57e82 Compare April 30, 2026 20:09
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Apr 30, 2026
@bryan-cox
Contributor Author

/test pull-cluster-api-provider-azure-e2e

@bryan-cox
Contributor Author

/test pull-cluster-api-provider-azure-e2e-optional

@bryan-cox
Contributor Author

Retesting - the previous run failed due to an Azure DNS infrastructure flake. The management cluster API server DNS (capz-e2e-*-public-custom-vnet-*.canadacentral.cloudapp.azure.com) stopped resolving mid-test, causing the private cluster test and 5 other unrelated tests to fail at WaitForKubeadmControlPlaneMachinesToExist. No code changes since last run.

/test pull-cluster-api-provider-azure-e2e-optional

@bryan-cox
Contributor Author

/retest

@k8s-ci-robot
Contributor

k8s-ci-robot commented May 7, 2026

@bryan-cox: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-provider-azure-capi-e2e | 67685f0 | link | false | /test pull-cluster-api-provider-azure-capi-e2e |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@bryan-cox
Contributor Author

apiversion-upgrade failure is a pre-existing flake

The pull-cluster-api-provider-azure-apiversion-upgrade test is failing at a rate of roughly 50% across all PRs and is not specific to this change. Evidence from recent Prow job history:

| PR | Author | Change | apiversion-upgrade results |
| --- | --- | --- | --- |
| #6262 | @mboersma | Bump CAPI to v1.13.1 | 5 consecutive failures before passing on 6th run |
| #6287 | @willie-yao | Update test metadata/versions | 1 fail, 2 passes |
| #5944 | this PR | Zone-redundant LB | 2 fails, 2 passes |

PR #6262 is a simple dependency bump with no test-affecting changes — the same commit failed 5 times in a row before passing with zero code changes.

All failures share the same pattern: azureserviceoperator-controller-manager deployment timeout after clusterctl upgrade apply, followed by management cluster API server becoming unreachable (DNS resolution failures, TCP timeouts, HTTP/2 connection drops).

/retest pull-cluster-api-provider-azure-apiversion-upgrade

@bryan-cox
Contributor Author

/test pull-cluster-api-provider-azure-apiversion-upgrade

@willie-yao
Contributor

@bryan-cox Is this ready for review?

@bryan-cox
Contributor Author

@willie-yao hey 👋🏻 - yeah it is. I forgot to check it after running the test again.

@bryan-cox bryan-cox changed the title WIP: Add support for zone-redundant load balancers Add support for zone-redundant load balancers May 12, 2026
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 12, 2026

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

Load balancers are not zone redundant and can't be configured as such

4 participants