Fix: add dry-run AzureCluster create to ensure CA bundle availability by mboersma · Pull Request #6221 · kubernetes-sigs/cluster-api-provider-azure

mboersma · 2026-04-08T19:58:49Z

What type of PR is this?

/kind flake

What this PR does / why we need it:

Fixes a flaky e2e test failure (the well-known "x509 error") in which the kube-apiserver has not yet picked up updated webhook CA bundles from its informer cache, even though cert-manager's cainjector has already populated them on the webhook configurations.

After the existing check that waits for CA bundle injection into all ValidatingWebhookConfigurations and MutatingWebhookConfigurations, this adds a dry-run AzureCluster create to verify the CAPZ mutating webhook is actually reachable end-to-end with valid TLS. This closes the race window between the CA bundle being written and the apiserver serving requests through the webhook with the new certificate.

Which issue(s) this PR fixes:
Fixes #5690 (hopefully)
See also #6144, which apparently didn't work. :-(

Special notes for your reviewer:

TODOs:

squashed commits
includes documentation
adds unit tests
cherry-pick candidate

Release note:

NONE

codecov · 2026-04-08T20:01:56Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.85%. Comparing base (d359ee9) to head (08f5da0).
⚠️ Report is 24 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6221      +/-   ##
==========================================
+ Coverage   43.66%   43.85%   +0.19%     
==========================================
  Files         289      289              
  Lines       25495    25341     -154     
==========================================
- Hits        11132    11113      -19     
+ Misses      13561    13450     -111     
+ Partials      802      778      -24

Commit 6a44fd3 (Treat webhook validation rejections as success in dry-run probe): A validation error (Invalid/Forbidden) from the webhook proves TLS is working end-to-end, which is all the probe needs to verify. Only retry on errors that indicate TLS is not yet ready.

mboersma · 2026-04-09T02:41:00Z

/label tide/merge-method-squash

mboersma · 2026-04-09T17:21:38Z

/test pull-cluster-api-provider-azure-e2e

This bug isn't deterministic, so we can't easily know if this fixes it. I'll run tests a few times and we can make a judgement call.

mboersma · 2026-04-09T19:15:42Z

/test pull-cluster-api-provider-azure-e2e

No failures yet... 🤞🏻

mboersma · 2026-04-10T16:50:15Z

/test pull-cluster-api-provider-azure-e2e

mboersma · 2026-04-10T19:57:13Z

/test pull-cluster-api-provider-azure-e2e

OK, we got one failure, but it was not #5960. Instead, not all nodes were provisioned before the timeout. So this fix hasn't been invalidated yet.

willie-yao

/lgtm
/approve

Great work! Since the earlier PR didn't fix it, I'm in favor of reverting it. wdyt? Especially the change to rotationPolicy: Never, which I think unintentionally messed with Tilt reloads.

k8s-ci-robot · 2026-04-10T20:13:59Z

LGTM label has been added.

Details

Git tree hash: a057a77bcce4fe75e7684e3c6e6ef52c0794de1e

k8s-ci-robot · 2026-04-10T20:14:00Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: willie-yao

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [willie-yao]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

willie-yao · 2026-04-10T20:16:51Z

It looks like the latest run ran into the certificate error: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/6221/pull-cluster-api-provider-azure-e2e/2042693419544875008

mboersma · 2026-04-10T20:17:38Z

/hold

Checking most recent failure.

Commit 08f5da0 (Increase dry-run webhook probe timeout to 5 minutes): The 2-minute timeout was not enough for the apiserver informer cache to converge in CI. Match the existing CA bundle injection timeout.

willie-yao

/lgtm

k8s-ci-robot · 2026-04-20T21:26:38Z

LGTM label has been added.

Details

Git tree hash: 36aa7e2ed55b58c81c1eeeab579cf1398d3b21b1

willie-yao · 2026-04-21T00:39:43Z

@mboersma feel free to remove hold if the failure is taken care of

nojnhuh · 2026-04-21T19:43:49Z

+	// Even after the CABundle is populated on the webhook configuration, the
+	// kube-apiserver may not have picked up the updated config from its
+	// informer cache yet. Perform a dry-run create of an AzureCluster to
+	// verify the CAPZ mutating webhook is reachable end-to-end with valid TLS.


The big mystery for me is that even the default ApplyClusterTemplateAndWait does retries over the course of one minute. Either the webhooks work the first time or they fail several times in a row for 1 minute. Is this check basically doing that same thing but for 5 minutes? Have we seen cases in this PR where the webhooks fail for more than 1 minute but less than 5 minutes?

Fix: add dry-run AzureCluster create to ensure CA bundle availability

99a0162

github-project-automation Bot added this to CAPZ Planning Apr 8, 2026

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. labels Apr 8, 2026

github-project-automation Bot moved this to Todo in CAPZ Planning Apr 8, 2026

k8s-ci-robot requested review from bryan-cox and jsturtevant April 8, 2026 19:58

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 8, 2026

mboersma changed the title ~~Fix: add dry-run AzureCluster create to ensure CA bundle availability~~ [WIP] Fix: add dry-run AzureCluster create to ensure CA bundle availability Apr 8, 2026

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 8, 2026

Treat webhook validation rejections as success in dry-run probe

6a44fd3

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 9, 2026

k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Apr 9, 2026

mboersma changed the title ~~[WIP] Fix: add dry-run AzureCluster create to ensure CA bundle availability~~ Fix: add dry-run AzureCluster create to ensure CA bundle availability Apr 9, 2026

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 9, 2026

mboersma requested review from jackfrancis and willie-yao and removed request for jsturtevant April 10, 2026 16:54

willie-yao approved these changes Apr 10, 2026

k8s-ci-robot assigned willie-yao Apr 10, 2026

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 10, 2026

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 10, 2026

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 10, 2026

Increase dry-run webhook probe timeout to 5 minutes

08f5da0

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 20, 2026

k8s-ci-robot requested a review from willie-yao April 20, 2026 16:33

willie-yao approved these changes Apr 20, 2026

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 20, 2026

nojnhuh reviewed Apr 21, 2026
