Fix: add dry-run AzureCluster create to ensure CA bundle availability#6221

Open
mboersma wants to merge 3 commits into
kubernetes-sigs:mainfrom
mboersma:fix-webhook-ca-flake

Conversation

@mboersma
Contributor

@mboersma mboersma commented Apr 8, 2026

What type of PR is this?

/kind flake

What this PR does / why we need it:

Fixes a flaky e2e test failure where the kube-apiserver hasn't yet picked up updated webhook CA bundles from its informer cache, even though cert-manager's cainjector has already populated them on the webhook configurations: the well-known "x509 error."

After the existing check that waits for CA bundle injection into all ValidatingWebhookConfigurations and MutatingWebhookConfigurations, this adds a dry-run AzureCluster create to verify the CAPZ mutating webhook is actually reachable end-to-end with valid TLS. This closes the race window between the CA bundle being written and the apiserver serving requests through the webhook with the new certificate.

Which issue(s) this PR fixes:
Fixes #5690 (hopefully)
See also #6144, which apparently didn't work. :-(

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. labels Apr 8, 2026
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 8, 2026
@codecov

codecov Bot commented Apr 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.85%. Comparing base (d359ee9) to head (08f5da0).
⚠️ Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6221      +/-   ##
==========================================
+ Coverage   43.66%   43.85%   +0.19%     
==========================================
  Files         289      289              
  Lines       25495    25341     -154     
==========================================
- Hits        11132    11113      -19     
+ Misses      13561    13450     -111     
+ Partials      802      778      -24     

@mboersma mboersma changed the title Fix: add dry-run AzureCluster create to ensure CA bundle availability [WIP] Fix: add dry-run AzureCluster create to ensure CA bundle availability Apr 8, 2026
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 8, 2026
A validation error (Invalid/Forbidden) from the webhook proves
TLS is working end-to-end, which is all the probe needs to verify.
Only retry on errors that indicate TLS is not yet ready.
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 9, 2026
@mboersma
Contributor Author

mboersma commented Apr 9, 2026

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Apr 9, 2026
@mboersma mboersma changed the title [WIP] Fix: add dry-run AzureCluster create to ensure CA bundle availability Fix: add dry-run AzureCluster create to ensure CA bundle availability Apr 9, 2026
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 9, 2026
@mboersma
Contributor Author

mboersma commented Apr 9, 2026

/test pull-cluster-api-provider-azure-e2e

This bug isn't deterministic, so we can't easily know if this fixes it. I'll run tests a few times and we can make a judgement call.

@mboersma
Contributor Author

mboersma commented Apr 9, 2026

/test pull-cluster-api-provider-azure-e2e

No failures yet... 🤞🏻

@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e

@mboersma mboersma requested review from jackfrancis and willie-yao and removed request for jsturtevant April 10, 2026 16:54
@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e

OK, we got one failure, but it was not #5960; instead, not all nodes were provisioned before the timeout. So this fix hasn't been invalidated yet.

Contributor

@willie-yao willie-yao left a comment

/lgtm
/approve

Great work! Since the earlier PR didn't fix it, I'm in favor of reverting it. wdyt? Especially the change to rotationPolicy: Never, which I think unintentionally broke Tilt reloads.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 10, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: a057a77bcce4fe75e7684e3c6e6ef52c0794de1e

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: willie-yao

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 10, 2026
@willie-yao
Contributor

@mboersma
Contributor Author

/hold

Checking most recent failure.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 10, 2026
The 2-minute timeout was not enough for the apiserver informer cache
to converge in CI. Match the existing CA bundle injection timeout.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 20, 2026
@k8s-ci-robot k8s-ci-robot requested a review from willie-yao April 20, 2026 16:33
Contributor

@willie-yao willie-yao left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 20, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 36aa7e2ed55b58c81c1eeeab579cf1398d3b21b1

@willie-yao
Contributor

@mboersma feel free to remove hold if the failure is taken care of

Comment thread test/e2e/helpers.go
Comment on lines +860 to +863
// Even after the CABundle is populated on the webhook configuration, the
// kube-apiserver may not have picked up the updated config from its
// informer cache yet. Perform a dry-run create of an AzureCluster to
// verify the CAPZ mutating webhook is reachable end-to-end with valid TLS.
Contributor

The big mystery for me is that even the default ApplyClusterTemplateAndWait retries over the course of one minute. Either the webhooks work the first time, or they fail several times in a row for one minute. Is this check basically doing the same thing, but for 5 minutes? Have we seen cases in this PR where the webhooks fail for more than 1 minute but less than 5 minutes?

Labels

  • approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.)
  • kind/flake (Categorizes issue or PR as related to a flaky test.)
  • lgtm ("Looks good to me", indicates that a PR is ready to be merged.)
  • release-note-none (Denotes a PR that doesn't merit a release note.)
  • size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)
  • tide/merge-method-squash (Denotes a PR that should be squashed by tide when it merges.)

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

Webhooks sometimes fail with certificate errors in e2e

4 participants