Fix: add dry-run AzureCluster create to ensure CA bundle availability#6221
Fix: add dry-run AzureCluster create to ensure CA bundle availability#6221mboersma wants to merge 3 commits into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #6221 +/- ##
==========================================
+ Coverage 43.66% 43.85% +0.19%
==========================================
Files 289 289
Lines 25495 25341 -154
==========================================
- Hits 11132 11113 -19
+ Misses 13561 13450 -111
+ Partials 802 778 -24 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
A validation error (Invalid/Forbidden) from the webhook proves TLS is working end-to-end, which is all the probe needs to verify. Only retry on errors that indicate TLS is not yet ready.
|
/label tide/merge-method-squash |
|
/test pull-cluster-api-provider-azure-e2e This bug isn't deterministic, so we can't easily know if this fixes it. I'll run tests a few times and we can make a judgement call. |
|
/test pull-cluster-api-provider-azure-e2e No failures yet... 🤞🏻 |
|
/test pull-cluster-api-provider-azure-e2e |
|
/test pull-cluster-api-provider-azure-e2e Ok, we got one failure but it was not #5960, instead we didn't get all nodes provisioned before the timeout. So this fix hasn't yet been invalidated. |
|
LGTM label has been added. DetailsGit tree hash: a057a77bcce4fe75e7684e3c6e6ef52c0794de1e |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: willie-yao The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
It looks like the latest run ran into the certificate error: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/6221/pull-cluster-api-provider-azure-e2e/2042693419544875008 |
|
/hold Checking most recent failure. |
The 2-minute timeout was not enough for the apiserver informer cache to converge in CI. Match the existing CA bundle injection timeout.
|
LGTM label has been added. DetailsGit tree hash: 36aa7e2ed55b58c81c1eeeab579cf1398d3b21b1 |
|
@mboersma feel free to remove hold if the failure is taken care of |
| // Even after the CABundle is populated on the webhook configuration, the | ||
| // kube-apiserver may not have picked up the updated config from its | ||
| // informer cache yet. Perform a dry-run create of an AzureCluster to | ||
| // verify the CAPZ mutating webhook is reachable end-to-end with valid TLS. |
There was a problem hiding this comment.
The big mystery for me is that even the default ApplyCusterTemplateAndWait does retries over the course of one minute. Either the webhooks work the first time or they fail several times in a row for 1 minute. Is this check basically doing that same thing but for 5 minutes? Have we seen cases in this PR where the webhooks fail for more than 1 minute but less than 5 minutes?
What type of PR is this?
/kind flake
What this PR does / why we need it:
Fixes a flaky e2e test failure where the kube-apiserver hasn't yet picked up updated webhook CA bundles from its informer cache, even though cert-manager's cainjector has already populated them on the webhook configurations: the well-known "x509 error."
After the existing check that waits for CA bundle injection into all ValidatingWebhookConfigurations and MutatingWebhookConfigurations, this adds a dry-run
AzureClustercreate to verify the CAPZ mutating webhook is actually reachable end-to-end with valid TLS. This closes the race window between the CA bundle being written and the apiserver serving requests through the webhook with the new certificate.Which issue(s) this PR fixes:
Fixes #5690 (hopefully)
See also #6144, which apparently didn't work. :-(
Special notes for your reviewer:
TODOs:
Release note: