test: fix workload identity e2e test#3154

Merged
andyzhangx merged 6 commits into kubernetes-sigs:master from andyzhangx:test-premium-lrs on May 16, 2026

Conversation

@andyzhangx
Member

@andyzhangx andyzhangx commented May 16, 2026

What this PR does

Improves the workload identity (WI) e2e test reliability by porting the AAD OIDC cache warm-up infrastructure from kubernetes-sigs/blob-csi-driver#2445.

Changes

test/e2e/suite_test.go:

  • setupWorkloadIdentity() now returns the client ID for background warm-up
  • Added waitForOIDCJWKS() — polls OIDC JWKS endpoint until signing keys are available
  • Added verifyJWKSKeyMatch() — detects and repairs CAPZ JWKS key mismatch (root cause of AADSTS7000272)
  • Added waitForAADTokenExchange() — background goroutine polls AAD token exchange for up to 45min, running in parallel with other tests
  • WI test blocks on <-wiReady channel to ensure AAD cache is warm before mounting
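
The JWKS readiness check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `pollJWKS`, `fetchKeyCount`, and `demoPollJWKS` are hypothetical, the polling interval is arbitrary, and a local `httptest` stub stands in for the real OIDC issuer endpoint.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// jwks mirrors only the part of a JSON Web Key Set that this sketch needs.
type jwks struct {
	Keys []struct {
		Kid string `json:"kid"`
	} `json:"keys"`
}

// fetchKeyCount (hypothetical helper) returns how many keys the endpoint serves.
func fetchKeyCount(ctx context.Context, url string) (int, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var doc jwks
	if err := json.NewDecoder(resp.Body).Decode(&doc); err != nil {
		return 0, err
	}
	return len(doc.Keys), nil
}

// pollJWKS (hypothetical) retries until at least one key is served or ctx expires.
func pollJWKS(ctx context.Context, url string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if n, err := fetchKeyCount(ctx, url); err == nil && n > 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return errors.New("JWKS endpoint did not return signing keys before the deadline")
		case <-ticker.C:
		}
	}
}

// demoPollJWKS polls a stub server that serves an empty key set on its first two requests.
func demoPollJWKS() error {
	var calls int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&calls, 1) < 3 {
			fmt.Fprint(w, `{"keys":[]}`)
			return
		}
		fmt.Fprint(w, `{"keys":[{"kid":"key-1"}]}`)
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return pollJWKS(ctx, srv.URL, 10*time.Millisecond)
}

func main() {
	fmt.Println("poll result:", demoPollJWKS())
}
```

Bounding the loop with a context deadline keeps a missing or empty key set from hanging the suite indefinitely.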

test/e2e/dynamic_provisioning_test.go:

  • Changed WI test ClaimSize from 10Gi to 100Gi (Premium_LRS minimum share is 100GiB)
  • Changed skuName from Standard_LRS to Premium_LRS
  • Test now waits for AAD warm-up completion before running

test/e2e/testsuites/:

  • Removed unused SetAutomountServiceAccountToken() — CSI tokenRequests are handled by kubelet, no in-container token mount needed

Why

AAD caches OIDC metadata independently and can take 20 to 30 minutes or more to accept token exchanges for a new OIDC issuer. Without the warm-up, the WI test pod stays stuck in the Pending state waiting for the CSI volume mount, which depends on a successful token exchange.

Additionally, CAPZ clusters have a known race condition where the JWKS blob in Azure Storage may contain a stale signing key different from what kube-apiserver actually uses, causing permanent AADSTS7000272 rejections.
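
One way to detect such a mismatch is to compare the key IDs in the published JWKS with those in the JWKS that kube-apiserver serves (its discovery path is `/openid/v1/jwks`). The sketch below assumes both documents have already been fetched and that comparing `kid` values is sufficient. The PR may compare keys differently, and `missingKeyIDs` is a hypothetical name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// keySet holds only the "kid" field of each key in a JWKS document.
type keySet struct {
	Keys []struct {
		Kid string `json:"kid"`
	} `json:"keys"`
}

// keyIDs parses a JWKS document and returns the set of key IDs it contains.
func keyIDs(raw []byte) (map[string]bool, error) {
	var ks keySet
	if err := json.Unmarshal(raw, &ks); err != nil {
		return nil, err
	}
	ids := make(map[string]bool, len(ks.Keys))
	for _, k := range ks.Keys {
		ids[k.Kid] = true
	}
	return ids, nil
}

// missingKeyIDs (hypothetical) lists key IDs present in the live JWKS
// but absent from the published one, sorted for stable output.
func missingKeyIDs(published, live []byte) ([]string, error) {
	pub, err := keyIDs(published)
	if err != nil {
		return nil, fmt.Errorf("published JWKS: %w", err)
	}
	cur, err := keyIDs(live)
	if err != nil {
		return nil, fmt.Errorf("live JWKS: %w", err)
	}
	var missing []string
	for id := range cur {
		if !pub[id] {
			missing = append(missing, id)
		}
	}
	sort.Strings(missing)
	return missing, nil
}

func main() {
	// Illustrative documents only; real JWKS entries carry full key material.
	published := []byte(`{"keys":[{"kid":"old-key"}]}`)
	live := []byte(`{"keys":[{"kid":"new-key"}]}`)
	missing, err := missingKeyIDs(published, live)
	fmt.Println(missing, err)
}
```

A non-empty result means tokens signed by the API server cannot be validated against the published keys, which is the point at which a repair step would re-upload the live JWKS.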

Related: kubernetes-sigs/blob-csi-driver#2445, #3118

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from cvvz and gnufied May 16, 2026 02:24
@k8s-ci-robot k8s-ci-robot added the approved, size/XS, and cncf-cla: yes labels May 16, 2026
Premium_LRS has a minimum share size of 100GiB. The test was requesting
10Gi, which Azure auto-expanded to 100Gi, causing the pvCapacity
assertion to fail.
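
The failure mode in this commit message can be reproduced with a small model. It assumes the backend simply rounds a request up to the 100GiB minimum and that the test compares the provisioned size with the requested size. `provisionedGiB` and `capacityMatches` are hypothetical names, not functions from the test suite.

```go
package main

import "fmt"

// premiumMinShareGiB is the 100GiB minimum quoted in the commit message.
const premiumMinShareGiB int64 = 100

// provisionedGiB models the assumed auto-expansion of undersized requests.
func provisionedGiB(requestedGiB int64) int64 {
	if requestedGiB < premiumMinShareGiB {
		return premiumMinShareGiB
	}
	return requestedGiB
}

// capacityMatches models an assertion that the PV capacity equals the claim size.
func capacityMatches(claimGiB int64) bool {
	return provisionedGiB(claimGiB) == claimGiB
}

func main() {
	for _, claim := range []int64{10, 100} {
		fmt.Printf("claim=%dGi provisioned=%dGi assertion passes=%t\n",
			claim, provisionedGiB(claim), capacityMatches(claim))
	}
}
```

Requesting the minimum size up front keeps the requested and provisioned capacities equal, so the assertion holds.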
…ad identity e2e test

Port critical workload identity infrastructure from blob-csi-driver PR kubernetes-sigs#2445:

1. Background AAD token exchange warm-up (waitForAADTokenExchange) - polls AAD
   for up to 45min until token exchange succeeds, running in parallel with other
   tests to avoid blocking the suite.

2. OIDC JWKS readiness check (waitForOIDCJWKS) - ensures the JWKS endpoint
   returns valid signing keys before proceeding.

3. CAPZ JWKS key mismatch detection and repair (verifyJWKSKeyMatch) - detects
   when blob-hosted JWKS has different signing keys than kube-apiserver and
   re-uploads the correct JWKS. Without this, AAD permanently rejects token
   exchanges with AADSTS7000272.

4. WI test now waits on wiReady channel for warm-up completion before running,
   preventing the 30min pod Pending timeout.

5. setupWorkloadIdentity now returns (clientID, error) to pass the client ID
   to the background warm-up goroutine.

Also adds github.com/Azure/azure-sdk-for-go/sdk/storage/azblob dependency
for blob upload operations.
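
The background warm-up and the `wiReady` wait from items 1 and 4 follow a common Go pattern: a goroutine retries an operation and closes a channel when it finishes, and the dependent test blocks on that channel. The sketch below is an assumption-laden illustration, not the PR's implementation. `startWarmUp` and `demoWarmUp` are hypothetical, and a stubbed `try` function stands in for the real AAD token exchange.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// startWarmUp (hypothetical) retries try in the background until it succeeds
// or ctx expires. The returned channel is closed when the goroutine finishes;
// the returned function waits for that and reports the final outcome.
func startWarmUp(ctx context.Context, interval time.Duration, try func() error) (<-chan struct{}, func() error) {
	done := make(chan struct{})
	var finalErr error // written before close(done), read only after <-done
	go func() {
		defer close(done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			lastErr := try()
			if lastErr == nil {
				return // success: finalErr stays nil
			}
			select {
			case <-ctx.Done():
				finalErr = fmt.Errorf("warm-up gave up: %w (last error: %v)", ctx.Err(), lastErr)
				return
			case <-ticker.C:
			}
		}
	}()
	return done, func() error {
		<-done
		return finalErr
	}
}

// demoWarmUp uses a stub that rejects the first two attempts.
func demoWarmUp() (int, error) {
	attempts := 0
	try := func() error {
		attempts++
		if attempts < 3 {
			return errors.New("token exchange rejected")
		}
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	ready, result := startWarmUp(ctx, 5*time.Millisecond, try)
	// Unrelated tests would run here while the warm-up proceeds.
	<-ready // the dependent test blocks until the warm-up has finished
	return attempts, result()
}

func main() {
	attempts, err := demoWarmUp()
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Closing the channel instead of sending on it lets any number of waiters proceed, and the write to `finalErr` before the close is visible to every reader that has received from the channel.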
@k8s-ci-robot k8s-ci-robot added the size/L label and removed the size/XS label May 16, 2026
@k8s-ci-robot k8s-ci-robot added the size/XXL label and removed the size/L label May 16, 2026
…t false for WI

CSI tokenRequests for workload identity are handled by kubelet based on
the pod's service account; no in-container token mount is needed.
@andyzhangx andyzhangx changed the title from "test: change WI mount test to use Premium_LRS" to "test: add AAD OIDC warm-up for workload identity e2e test" May 16, 2026
@andyzhangx andyzhangx changed the title from "test: add AAD OIDC warm-up for workload identity e2e test" to "test: fix workload identity e2e test" May 16, 2026
@andyzhangx andyzhangx requested a review from Copilot May 16, 2026 12:43

Copilot AI left a comment


Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@andyzhangx andyzhangx requested a review from Copilot May 16, 2026 12:45

Copilot AI left a comment


Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@andyzhangx andyzhangx merged commit d40accc into kubernetes-sigs:master May 16, 2026
21 of 22 checks passed
@andyzhangx
Member Author

/cherrypick release-1.35

@k8s-infra-cherrypick-robot

@andyzhangx: Failed to get PR patch from GitHub. This PR will need to be manually cherrypicked.

Error message: status code 406 not one of [200], body: {"message":"Sorry, the diff exceeded the maximum number of lines (20000)","errors":[{"resource":"PullRequest","field":"diff","code":"too_large"}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#get-a-pull-request","status":"406"}

In response to this:

/cherrypick release-1.35

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

4 participants