[WIP] fix: clean up orphaned file share on CreateVolume failure#3151
[WIP] fix: clean up orphaned file share on CreateVolume failure#3151andyzhangx wants to merge 9 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andyzhangx The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
When restoring from a VolumeSnapshot, CreateVolume creates the destination file share before starting the azcopy data copy. If azcopy (or auth setup) fails, the function returns an error but never deletes the already-created file share, leaving it orphaned. Add best-effort cleanup (DeleteFileShare) at all three failure paths after the share has been created: 1. First getAzcopyAuth failure 2. Fallback getAzcopyAuth failure (SAS token retry) 3. copyVolume (azcopy) failure Cleanup errors are logged as warnings but do not mask the original error returned to the caller. Fixes kubernetes-sigs#3149
344feb9 to
1613d0e
Compare
There was a problem hiding this comment.
Pull request overview
This PR addresses an orphan-resource bug in the Azure File CSI driver by adding best-effort rollback deletion of a newly created destination file share when CreateVolume (snapshot restore / volume cloning path) fails after share creation. This helps prevent leaked Azure File Shares when azcopy/auth setup fails and no PV/snapshot is created to trigger normal CSI deletion flows.
Changes:
- Track whether the destination file share was created during
CreateVolumeand attempt best-effort cleanup on subsequent failures. - Add
DeleteFileSharecleanup ongetAzcopyAuthfailure, fallbackgetAzcopyAuthfailure, andcopyVolume(azcopy) failure. - Log cleanup failures as warnings while preserving the original error return.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…Exists When ShareAlreadyExists is returned during snapshot restore/cloning, distinguish between: - User-specified share (fileShareName != ''): skip cleanup since the share was pre-existing and not created by CSI - Auto-generated share (fileShareName == ''): keep cleanup enabled since ShareAlreadyExists likely means a previous CSI attempt created it but failed later (the exact orphan scenario)
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
pkg/azurefile/controllerserver.go:741
- The
ShareAlreadyExistserror branch togglesshareCreatedByCSIbased onfileShareName != "", but it doesn’t cover the common path whereCreateFileSharereturns nil because the share already exists (quota pre-check). That means a user-providedshareNamecan still be deleted on later azcopy/auth failures. Suggest explicitly settingshareCreatedByCSIto false wheneverfileShareNameis provided (or when the share is detected as already existing) before reaching the azcopy steps.
if req.GetVolumeContentSource() != nil && strings.Contains(err.Error(), "ShareAlreadyExists") {
// for snapshot restore and volume cloning, ignore ShareAlreadyExists error since the file share should be created first
klog.Warningf("create file share(%s) on account(%s) type(%s) subID(%s) rg(%s) location(%s) size(%d), ignore ShareAlreadyExists error for snapshot restore and volume cloning", validFileShareName, accountName, sku, subsID, resourceGroup, location, fileShareSize)
// If the share name was auto-generated by CSI (fileShareName is empty),
// ShareAlreadyExists likely means a previous CreateVolume attempt created it
// but failed later. We should still clean it up on subsequent failures.
// Only skip cleanup when the user explicitly specified a share name.
if fileShareName != "" {
shareCreatedByCSI = false
}
…e-existing shares When fileShareName is user-specified (non-empty), default shareCreatedByCSI to false since CreateFileShare silently succeeds when the share already exists (via internal GetFileShareQuota pre-check returning nil). This prevents cleanup from deleting a pre-existing user share on subsequent azcopy failures. For auto-generated share names (fileShareName is empty), CSI owns the lifecycle, so shareCreatedByCSI defaults to true — including the ShareAlreadyExists case (previous failed CSI attempt).
Extract the duplicated cleanup logic (3 call sites) into a local closure that takes a reason string for accurate log messages. Also fixes the fallback getAzcopyAuth log message to correctly say 'fallback getAzcopyAuth failure' instead of 'getAzcopyAuth failure'.
…und context - Move cleanupShareOnFailure from inline closure to a Driver method - Use background context with 2-minute timeout for cleanup to avoid inheriting a cancelled/expired context from the original request - Rename shareCreatedByCSI to shouldCleanupShare to better reflect the intent (cleanup eligibility, not creation tracking)
The helper always creates its own background context for cleanup, so the passed ctx was never used. Remove it to avoid confusion.
|
@andyzhangx, if CreateVolume times out because azcopy is taking a long time, am I right in thinking we would just clean up the next CreateVolume call if we get an error? And that timeout of azcopy wouldn't trigger a cleanup? I'm a bit worried that the ctx cancel would could a cleanup for example, when some larger file share could take a while to copy, resulting a loop where we'd be unable to make progress |
| klog.Warningf("azcopy copy failed with AuthorizationPermissionMismatch error, should assign \"Storage File Data Privileged Contributor\" role to controller identity, fall back to use sas token, original error: %v", copyErr) | ||
| accountSASToken, authAzcopyEnv, err := d.getAzcopyAuth(ctx, accountName, accountKey, storageEndpointSuffix, accountOptions, secret, secretName, secretNamespace, true) | ||
| if err != nil { | ||
| d.cleanupShareOnFailure(shouldCleanupShare, accountName, validFileShareName, subsID, resourceGroup, secret, useDataPlaneAPI, "fallback getAzcopyAuth failure") |
There was a problem hiding this comment.
nit:
| d.cleanupShareOnFailure(shouldCleanupShare, accountName, validFileShareName, subsID, resourceGroup, secret, useDataPlaneAPI, "fallback getAzcopyAuth failure") | |
| d.cleanupShareOnFailure(shouldCleanupShare, accountName, validFileShareName, subsID, resourceGroup, secret, useDataPlaneAPI, "sas token fallback getAzcopyAuth failure") |
| klog.Warningf("azcopy copy failed with AuthorizationPermissionMismatch error, should assign \"Storage File Data Privileged Contributor\" role to controller identity, fall back to use sas token, original error: %v", copyErr) | ||
| accountSASToken, authAzcopyEnv, err := d.getAzcopyAuth(ctx, accountName, accountKey, storageEndpointSuffix, accountOptions, secret, secretName, secretNamespace, true) | ||
| if err != nil { | ||
| d.cleanupShareOnFailure(shouldCleanupShare, accountName, validFileShareName, subsID, resourceGroup, secret, useDataPlaneAPI, "fallback getAzcopyAuth failure") |
There was a problem hiding this comment.
I believe we should also check status of ongoing azcopy job (might be triggered by previous reconciliations) before triggering cleanup.
Move the azcopy job status check into cleanupShareOnFailure so that all callers are protected. Before deleting the share, check GetAzcopyJob — if the job state is AzcopyJobRunning, preserve the share so retries can resume rather than starting from zero.
c805458 to
98bf8ae
Compare
|
@andyzhangx: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
mittachaitu
left a comment
There was a problem hiding this comment.
One minor comment other than that PR LGTM
| // Check if an azcopy job is still running for this share — if so, | ||
| // skip cleanup to avoid orphaning the job and losing partial progress. | ||
| jobState, _, err := d.azcopy.GetAzcopyJob(shareName, []string{}) | ||
| if err == nil && jobState == util.AzcopyJobRunning { |
There was a problem hiding this comment.
It is also good to skip deletion when job is succeeded (Helps to avoid deletion if there is race b/w original timeout and job completion.
What type of PR is this?
/kind bug
What this PR does / why we need it
When restoring a PVC from a VolumeSnapshot,
CreateVolumecreates the destination file share before starting the azcopy data copy. If azcopy or auth setup fails, the function returns an error but never deletes the already-created file share — leaving it orphaned in Azure Storage with no way to clean it up through normal CSI operations (no PV/snapshot exists to triggerDeleteVolume/DeleteSnapshot).This PR adds best-effort
DeleteFileSharecleanup at all three failure paths after the share has been created:getAzcopyAuthfailuregetAzcopyAuthfailure (SAS token retry afterAuthorizationPermissionMismatch)copyVolume(azcopy) failureCleanup errors are logged as warnings but do not mask the original error.
Which issue(s) this PR fixes
Fixes #3149
How to test