[WIP] fix: clean up orphaned file share on CreateVolume failure by andyzhangx · Pull Request #3151 · kubernetes-sigs/azurefile-csi-driver

andyzhangx · 2026-05-13T02:15:28Z

What type of PR is this?

/kind bug

What this PR does / why we need it

When restoring a PVC from a VolumeSnapshot, CreateVolume creates the destination file share before starting the azcopy data copy. If azcopy or auth setup fails, the function returns an error but never deletes the already-created file share — leaving it orphaned in Azure Storage with no way to clean it up through normal CSI operations (no PV/snapshot exists to trigger DeleteVolume/DeleteSnapshot).

This PR adds best-effort DeleteFileShare cleanup at all three failure paths after the share has been created:

First getAzcopyAuth failure
Fallback getAzcopyAuth failure (SAS token retry after AuthorizationPermissionMismatch)
copyVolume (azcopy) failure

Cleanup errors are logged as warnings but do not mask the original error.
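The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the driver's code: `restoreWithCleanup`, the `steps` slice, the `deleteShare` callback, and both sentinel errors are hypothetical names standing in for the getAzcopyAuth/copyVolume steps and the DeleteFileShare call.

```go
package main

import (
	"errors"
	"log"
)

// Illustrative sentinel errors (assumptions for this sketch only).
var (
	errAzcopy       = errors.New("azcopy copy failed")
	errDeleteDenied = errors.New("delete file share denied")
)

// restoreWithCleanup runs the steps that follow share creation in order.
// On the first failing step it attempts a best-effort delete of the share
// and then returns the step's error unchanged.
func restoreWithCleanup(shareName string, steps []func() error, deleteShare func(string) error) error {
	for _, step := range steps {
		if err := step(); err != nil {
			if delErr := deleteShare(shareName); delErr != nil {
				// A cleanup failure is only logged; it must not replace err.
				log.Printf("warning: cleanup of file share %q failed: %v", shareName, delErr)
			}
			return err
		}
	}
	return nil
}
```

Returning the step's error untouched keeps the caller's view of the failure the same whether or not the cleanup itself succeeded.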

Which issue(s) this PR fixes

Fixes #3149

How to test

Follow the reproduction steps in Orphaned File Share left on CreateVolume failure #3149 (firewall the destination storage account data plane)
Create a PVC from snapshot targeting the firewalled account
Verify that azcopy fails as expected
Confirm the file share is cleaned up automatically (no orphan left in the storage account)

k8s-ci-robot · 2026-05-13T02:15:46Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx


Needs approval from an approver in each of these files:

~~OWNERS~~ [andyzhangx]


When restoring from a VolumeSnapshot, CreateVolume creates the destination file share before starting the azcopy data copy. If azcopy (or auth setup) fails, the function returns an error but never deletes the already-created file share, leaving it orphaned.

Add best-effort cleanup (DeleteFileShare) at all three failure paths after the share has been created:

1. First getAzcopyAuth failure
2. Fallback getAzcopyAuth failure (SAS token retry)
3. copyVolume (azcopy) failure

Cleanup errors are logged as warnings but do not mask the original error returned to the caller.

Fixes kubernetes-sigs#3149

Copilot

Pull request overview

This PR addresses an orphan-resource bug in the Azure File CSI driver by adding best-effort rollback deletion of a newly created destination file share when CreateVolume (snapshot restore / volume cloning path) fails after share creation. This helps prevent leaked Azure File Shares when azcopy/auth setup fails and no PV/snapshot is created to trigger normal CSI deletion flows.

Changes:

Track whether the destination file share was created during CreateVolume and attempt best-effort cleanup on subsequent failures.
Add DeleteFileShare cleanup on getAzcopyAuth failure, fallback getAzcopyAuth failure, and copyVolume (azcopy) failure.
Log cleanup failures as warnings while preserving the original error return.


…Exists

When ShareAlreadyExists is returned during snapshot restore/cloning, distinguish between:

- User-specified share (fileShareName != ''): skip cleanup since the share was pre-existing and not created by CSI
- Auto-generated share (fileShareName == ''): keep cleanup enabled since ShareAlreadyExists likely means a previous CSI attempt created it but failed later (the exact orphan scenario)

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

pkg/azurefile/controllerserver.go:741

The ShareAlreadyExists error branch toggles shareCreatedByCSI based on fileShareName != "", but it doesn’t cover the common path where CreateFileShare returns nil because the share already exists (quota pre-check). That means a user-provided shareName can still be deleted on later azcopy/auth failures. Suggest explicitly setting shareCreatedByCSI to false whenever fileShareName is provided (or when the share is detected as already existing) before reaching the azcopy steps.

		if req.GetVolumeContentSource() != nil && strings.Contains(err.Error(), "ShareAlreadyExists") {
			// for snapshot restore and volume cloning, ignore ShareAlreadyExists error since the file share should be created first
			klog.Warningf("create file share(%s) on account(%s) type(%s) subID(%s) rg(%s) location(%s) size(%d), ignore ShareAlreadyExists error for snapshot restore and volume cloning", validFileShareName, accountName, sku, subsID, resourceGroup, location, fileShareSize)
			// If the share name was auto-generated by CSI (fileShareName is empty),
			// ShareAlreadyExists likely means a previous CreateVolume attempt created it
			// but failed later. We should still clean it up on subsequent failures.
			// Only skip cleanup when the user explicitly specified a share name.
			if fileShareName != "" {
				shareCreatedByCSI = false
			}

…e-existing shares

When fileShareName is user-specified (non-empty), default shareCreatedByCSI to false since CreateFileShare silently succeeds when the share already exists (via internal GetFileShareQuota pre-check returning nil). This prevents cleanup from deleting a pre-existing user share on subsequent azcopy failures.

For auto-generated share names (fileShareName is empty), CSI owns the lifecycle, so shareCreatedByCSI defaults to true, including the ShareAlreadyExists case (previous failed CSI attempt).
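The eligibility rule in this commit message reduces to a single check on whether the share name came from the user. A minimal sketch, assuming a hypothetical helper named `cleanupEligible` (the PR itself tracks this in a local variable):

```go
package main

// cleanupEligible reports whether a share may be deleted after a later
// failure. A non-empty fileShareName means the user chose the name, so the
// share may have existed before this request and must be left alone. An
// empty fileShareName means the driver generated the name and owns the
// share's lifecycle, including a share left behind by an earlier attempt.
func cleanupEligible(fileShareName string) bool {
	return fileShareName == ""
}
```

Deriving the default from the name alone avoids relying on the ShareAlreadyExists error, which is not returned when the existence pre-check short-circuits share creation.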

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Extract the duplicated cleanup logic (3 call sites) into a local closure that takes a reason string for accurate log messages. Also fixes the fallback getAzcopyAuth log message to correctly say 'fallback getAzcopyAuth failure' instead of 'getAzcopyAuth failure'.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

…und context

- Move cleanupShareOnFailure from inline closure to a Driver method
- Use background context with 2-minute timeout for cleanup to avoid inheriting a cancelled/expired context from the original request
- Rename shareCreatedByCSI to shouldCleanupShare to better reflect the intent (cleanup eligibility, not creation tracking)
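The background-context point can be illustrated with a short sketch. The names `cleanupDetached`, `cleanupTimeout`, and `failIfDone` are assumptions made for this example; only the 2-minute timeout and the use of a background context come from the commit message.

```go
package main

import (
	"context"
	"time"
)

// cleanupTimeout bounds the cleanup call, matching the 2-minute value
// mentioned in the commit message.
const cleanupTimeout = 2 * time.Minute

// cleanupDetached runs deleteShare under a fresh context derived from
// context.Background rather than from the request context, so a request
// that has already been cancelled or timed out cannot abort the cleanup.
func cleanupDetached(deleteShare func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), cleanupTimeout)
	defer cancel()
	return deleteShare(ctx)
}

// failIfDone is a stand-in deleter that only reports whether the context
// it received is still usable.
func failIfDone(ctx context.Context) error {
	return ctx.Err()
}
```

If the cleanup reused the request context instead, a CreateVolume call that failed because its deadline expired would pass an already-done context to the delete call, and the delete would likely fail immediately.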

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

The helper always creates its own background context for cleanup, so the passed ctx was never used. Remove it to avoid confusion.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

jmclong · 2026-05-13T19:00:33Z

@andyzhangx, if CreateVolume times out because azcopy is taking a long time, am I right in thinking we would just clean up on the next CreateVolume call if we get an error? And that an azcopy timeout wouldn't trigger a cleanup?

I'm a bit worried that the ctx cancel could cause a cleanup, for example when a larger file share takes a while to copy, resulting in a loop where we'd be unable to make progress.

mittachaitu · 2026-05-14T08:55:42Z

 			klog.Warningf("azcopy copy failed with AuthorizationPermissionMismatch error, should assign \"Storage File Data Privileged Contributor\" role to controller identity, fall back to use sas token, original error: %v", copyErr)
 			accountSASToken, authAzcopyEnv, err := d.getAzcopyAuth(ctx, accountName, accountKey, storageEndpointSuffix, accountOptions, secret, secretName, secretNamespace, true)
 			if err != nil {
+				d.cleanupShareOnFailure(shouldCleanupShare, accountName, validFileShareName, subsID, resourceGroup, secret, useDataPlaneAPI, "fallback getAzcopyAuth failure")


nit:

Suggested change

-				d.cleanupShareOnFailure(shouldCleanupShare, accountName, validFileShareName, subsID, resourceGroup, secret, useDataPlaneAPI, "fallback getAzcopyAuth failure")
+				d.cleanupShareOnFailure(shouldCleanupShare, accountName, validFileShareName, subsID, resourceGroup, secret, useDataPlaneAPI, "sas token fallback getAzcopyAuth failure")

mittachaitu · 2026-05-14T09:00:59Z

 			klog.Warningf("azcopy copy failed with AuthorizationPermissionMismatch error, should assign \"Storage File Data Privileged Contributor\" role to controller identity, fall back to use sas token, original error: %v", copyErr)
 			accountSASToken, authAzcopyEnv, err := d.getAzcopyAuth(ctx, accountName, accountKey, storageEndpointSuffix, accountOptions, secret, secretName, secretNamespace, true)
 			if err != nil {
+				d.cleanupShareOnFailure(shouldCleanupShare, accountName, validFileShareName, subsID, resourceGroup, secret, useDataPlaneAPI, "fallback getAzcopyAuth failure")


I believe we should also check the status of any ongoing azcopy job (it might have been triggered by a previous reconciliation) before triggering cleanup.

Move the azcopy job status check into cleanupShareOnFailure so that all callers are protected. Before deleting the share, check GetAzcopyJob — if the job state is AzcopyJobRunning, preserve the share so retries can resume rather than starting from zero.

k8s-ci-robot · 2026-05-14T11:26:44Z

@andyzhangx: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-azurefile-csi-driver-e2e-capz | `98bf8ae` | link | true | `/test pull-azurefile-csi-driver-e2e-capz` |


mittachaitu

One minor comment; other than that, the PR LGTM.

mittachaitu · 2026-05-15T07:42:07Z

+		// Check if an azcopy job is still running for this share — if so,
+		// skip cleanup to avoid orphaning the job and losing partial progress.
+		jobState, _, err := d.azcopy.GetAzcopyJob(shareName, []string{})
+		if err == nil && jobState == util.AzcopyJobRunning {


It would also be good to skip deletion when the job has succeeded (this helps avoid deletion if there is a race between the original timeout and job completion).
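The suggestion above, combined with the running-job check in the quoted diff, could look roughly like this. It is a sketch under assumptions: `jobState` and its constants are hypothetical local stand-ins rather than the driver's `util` types, and `shouldDeleteShare` is an invented helper name.

```go
package main

// jobState is a hypothetical stand-in for the azcopy job state type.
type jobState string

const (
	jobNotFound  jobState = "NotFound"
	jobRunning   jobState = "Running"
	jobSucceeded jobState = "Succeeded"
	jobFailed    jobState = "Failed"
)

// shouldDeleteShare decides whether cleanup may delete the share, given
// the result of looking up the azcopy job for that share.
func shouldDeleteShare(state jobState, lookupErr error) bool {
	if lookupErr != nil {
		// As in the quoted diff, the skip only applies when the lookup
		// succeeded; an unknown state does not block cleanup.
		return true
	}
	switch state {
	case jobRunning, jobSucceeded:
		// Running: a retry can resume instead of starting from zero.
		// Succeeded: the copy finished, possibly racing the timeout.
		return false
	default:
		return true
	}
}
```

Treating a lookup error as "proceed with cleanup" follows the `err == nil &&` condition in the diff; whether that is the right default when the job state is unknown is a separate design question.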

k8s-ci-robot added the kind/bug, cncf-cla: yes, and size/S labels May 13, 2026

k8s-ci-robot requested a review from cvvz May 13, 2026 02:15

k8s-ci-robot requested a review from gnufied May 13, 2026 02:15

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 13, 2026

andyzhangx force-pushed the fix-orphan-fileshare-cleanup branch from 344feb9 to 1613d0e Compare May 13, 2026 02:17

andyzhangx requested a review from Copilot May 13, 2026 02:21

Copilot started reviewing on behalf of andyzhangx May 13, 2026 02:22