Skip to content

[WIP] fix: clean up orphaned file share on CreateVolume failure#3151

Open
andyzhangx wants to merge 9 commits into
kubernetes-sigs:masterfrom
andyzhangx:fix-orphan-fileshare-cleanup
Open

[WIP] fix: clean up orphaned file share on CreateVolume failure#3151
andyzhangx wants to merge 9 commits into
kubernetes-sigs:masterfrom
andyzhangx:fix-orphan-fileshare-cleanup

Conversation

@andyzhangx
Copy link
Copy Markdown
Member

What type of PR is this?

/kind bug

What this PR does / why we need it

When restoring a PVC from a VolumeSnapshot, CreateVolume creates the destination file share before starting the azcopy data copy. If azcopy or auth setup fails, the function returns an error but never deletes the already-created file share — leaving it orphaned in Azure Storage with no way to clean it up through normal CSI operations (no PV/snapshot exists to trigger DeleteVolume/DeleteSnapshot).

This PR adds best-effort DeleteFileShare cleanup at all three failure paths after the share has been created:

  1. First getAzcopyAuth failure
  2. Fallback getAzcopyAuth failure (SAS token retry after AuthorizationPermissionMismatch)
  3. copyVolume (azcopy) failure

Cleanup errors are logged as warnings but do not mask the original error.

Which issue(s) this PR fixes

Fixes #3149

How to test

  1. Follow the reproduction steps in Orphaned File Share left on CreateVolume failure #3149 (firewall the destination storage account data plane)
  2. Create a PVC from snapshot targeting the firewalled account
  3. Verify that azcopy fails as expected
  4. Confirm the file share is cleaned up automatically (no orphan left in the storage account)

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 13, 2026
@k8s-ci-robot k8s-ci-robot requested a review from cvvz May 13, 2026 02:15
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from gnufied May 13, 2026 02:15
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 13, 2026
When restoring from a VolumeSnapshot, CreateVolume creates the
destination file share before starting the azcopy data copy. If
azcopy (or auth setup) fails, the function returns an error but
never deletes the already-created file share, leaving it orphaned.

Add best-effort cleanup (DeleteFileShare) at all three failure
paths after the share has been created:
1. First getAzcopyAuth failure
2. Fallback getAzcopyAuth failure (SAS token retry)
3. copyVolume (azcopy) failure

Cleanup errors are logged as warnings but do not mask the original
error returned to the caller.

Fixes kubernetes-sigs#3149
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses an orphan-resource bug in the Azure File CSI driver by adding best-effort rollback deletion of a newly created destination file share when CreateVolume (snapshot restore / volume cloning path) fails after share creation. This helps prevent leaked Azure File Shares when azcopy/auth setup fails and no PV/snapshot is created to trigger normal CSI deletion flows.

Changes:

  • Track whether the destination file share was created during CreateVolume and attempt best-effort cleanup on subsequent failures.
  • Add DeleteFileShare cleanup on getAzcopyAuth failure, fallback getAzcopyAuth failure, and copyVolume (azcopy) failure.
  • Log cleanup failures as warnings while preserving the original error return.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/azurefile/controllerserver.go
Comment thread pkg/azurefile/controllerserver.go
…Exists

When ShareAlreadyExists is returned during snapshot restore/cloning,
distinguish between:
- User-specified share (fileShareName != ''): skip cleanup since the
  share was pre-existing and not created by CSI
- Auto-generated share (fileShareName == ''): keep cleanup enabled
  since ShareAlreadyExists likely means a previous CSI attempt
  created it but failed later (the exact orphan scenario)
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

pkg/azurefile/controllerserver.go:741

  • The ShareAlreadyExists error branch toggles shareCreatedByCSI based on fileShareName != "", but it doesn’t cover the common path where CreateFileShare returns nil because the share already exists (quota pre-check). That means a user-provided shareName can still be deleted on later azcopy/auth failures. Suggest explicitly setting shareCreatedByCSI to false whenever fileShareName is provided (or when the share is detected as already existing) before reaching the azcopy steps.
		if req.GetVolumeContentSource() != nil && strings.Contains(err.Error(), "ShareAlreadyExists") {
			// for snapshot restore and volume cloning, ignore ShareAlreadyExists error since the file share should be created first
			klog.Warningf("create file share(%s) on account(%s) type(%s) subID(%s) rg(%s) location(%s) size(%d), ignore ShareAlreadyExists error for snapshot restore and volume cloning", validFileShareName, accountName, sku, subsID, resourceGroup, location, fileShareSize)
			// If the share name was auto-generated by CSI (fileShareName is empty),
			// ShareAlreadyExists likely means a previous CreateVolume attempt created it
			// but failed later. We should still clean it up on subsequent failures.
			// Only skip cleanup when the user explicitly specified a share name.
			if fileShareName != "" {
				shareCreatedByCSI = false
			}

Comment thread pkg/azurefile/controllerserver.go
…e-existing shares

When fileShareName is user-specified (non-empty), default shareCreatedByCSI
to false since CreateFileShare silently succeeds when the share already exists
(via internal GetFileShareQuota pre-check returning nil). This prevents
cleanup from deleting a pre-existing user share on subsequent azcopy failures.

For auto-generated share names (fileShareName is empty), CSI owns the
lifecycle, so shareCreatedByCSI defaults to true — including the
ShareAlreadyExists case (previous failed CSI attempt).
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread pkg/azurefile/controllerserver.go Outdated
Comment thread pkg/azurefile/controllerserver.go Outdated
Extract the duplicated cleanup logic (3 call sites) into a local
closure that takes a reason string for accurate log messages.
Also fixes the fallback getAzcopyAuth log message to correctly
say 'fallback getAzcopyAuth failure' instead of 'getAzcopyAuth failure'.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread pkg/azurefile/controllerserver.go Outdated
Comment thread pkg/azurefile/controllerserver.go Outdated
…und context

- Move cleanupShareOnFailure from inline closure to a Driver method
- Use background context with 2-minute timeout for cleanup to avoid
  inheriting a cancelled/expired context from the original request
- Rename shareCreatedByCSI to shouldCleanupShare to better reflect
  the intent (cleanup eligibility, not creation tracking)
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread pkg/azurefile/controllerserver.go Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread pkg/azurefile/controllerserver.go
@andyzhangx andyzhangx requested a review from Copilot May 13, 2026 03:58
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread pkg/azurefile/controllerserver.go
The helper always creates its own background context for cleanup,
so the passed ctx was never used. Remove it to avoid confusion.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@jmclong
Copy link
Copy Markdown

jmclong commented May 13, 2026

@andyzhangx, if CreateVolume times out because azcopy is taking a long time, am I right in thinking we would just clean up the next CreateVolume call if we get an error? And that timeout of azcopy wouldn't trigger a cleanup?

I'm a bit worried that the ctx cancel would could a cleanup for example, when some larger file share could take a while to copy, resulting a loop where we'd be unable to make progress

Comment thread pkg/azurefile/controllerserver.go
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 14, 2026
klog.Warningf("azcopy copy failed with AuthorizationPermissionMismatch error, should assign \"Storage File Data Privileged Contributor\" role to controller identity, fall back to use sas token, original error: %v", copyErr)
accountSASToken, authAzcopyEnv, err := d.getAzcopyAuth(ctx, accountName, accountKey, storageEndpointSuffix, accountOptions, secret, secretName, secretNamespace, true)
if err != nil {
d.cleanupShareOnFailure(shouldCleanupShare, accountName, validFileShareName, subsID, resourceGroup, secret, useDataPlaneAPI, "fallback getAzcopyAuth failure")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
d.cleanupShareOnFailure(shouldCleanupShare, accountName, validFileShareName, subsID, resourceGroup, secret, useDataPlaneAPI, "fallback getAzcopyAuth failure")
d.cleanupShareOnFailure(shouldCleanupShare, accountName, validFileShareName, subsID, resourceGroup, secret, useDataPlaneAPI, "sas token fallback getAzcopyAuth failure")

klog.Warningf("azcopy copy failed with AuthorizationPermissionMismatch error, should assign \"Storage File Data Privileged Contributor\" role to controller identity, fall back to use sas token, original error: %v", copyErr)
accountSASToken, authAzcopyEnv, err := d.getAzcopyAuth(ctx, accountName, accountKey, storageEndpointSuffix, accountOptions, secret, secretName, secretNamespace, true)
if err != nil {
d.cleanupShareOnFailure(shouldCleanupShare, accountName, validFileShareName, subsID, resourceGroup, secret, useDataPlaneAPI, "fallback getAzcopyAuth failure")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should also check status of ongoing azcopy job (might be triggered by previous reconciliations) before triggering cleanup.

Move the azcopy job status check into cleanupShareOnFailure so that
all callers are protected. Before deleting the share, check
GetAzcopyJob — if the job state is AzcopyJobRunning, preserve the
share so retries can resume rather than starting from zero.
@andyzhangx andyzhangx force-pushed the fix-orphan-fileshare-cleanup branch from c805458 to 98bf8ae Compare May 14, 2026 09:09
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

@andyzhangx: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-azurefile-csi-driver-e2e-capz 98bf8ae link true /test pull-azurefile-csi-driver-e2e-capz

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Copy Markdown
Contributor

@mittachaitu mittachaitu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment other than that PR LGTM

// Check if an azcopy job is still running for this share — if so,
// skip cleanup to avoid orphaning the job and losing partial progress.
jobState, _, err := d.azcopy.GetAzcopyJob(shareName, []string{})
if err == nil && jobState == util.AzcopyJobRunning {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is also good to skip deletion when job is succeeded (Helps to avoid deletion if there is race b/w original timeout and job completion.

@andyzhangx andyzhangx changed the title fix: clean up orphaned file share on CreateVolume failure [WIP] fix: clean up orphaned file share on CreateVolume failure May 19, 2026
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Orphaned File Share left on CreateVolume failure

6 participants