Skip to content

fix(clone): resume errored azcopy jobs during clone creation#3095

Open
mittachaitu wants to merge 2 commits into
kubernetes-sigs:masterfrom
mittachaitu:del_fileshare
Open

fix(clone): resume errored azcopy jobs during clone creation#3095
mittachaitu wants to merge 2 commits into
kubernetes-sigs:masterfrom
mittachaitu:del_fileshare

Conversation

@mittachaitu
Copy link
Copy Markdown
Contributor

This commit resumes errored jobs during clone creation from volumesnapshot

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR resume errored/skipped azcopy jobs during clone creation.
As of now, when azcopy clone job fails then there is no retry mechanisam
till the csi-driver restarts. This PR trigger azcopy jobs resume <job id>
on errored jobs during clone creation from snapshot.

Which issue(s) this PR fixes:

Fixes #

Requirements:

Special notes for your reviewer:

Release note:

none

This commit resumes errored jobs during clone creation
from volumesnapshot

Signed-off-by: Mitta Sai Chaithanya <mittas@microsoft.com>
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 18, 2026
@k8s-ci-robot k8s-ci-robot requested review from andyzhangx and cvvz April 18, 2026 13:02
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mittachaitu
Once this PR has been reviewed and has the lgtm label, please assign feiskyer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 18, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @mittachaitu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 18, 2026
@andyzhangx andyzhangx requested a review from Copilot April 18, 2026 13:16
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Azure File CSI driver’s clone-from-snapshot flow to attempt resuming failed/skipped AzCopy jobs instead of only retrying after a driver restart.

Changes:

  • Extend Azcopy.GetAzcopyJob to return the AzCopy jobid along with state and percent.
  • Add controller logic to call azcopy jobs resume <jobid> for certain non-successful job terminal states during clone creation.
  • Update unit test call sites to match the new GetAzcopyJob signature.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
pkg/util/util.go Changes GetAzcopyJob signature to include returning the AzCopy job ID.
pkg/util/util_test.go Updates the GetAzcopyJob test call site for the new return value.
pkg/azurefile/controllerserver.go Uses returned jobid for logging and introduces execAzcopyResume + resume logic in clone flow.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

klog.V(2).Infof("copy fileshare %s:%s to %s:%s", srcAccountName, srcFileShareName, dstAccountName, dstFileShareName)
execAzcopyJob := func() error {
if out, err := d.execAzcopyCopy(srcPathAuth, dstPath, azcopyCopyOptions, authAzcopyEnv); err != nil {
// Just log the error here since azcopy allows to resume the failed jobs and the job will be resumed in the next re-concile loop
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in comment: "re-concile" should be "reconcile".

Suggested change
// Just log the error here since azcopy allows to resume the failed jobs and the job will be resumed in the next re-concile loop
// Just log the error here since azcopy allows to resume the failed jobs and the job will be resumed in the next reconcile loop

Copilot uses AI. Check for mistakes.
Comment on lines +1318 to +1321
// Use --trusted-microsoft-suffixes option to avoid failure caused by
if d.requiredAzCopyToTrust {
azcopyResumeOptions = append(azcopyResumeOptions, fmt.Sprintf("--trusted-microsoft-suffixes=%s", d.getStorageEndPointSuffix()))
}
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment "Use --trusted-microsoft-suffixes option to avoid failure caused by" is incomplete, which makes it unclear what failure mode this option is addressing. Please either complete the sentence (e.g., describe the specific trust/suffix validation issue) or remove the comment if it’s redundant with other documentation.

Copilot uses AI. Check for mistakes.
Comment thread pkg/util/util_test.go
jobState, percent, jobid, err := azcopyFunc.GetAzcopyJob(dstFileshare, []string{})
if jobState != test.expectedJobState || percent != test.expectedPercent || !reflect.DeepEqual(err, test.expectedErr) {
t.Errorf("test[%s]: unexpected jobState: %v, percent: %v, err: %v, expected jobState: %v, percent: %v, err: %v", test.desc, jobState, percent, err, test.expectedJobState, test.expectedPercent, test.expectedErr)
t.Errorf("test[%s]: unexpected jobState: %v, percent: %v, jobid: %v, err: %v, expected jobState: %v, percent: %v, jobid: %v, err: %v", test.desc, jobState, percent, jobid, err, test.expectedJobState, test.expectedPercent, jobid, test.expectedErr)
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestGetAzcopyJob's failure message prints the actual jobid in both the "got" and "expected" portions, which makes failures misleading. Either add an expectedJobID field to the test cases and compare it, or remove the expected jobid formatting from the error message.

Suggested change
t.Errorf("test[%s]: unexpected jobState: %v, percent: %v, jobid: %v, err: %v, expected jobState: %v, percent: %v, jobid: %v, err: %v", test.desc, jobState, percent, jobid, err, test.expectedJobState, test.expectedPercent, jobid, test.expectedErr)
t.Errorf("test[%s]: unexpected jobState: %v, percent: %v, jobid: %v, err: %v, expected jobState: %v, percent: %v, err: %v", test.desc, jobState, percent, jobid, err, test.expectedJobState, test.expectedPercent, test.expectedErr)

Copilot uses AI. Check for mistakes.
Comment thread pkg/util/util.go
Comment on lines 119 to 129
jobid, jobState, err := parseAzcopyJobList(out)
if err != nil || jobState == AzcopyJobError {
klog.Warningf("failed to get azcopy job with error: %v, jobState: %v", err, jobState)
return AzcopyJobError, "", fmt.Errorf("couldn't parse azcopy job list in azcopy %v", err)
return AzcopyJobError, "", jobid, fmt.Errorf("couldn't parse azcopy job list in azcopy %v", err)
}
if jobState == AzcopyJobCompleted || jobState == AzcopyJobCompletedWithErrors || jobState == AzcopyJobCompletedWithSkipped || jobState == AzcopyJobCompletedWithErrorsAndSkipped {
return jobState, "100.0", err
return jobState, "100.0", jobid, err
}
if jobid == "" {
return jobState, "", err
return jobState, "", jobid, err
}
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetAzcopyJob now returns a jobid, but parseAzcopyJobList currently returns an empty jobid for completed states (Completed/CompletedWithErrors/CompletedWithSkipped/CompletedWithErrorsAndSkipped). That means callers attempting to resume a failed/skipped job will likely get an empty jobID and resume will fail. Update the parsing logic to return the current segment's JobId for completed-with-errors/skipped states (and adjust unit tests that currently expect an empty jobid for those statuses).

Copilot uses AI. Check for mistakes.
Comment on lines +1240 to +1241
case util.AzcopyJobError, util.AzcopyJobCompletedWithErrors, util.AzcopyJobCompletedWithSkipped, util.AzcopyJobCompletedWithErrorsAndSkipped:
// Resume job that are failed with errors
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resume path includes util.AzcopyJobError, but AzcopyJobError can mean command/list/show/parsing failures (not a resumable AzCopy job). Resuming in this state can mask the real error and also commonly runs with an empty/unknown jobID. Consider handling AzcopyJobError by returning the original error (or re-running copy) and only attempting jobs resume for the explicit completed-with-errors/skipped states when jobid is non-empty.

Suggested change
case util.AzcopyJobError, util.AzcopyJobCompletedWithErrors, util.AzcopyJobCompletedWithSkipped, util.AzcopyJobCompletedWithErrorsAndSkipped:
// Resume job that are failed with errors
case util.AzcopyJobError:
klog.Errorf("Failed to get a resumable azcopy job for copying fileshare %s to %s, status: %s, copy percent: %s%%, jobid: %s, error: %v", srcFileShareName, dstFileShareName, jobState, percent, jobid, err)
if err != nil {
return err
}
return fmt.Errorf("failed to get a resumable azcopy job for copying fileshare %s to %s, status: %s, copy percent: %s%%, jobid: %s", srcFileShareName, dstFileShareName, jobState, percent, jobid)
case util.AzcopyJobCompletedWithErrors, util.AzcopyJobCompletedWithSkipped, util.AzcopyJobCompletedWithErrorsAndSkipped:
// Resume job that completed with errors/skipped entries.
if jobid == "" {
klog.Errorf("Copy of fileshare %s to %s completed with resumable status %s but jobid is empty, copy percent: %s%%, error: %v", srcFileShareName, dstFileShareName, jobState, percent, err)
if err != nil {
return err
}
return fmt.Errorf("azcopy job cannot be resumed for fileshare %s to %s: empty jobid for status %s", srcFileShareName, dstFileShareName, jobState)
}

Copilot uses AI. Check for mistakes.
Comment on lines +1243 to +1246
resumeAzcopyJob := func() error {
klog.V(2).Infof("Resuming azcopy job with id: %s for copying fileshare %s to %s", jobid, srcFileShareName, dstFileShareName)
if resumeOut, resumeErr := d.execAzcopyResume(jobid, authAzcopyEnv); resumeErr != nil {
klog.Errorf("Failed to resume azcopy job with id: %s for copying fileshare %s to %s, error: %v, output: %s", jobid, srcFileShareName, dstFileShareName, resumeErr, string(resumeOut))
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before calling execAzcopyResume(jobid, ...), ensure jobid is non-empty. With the current GetAzcopyJob/parsing behavior it's possible to enter this branch with an empty jobID (e.g., completed-with-errors/skipped states), which will invoke azcopy jobs resume "" and fail with a confusing error.

Copilot uses AI. Check for mistakes.
@k8s-triage-robot
Copy link
Copy Markdown

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-contributor-experience at kubernetes/community.

/check-cla
/easycla

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 18, 2026
Signed-off-by: Mitta Sai Chaithanya <mittas@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants