fix(clone): resume errored azcopy jobs during clone creation#3095
fix(clone): resume errored azcopy jobs during clone creation#3095mittachaitu wants to merge 2 commits into
Conversation
This commit resumes errored jobs during clone creation from volumesnapshot Signed-off-by: Mitta Sai Chaithanya <mittas@microsoft.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: mittachaitu The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @mittachaitu. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Pull request overview
This PR updates the Azure File CSI driver’s clone-from-snapshot flow to attempt resuming failed/skipped AzCopy jobs instead of only retrying after a driver restart.
Changes:
- Extend
Azcopy.GetAzcopyJobto return the AzCopyjobidalong with state and percent. - Add controller logic to call
azcopy jobs resume <jobid>for certain non-successful job terminal states during clone creation. - Update unit test call sites to match the new
GetAzcopyJobsignature.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| pkg/util/util.go | Changes GetAzcopyJob signature to include returning the AzCopy job ID. |
| pkg/util/util_test.go | Updates the GetAzcopyJob test call site for the new return value. |
| pkg/azurefile/controllerserver.go | Uses returned jobid for logging and introduces execAzcopyResume + resume logic in clone flow. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| klog.V(2).Infof("copy fileshare %s:%s to %s:%s", srcAccountName, srcFileShareName, dstAccountName, dstFileShareName) | ||
| execAzcopyJob := func() error { | ||
| if out, err := d.execAzcopyCopy(srcPathAuth, dstPath, azcopyCopyOptions, authAzcopyEnv); err != nil { | ||
| // Just log the error here since azcopy allows to resume the failed jobs and the job will be resumed in the next re-concile loop |
There was a problem hiding this comment.
Typo in comment: "re-concile" should be "reconcile".
| // Just log the error here since azcopy allows to resume the failed jobs and the job will be resumed in the next re-concile loop | |
| // Just log the error here since azcopy allows to resume the failed jobs and the job will be resumed in the next reconcile loop |
| // Use --trusted-microsoft-suffixes option to avoid failure caused by | ||
| if d.requiredAzCopyToTrust { | ||
| azcopyResumeOptions = append(azcopyResumeOptions, fmt.Sprintf("--trusted-microsoft-suffixes=%s", d.getStorageEndPointSuffix())) | ||
| } |
There was a problem hiding this comment.
The comment "Use --trusted-microsoft-suffixes option to avoid failure caused by" is incomplete, which makes it unclear what failure mode this option is addressing. Please either complete the sentence (e.g., describe the specific trust/suffix validation issue) or remove the comment if it’s redundant with other documentation.
| jobState, percent, jobid, err := azcopyFunc.GetAzcopyJob(dstFileshare, []string{}) | ||
| if jobState != test.expectedJobState || percent != test.expectedPercent || !reflect.DeepEqual(err, test.expectedErr) { | ||
| t.Errorf("test[%s]: unexpected jobState: %v, percent: %v, err: %v, expected jobState: %v, percent: %v, err: %v", test.desc, jobState, percent, err, test.expectedJobState, test.expectedPercent, test.expectedErr) | ||
| t.Errorf("test[%s]: unexpected jobState: %v, percent: %v, jobid: %v, err: %v, expected jobState: %v, percent: %v, jobid: %v, err: %v", test.desc, jobState, percent, jobid, err, test.expectedJobState, test.expectedPercent, jobid, test.expectedErr) |
There was a problem hiding this comment.
TestGetAzcopyJob's failure message prints the actual jobid in both the "got" and "expected" portions, which makes failures misleading. Either add an expectedJobID field to the test cases and compare it, or remove the expected jobid formatting from the error message.
| t.Errorf("test[%s]: unexpected jobState: %v, percent: %v, jobid: %v, err: %v, expected jobState: %v, percent: %v, jobid: %v, err: %v", test.desc, jobState, percent, jobid, err, test.expectedJobState, test.expectedPercent, jobid, test.expectedErr) | |
| t.Errorf("test[%s]: unexpected jobState: %v, percent: %v, jobid: %v, err: %v, expected jobState: %v, percent: %v, err: %v", test.desc, jobState, percent, jobid, err, test.expectedJobState, test.expectedPercent, test.expectedErr) |
| jobid, jobState, err := parseAzcopyJobList(out) | ||
| if err != nil || jobState == AzcopyJobError { | ||
| klog.Warningf("failed to get azcopy job with error: %v, jobState: %v", err, jobState) | ||
| return AzcopyJobError, "", fmt.Errorf("couldn't parse azcopy job list in azcopy %v", err) | ||
| return AzcopyJobError, "", jobid, fmt.Errorf("couldn't parse azcopy job list in azcopy %v", err) | ||
| } | ||
| if jobState == AzcopyJobCompleted || jobState == AzcopyJobCompletedWithErrors || jobState == AzcopyJobCompletedWithSkipped || jobState == AzcopyJobCompletedWithErrorsAndSkipped { | ||
| return jobState, "100.0", err | ||
| return jobState, "100.0", jobid, err | ||
| } | ||
| if jobid == "" { | ||
| return jobState, "", err | ||
| return jobState, "", jobid, err | ||
| } |
There was a problem hiding this comment.
GetAzcopyJob now returns a jobid, but parseAzcopyJobList currently returns an empty jobid for completed states (Completed/CompletedWithErrors/CompletedWithSkipped/CompletedWithErrorsAndSkipped). That means callers attempting to resume a failed/skipped job will likely get an empty jobID and resume will fail. Update the parsing logic to return the current segment's JobId for completed-with-errors/skipped states (and adjust unit tests that currently expect an empty jobid for those statuses).
| case util.AzcopyJobError, util.AzcopyJobCompletedWithErrors, util.AzcopyJobCompletedWithSkipped, util.AzcopyJobCompletedWithErrorsAndSkipped: | ||
| // Resume job that are failed with errors |
There was a problem hiding this comment.
The resume path includes util.AzcopyJobError, but AzcopyJobError can mean command/list/show/parsing failures (not a resumable AzCopy job). Resuming in this state can mask the real error and also commonly runs with an empty/unknown jobID. Consider handling AzcopyJobError by returning the original error (or re-running copy) and only attempting jobs resume for the explicit completed-with-errors/skipped states when jobid is non-empty.
| case util.AzcopyJobError, util.AzcopyJobCompletedWithErrors, util.AzcopyJobCompletedWithSkipped, util.AzcopyJobCompletedWithErrorsAndSkipped: | |
| // Resume job that are failed with errors | |
| case util.AzcopyJobError: | |
| klog.Errorf("Failed to get a resumable azcopy job for copying fileshare %s to %s, status: %s, copy percent: %s%%, jobid: %s, error: %v", srcFileShareName, dstFileShareName, jobState, percent, jobid, err) | |
| if err != nil { | |
| return err | |
| } | |
| return fmt.Errorf("failed to get a resumable azcopy job for copying fileshare %s to %s, status: %s, copy percent: %s%%, jobid: %s", srcFileShareName, dstFileShareName, jobState, percent, jobid) | |
| case util.AzcopyJobCompletedWithErrors, util.AzcopyJobCompletedWithSkipped, util.AzcopyJobCompletedWithErrorsAndSkipped: | |
| // Resume job that completed with errors/skipped entries. | |
| if jobid == "" { | |
| klog.Errorf("Copy of fileshare %s to %s completed with resumable status %s but jobid is empty, copy percent: %s%%, error: %v", srcFileShareName, dstFileShareName, jobState, percent, err) | |
| if err != nil { | |
| return err | |
| } | |
| return fmt.Errorf("azcopy job cannot be resumed for fileshare %s to %s: empty jobid for status %s", srcFileShareName, dstFileShareName, jobState) | |
| } |
| resumeAzcopyJob := func() error { | ||
| klog.V(2).Infof("Resuming azcopy job with id: %s for copying fileshare %s to %s", jobid, srcFileShareName, dstFileShareName) | ||
| if resumeOut, resumeErr := d.execAzcopyResume(jobid, authAzcopyEnv); resumeErr != nil { | ||
| klog.Errorf("Failed to resume azcopy job with id: %s for copying fileshare %s to %s, error: %v, output: %s", jobid, srcFileShareName, dstFileShareName, resumeErr, string(resumeOut)) |
There was a problem hiding this comment.
Before calling execAzcopyResume(jobid, ...), ensure jobid is non-empty. With the current GetAzcopyJob/parsing behavior it's possible to enter this branch with an empty jobID (e.g., completed-with-errors/skipped states), which will invoke azcopy jobs resume "" and fail with a confusing error.
|
Unknown CLA label state. Rechecking for CLA labels. Send feedback to sig-contributor-experience at kubernetes/community. /check-cla |
Signed-off-by: Mitta Sai Chaithanya <mittas@microsoft.com>
This commit resumes errored jobs during clone creation from volumesnapshot
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR resume errored/skipped azcopy jobs during clone creation.
As of now, when azcopy clone job fails then there is no retry mechanisam
till the csi-driver restarts. This PR trigger
azcopy jobs resume <job id>on errored jobs during clone creation from snapshot.
Which issue(s) this PR fixes:
Fixes #
Requirements:
Special notes for your reviewer:
Release note: