fix(reset): add etcd backup directory cleanup to prevent redeployment failures#13221
fix(reset): add etcd backup directory cleanup to prevent redeployment failures#13221Irshu786 wants to merge 2 commits into
Conversation
Add tasks to the reset role to find and remove etcd backup directories.
|
Welcome @Irshu786! |
|
Hi @Irshu786. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Irshu786 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
Adds cleanup logic to the reset role so that etcd backup directories created under /var/backups (e.g. /var/backups/etcd-*) are removed during a cluster reset, aligning reset behavior with “remove all etcd-related data”.
Changes:
- Find directories matching
etcd-*under/var/backups. - Remove any matched etcd backup directories during
playbooks/reset.ymlruns.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Thank you for your contribution. |
|
/easycla |
|
I think the CLA error is because of copilot commit! |
|
Please follow pull request template.
|
|
/ok-to-test |
|
/retest-failed |
1 similar comment
|
/retest-failed |
|
/retest |
| - /etc/origin/ovn | ||
| - "{{ sysctl_file_path }}" | ||
| - /etc/crictl.yaml | ||
| - "{{ etcd_backup_prefix }}" |
There was a problem hiding this comment.
I tested it, this will fail because that variable is undefined when reset playbook is running.
and this is not a good idea to delete everything in /var/backups dir, we need to delete {{ etcd_backup_prefix }}/etcd-* files. but i don't know how to access that variable.
There was a problem hiding this comment.
can we add a task rather than env approach?
|
Thanks for the fix @Irshu786! One small concern: reset wiping the etcd backup directory feels a bit risky — backups are often the last recovery option if something goes wrong, including after an accidental reset. Also, the current diff removes the whole If the underlying issue is that stale backup dirs break redeployment, would it make sense to harden - name: Remove old etcd backups
ansible.builtin.file:
state: absent
path: "{{ item }}"
loop: "{{ (_etcd_backups.files | sort(attribute='ctime', reverse=True))[etcd_backup_retention_count:] | map(attribute='path') }}"
when: etcd_backup_retention_count >= 0
ignore_errors: true # noqa ignore-errors - don't block etcd restart on stale backup cleanupJust a thought — curious what others think. |
Thanks, That makes sense !! can i update the PR with suggested changes? |


What type of PR is this?
/kind fix
What this PR does / why we need it:
This PR removes the etcd backup directory in the reset role.
The reset playbook did not remove the etcd backup directory (/var/backups/etcd by default). As a result, stale backup data from the previous deployment remained on the node, causing subsequent cluster deployments to fail or hang during the etcd backup job.
This can lead to increased operational toil, with users spending additional time troubleshooting failed or stuck deployments.
Special notes for your reviewer:
The etcd_backup_prefix variable already exists in Kubespray and points to the etcd backup directory (e.g., /var/backups/etcd)
Does this PR introduce a user-facing change?:
NONE