Added retry for docker pull in e2e-common.sh#11238
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
Hi @ivnovakov. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Related to #10296. |
|
/ok-to-test |
| fi | ||
| echo "$output" | ||
|
|
||
| if echo "$output" | grep -qiE 'manifest (unknown|for .* not found)|repository does not exist|not found|pull access denied|unauthorized|denied: requested access|no space left on device'; then |
There was a problem hiding this comment.
What was the testing strategy here? Could you present some of the outputs for cases that you managed to simulate? For sure some errors are tricky to simulate, but let's share the output for the easy cases at least.
There was a problem hiding this comment.
Here is the testing strategy I used.
Sources used to compose the regex
- Docker-pull common-errors documentation + errors I found on the internet (manifest unknown, pull access denied, EOF, DNS errors, no space left on device, etc.).
- OCI Distribution Spec error codes (
MANIFEST_UNKNOWN,DENIED,NAME_UNKNOWN). - Two real production CI flakes:
- [flaky e2e] suites failing on kind cluster creation with "memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused" #10296
filesystem layer verification failedon quay.io
- [Flaky E2E] failed to create cluster: failed to init node with kubeadm #10257
context deadline exceededon ghcr.io's token endpoint
- [flaky e2e] suites failing on kind cluster creation with "memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused" #10296
How the function was tested
1. Test script that fakes docker pull
Replaces docker with a stand-in we control.
Walks through every path: happy case, retry-then-success, all retries exhausted, and three "don't retry" patterns (manifest unknown, pull access denied, no space left on device).
→ Temporary errors retry up to 5× with 1/2/4/8s backoff; "don't retry" patterns fail after one attempt; happy path unchanged.
2. make test-e2e, forced to fail two ways
KUBERAY_VERSION=does-not-exist— non-retriable (tag doesn't exist).KUBERAY_IMAGEpointed at an invalid hostname — retriable, exhausts
all 5 attempts.
→ Both runs behaved as expected. Outputs are presented in the PR description.
3. Regex match against the gathered error patterns
Every error string from the "Sources" section above was run directly through grep -qiE '<regex>' — both the real CI flake outputs and the docs / OCI patterns that can't easily be reproduced locally.
→ Every non-retriable pattern matched; every retriable pattern did not. Both CI flakes (#10296 layer verification, #10257 ghcr.io token timeout) correctly fall through to retry.
|
@ivnovakov please take a look at these two cases, do you think they also have the same root cause which could be mitigated by this approach: |
|
/lgtm |
|
@mimowo: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
LGTM label has been added. DetailsGit tree hash: 1423f0e6a03d07c73451f10b24aedd59e3011a6a |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ivnovakov, mimowo The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@mimowo, in both
|
Cool, will both cases be retried with the new PR? If so, then I would close the issues along with this PR merging. |
|
@mimowo: new pull request created: #11289 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@mimowo: new pull request created: #11290 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What type of PR is this?
/kind flake
/area testing
What this PR does / why we need it:
E2E runs occasionally fail due to transient
docker pullerrors (e.g. layer verification failures).This PR adds exponential backoff retry for
docker pulline2e-common.sh.docker pullexits with status 1 for every failure regardless of cause, so a regex match against the error output detects non-retriable cases (missing manifest, auth denied, disk full) and skips retries for them.Which issue(s) this PR fixes:
Special notes for your reviewer:
Example of non-retriable error.
Example for failed retry.
Does this PR introduce a user-facing change?