Added retry for docker pull in e2e-common.sh #11238

Merged: k8s-ci-robot merged 1 commit into kubernetes-sigs:main from epam:fix/10296-docker-pull-retry on May 18, 2026

@ivnovakov
Contributor

@ivnovakov ivnovakov commented May 15, 2026

What type of PR is this?

/kind flake
/area testing

What this PR does / why we need it:

E2E runs occasionally fail due to transient docker pull errors (e.g. layer verification failures).
This PR adds exponential backoff retry for docker pull in e2e-common.sh.

docker pull exits with status 1 for every failure regardless of cause, so a regex match against the error output detects non-retriable cases (missing manifest, auth denied, disk full) and skips retries for them.
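
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function name `pull_with_retry` and the `MAX_ATTEMPTS` and `INITIAL_DELAY` variables are assumptions made for the example, while the regex is the one quoted in the review thread and the messages mirror the sample output in this description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: names and defaults are illustrative assumptions,
# not the code merged in this PR.

# Regex quoted in the review thread: errors that are pointless to retry.
NON_RETRIABLE_RE='manifest (unknown|for .* not found)|repository does not exist|not found|pull access denied|unauthorized|denied: requested access|no space left on device'

pull_with_retry() {
  local image="$1"
  local max_attempts="${MAX_ATTEMPTS:-5}"
  local delay="${INITIAL_DELAY:-1}"
  local attempt output

  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    # Capture stdout and stderr together so the error text can be inspected.
    if output="$(docker pull "$image" 2>&1)"; then
      echo "$output"
      return 0
    fi
    echo "$output"

    # docker pull exits with 1 for every failure, so classify by message.
    if echo "$output" | grep -qiE "$NON_RETRIABLE_RE"; then
      echo "ERROR: docker pull '$image' failed with a non-retriable error." >&2
      return 1
    fi

    if ((attempt < max_attempts)); then
      echo "WARNING: docker pull '$image' failed (attempt $attempt/$max_attempts). Retrying in ${delay}s..." >&2
      sleep "$delay"
      delay=$((delay * 2)) # exponential backoff: 1s, 2s, 4s, 8s
    fi
  done

  echo "ERROR: Failed to pull '$image' after $max_attempts attempts." >&2
  return 1
}
```

With the defaults assumed here, a persistently failing retriable pull waits 1 + 2 + 4 + 8 = 15 seconds in total before giving up.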

Which issue(s) this PR fixes:

Special notes for your reviewer:

Example of a non-retriable error:

Error response from daemon: failed to resolve reference "quay.io/kuberay/operator:does-not-exist": quay.io/kuberay/operator:does-not-exist: not found
ERROR: docker pull 'quay.io/kuberay/operator:does-not-exist' failed with a non-retriable error.

Example of a retriable error that exhausts all retries:

Error response from daemon: failed to resolve reference "intentionally.invalid.example.com/kuberay/operator:v1.6.1": failed to do request: Head "https://intentionally.invalid.example.com/v2/kuberay/operator/manifests/v1.6.1": dial tcp: lookup intentionally.invalid.example.com on 192.168.5.1:53: no such host
WARNING: docker pull 'intentionally.invalid.example.com/kuberay/operator:v1.6.1' failed (attempt 1/5). Retrying in 1s...
...
Error response from daemon: failed to resolve reference "intentionally.invalid.example.com/kuberay/operator:v1.6.1": failed to do request: Head "https://intentionally.invalid.example.com/v2/kuberay/operator/manifests/v1.6.1": dial tcp: lookup intentionally.invalid.example.com on 192.168.5.1:53: no such host
WARNING: docker pull 'intentionally.invalid.example.com/kuberay/operator:v1.6.1' failed (attempt 4/5). Retrying in 8s...
Error response from daemon: failed to resolve reference "intentionally.invalid.example.com/kuberay/operator:v1.6.1": failed to do request: Head "https://intentionally.invalid.example.com/v2/kuberay/operator/manifests/v1.6.1": dial tcp: lookup intentionally.invalid.example.com on 192.168.5.1:53: no such host
ERROR: Failed to pull 'intentionally.invalid.example.com/kuberay/operator:v1.6.1' after 5 attempts.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added labels May 15, 2026: release-note-none (Denotes a PR that doesn't merit a release note), kind/flake (Categorizes issue or PR as related to a flaky test), area/testing (Testing-related stuff)
@netlify

netlify Bot commented May 15, 2026

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit 10da08a
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a072351cb521e0008b40706
😎 Deploy Preview https://deploy-preview-11238--kubernetes-sigs-kueue.netlify.app

@k8s-ci-robot k8s-ci-robot added labels May 15, 2026: cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test)
@k8s-ci-robot
Contributor

Hi @ivnovakov. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 15, 2026
@ivnovakov
Contributor Author

Related to #10296.

@tenzen-y
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot changed labels May 15, 2026: added ok-to-test (Indicates a non-member PR verified by an org member that is safe to test) and removed needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test)
fi
echo "$output"

if echo "$output" | grep -qiE 'manifest (unknown|for .* not found)|repository does not exist|not found|pull access denied|unauthorized|denied: requested access|no space left on device'; then
Contributor

What was the testing strategy here? Could you present some of the outputs for cases that you managed to simulate? For sure some errors are tricky to simulate, but let's share the output for the easy cases at least.

Contributor Author

Here is the testing strategy I used.

Sources used to compose the regex

How the function was tested

1. Test script that fakes docker pull

Replaces docker with a stand-in we control.
Walks through every path: happy case, retry-then-success, all retries exhausted, and three "don't retry" patterns (manifest unknown, pull access denied, no space left on device).

→ Temporary errors retry up to 5× with 1/2/4/8s backoff; "don't retry" patterns fail after one attempt; happy path unchanged.
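
A stand-in of this kind could look roughly like the sketch below. It is not the author's test script: the helper name `make_fake_docker` and the `FAKE_COUNT_FILE`, `FAKE_FAILS`, and `FAKE_ERROR` variables are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical test helper: all names here are illustrative assumptions.

# make_fake_docker DIR
# Writes an executable named "docker" into DIR. The stub fails the first
# FAKE_FAILS invocations with FAKE_ERROR on stderr, then succeeds.
make_fake_docker() {
  local dir="$1"
  cat > "$dir/docker" <<'EOF'
#!/usr/bin/env bash
# Persist the call count in a file because every call is a new process.
count_file="${FAKE_COUNT_FILE:?set FAKE_COUNT_FILE}"
calls=$(( $(cat "$count_file" 2>/dev/null || echo 0) + 1 ))
echo "$calls" > "$count_file"
if (( calls <= ${FAKE_FAILS:-0} )); then
  echo "${FAKE_ERROR:-Error response from daemon: simulated failure}" >&2
  exit 1
fi
# $1 is the subcommand ("pull"), $2 is the image reference.
echo "Status: Downloaded newer image for $2"
EOF
  chmod +x "$dir/docker"
}
```

To exercise a retry function against it, create the stub in a temporary directory, export the three variables, and prepend that directory to `PATH` so the stub shadows the real `docker`. Setting `FAKE_FAILS=0` covers the happy path, a small value covers retry-then-success, and a value of 5 or more with a transient message covers exhausted retries.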

2. make test-e2e, forced to fail two ways

  • KUBERAY_VERSION=does-not-exist — non-retriable (tag doesn't exist).
  • KUBERAY_IMAGE pointed at an invalid hostname — retriable, exhausts
    all 5 attempts.

→ Both runs behaved as expected. Outputs are presented in the PR description.

3. Regex match against the gathered error patterns

Every error string from the "Sources" section above was run directly through grep -qiE '<regex>' — both the real CI flake outputs and the docs / OCI patterns that can't easily be reproduced locally.

→ Every non-retriable pattern matched; every retriable pattern did not. Both CI flakes (#10296 layer verification, #10257 ghcr.io token timeout) correctly fall through to retry.
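
A check along these lines can be sketched as below. The helper name `classify_pull_error` is hypothetical. The first two sample messages are taken from the PR description; the last two are illustrative strings written for this example, not captured CI logs.

```shell
#!/usr/bin/env bash
# Regex copied from the reviewed line in this PR.
NON_RETRIABLE_RE='manifest (unknown|for .* not found)|repository does not exist|not found|pull access denied|unauthorized|denied: requested access|no space left on device'

# Hypothetical helper: prints "non-retriable" or "retriable" for one message.
classify_pull_error() {
  if echo "$1" | grep -qiE "$NON_RETRIABLE_RE"; then
    echo "non-retriable"
  else
    echo "retriable"
  fi
}

# Classify one sample message per line and print the verdict next to it.
while IFS= read -r msg; do
  printf '%-14s %s\n' "$(classify_pull_error "$msg")" "$msg"
done <<'EOF'
Error response from daemon: failed to resolve reference "quay.io/kuberay/operator:does-not-exist": quay.io/kuberay/operator:does-not-exist: not found
Error response from daemon: failed to resolve reference "intentionally.invalid.example.com/kuberay/operator:v1.6.1": failed to do request: Head "https://intentionally.invalid.example.com/v2/kuberay/operator/manifests/v1.6.1": dial tcp: lookup intentionally.invalid.example.com on 192.168.5.1:53: no such host
Error response from daemon: pull access denied for example/image, repository does not exist or may require 'docker login'
write /var/lib/docker/tmp/example: no space left on device
EOF
```

Because the match is case-insensitive and unanchored, any message containing one of the listed phrases anywhere is treated as non-retriable; everything else falls through to retry.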

Contributor

Sgtm, let's give it a try

@mimowo
Contributor

mimowo commented May 15, 2026

@ivnovakov please take a look at these two cases, do you think they also have the same root cause which could be mitigated by this approach:

@mimowo
Contributor

mimowo commented May 18, 2026

/lgtm
/approve
/cherrypick release-0.17
/cherrypick release-0.16
Thank you! I think the code looks reasonable, and there was an effort to test it manually by simulating some cases. We will iterate on the approach if some issues remain. Hopefully this will solve some common CI failures.

@k8s-infra-cherrypick-robot
Contributor

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.16, release-0.17 in new PRs and assign them to you.

Details

In response to this:

/lgtm
/approve
/cherrypick release-0.17
/cherrypick release-0.16
Thank you! I think the code looks reasonable, and there was an effort to test it manually by simulating some cases. We will iterate on the approach if some issues remain. Hopefully this will solve some common CI failures.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 18, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Details

Git tree hash: 1423f0e6a03d07c73451f10b24aedd59e3011a6a

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ivnovakov, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2026
@ivnovakov
Contributor Author

ivnovakov commented May 18, 2026

@mimowo
Contributor

mimowo commented May 18, 2026

@mimowo, in both cases docker pull errors caused the job to fail.

Cool, will both cases be retried with the new PR?

If so, then I would close the issues along with this PR merging.

@k8s-ci-robot k8s-ci-robot merged commit 2e91dfe into kubernetes-sigs:main May 18, 2026
40 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.18 milestone May 18, 2026
@k8s-infra-cherrypick-robot
Contributor

@mimowo: new pull request created: #11289

@k8s-infra-cherrypick-robot
Contributor

@mimowo: new pull request created: #11290

@ivnovakov
Contributor Author

@mimowo, in both cases docker pull errors caused the job to fail.

Cool, will both cases be retried with the new PR?

If so, then I would close the issues along with this PR merging.

@mimowo, yes, they would've retried.

@ivnovakov ivnovakov deleted the fix/10296-docker-pull-retry branch May 18, 2026 15:48

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/testing: Testing-related stuff
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/flake: Categorizes issue or PR as related to a flaky test.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/S: Denotes a PR that changes 10-29 lines, ignoring generated files.

5 participants