Added retry for docker pull in e2e-common.sh by ivnovakov · Pull Request #11238 · kubernetes-sigs/kueue

ivnovakov · 2026-05-15T13:44:46Z

What type of PR is this?

/kind flake
/area testing

What this PR does / why we need it:

E2E runs occasionally fail due to transient docker pull errors (e.g. layer verification failures).
This PR adds exponential backoff retry for docker pull in e2e-common.sh.

docker pull exits with status 1 for every failure regardless of cause, so a regex match against the error output detects non-retriable cases (missing manifest, auth denied, disk full) and skips retries for them.
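
The retry flow described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function name `pull_with_retry` and the `DOCKER_BIN` and `SLEEP_BIN` overrides are hypothetical and exist only so the sketch can be exercised without a real Docker daemon. The pattern is the one quoted in the review discussion of this PR, and the attempt count and delays follow the log examples in this description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; names are illustrative and not taken from e2e-common.sh.

# Failures that retrying cannot fix (pattern as quoted in the review thread).
NON_RETRIABLE_RE='manifest (unknown|for .* not found)|repository does not exist|not found|pull access denied|unauthorized|denied: requested access|no space left on device'

# Pulls image "$1", retrying transient failures with exponential backoff.
# DOCKER_BIN and SLEEP_BIN are assumed test hooks, not real docker settings.
pull_with_retry() {
    local image="$1"
    local max_attempts=5
    local delay=1
    local attempt=1
    local output

    while [ "$attempt" -le "$max_attempts" ]; do
        # The assignment keeps the exit status of the command substitution.
        if output="$("${DOCKER_BIN:-docker}" pull "$image" 2>&1)"; then
            echo "$output"
            return 0
        fi
        echo "$output"

        # docker pull exits with 1 for every failure, so inspect the text.
        if echo "$output" | grep -qiE "$NON_RETRIABLE_RE"; then
            echo "ERROR: docker pull '$image' failed with a non-retriable error." >&2
            return 1
        fi

        if [ "$attempt" -lt "$max_attempts" ]; then
            echo "WARNING: docker pull '$image' failed (attempt $attempt/$max_attempts). Retrying in ${delay}s..." >&2
            "${SLEEP_BIN:-sleep}" "$delay"
            delay=$((delay * 2))   # 1s, 2s, 4s, 8s
        fi
        attempt=$((attempt + 1))
    done

    echo "ERROR: Failed to pull '$image' after $max_attempts attempts." >&2
    return 1
}
```

With these defaults a persistently failing transient pull waits 1 + 2 + 4 + 8 = 15 seconds in total before giving up, while a non-retriable failure returns after the first attempt.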

Which issue(s) this PR fixes:

Special notes for your reviewer:

Example of a non-retriable error:

Error response from daemon: failed to resolve reference "quay.io/kuberay/operator:does-not-exist": quay.io/kuberay/operator:does-not-exist: not found
ERROR: docker pull 'quay.io/kuberay/operator:does-not-exist' failed with a non-retriable error.

Example of a retriable error that exhausts all attempts:

Error response from daemon: failed to resolve reference "intentionally.invalid.example.com/kuberay/operator:v1.6.1": failed to do request: Head "https://intentionally.invalid.example.com/v2/kuberay/operator/manifests/v1.6.1": dial tcp: lookup intentionally.invalid.example.com on 192.168.5.1:53: no such host
WARNING: docker pull 'intentionally.invalid.example.com/kuberay/operator:v1.6.1' failed (attempt 1/5). Retrying in 1s...
...
Error response from daemon: failed to resolve reference "intentionally.invalid.example.com/kuberay/operator:v1.6.1": failed to do request: Head "https://intentionally.invalid.example.com/v2/kuberay/operator/manifests/v1.6.1": dial tcp: lookup intentionally.invalid.example.com on 192.168.5.1:53: no such host
WARNING: docker pull 'intentionally.invalid.example.com/kuberay/operator:v1.6.1' failed (attempt 4/5). Retrying in 8s...
Error response from daemon: failed to resolve reference "intentionally.invalid.example.com/kuberay/operator:v1.6.1": failed to do request: Head "https://intentionally.invalid.example.com/v2/kuberay/operator/manifests/v1.6.1": dial tcp: lookup intentionally.invalid.example.com on 192.168.5.1:53: no such host
ERROR: Failed to pull 'intentionally.invalid.example.com/kuberay/operator:v1.6.1' after 5 attempts.

Does this PR introduce a user-facing change?

NONE

netlify · 2026-05-15T13:44:52Z

✅ Deploy Preview for kubernetes-sigs-kueue ready!

Name	Link
🔨 Latest commit	`10da08a`
🔍 Latest deploy log	https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a072351cb521e0008b40706
😎 Deploy Preview	https://deploy-preview-11238--kubernetes-sigs-kueue.netlify.app

To edit notification comments on pull requests, go to your Netlify project configuration.

k8s-ci-robot · 2026-05-15T13:44:57Z

Hi @ivnovakov. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ivnovakov · 2026-05-15T13:45:08Z

Related to #10296.

tenzen-y · 2026-05-15T14:07:47Z

/ok-to-test

mimowo · 2026-05-15T14:47:08Z

+        fi
+        echo "$output"
+
+        if echo "$output" | grep -qiE 'manifest (unknown|for .* not found)|repository does not exist|not found|pull access denied|unauthorized|denied: requested access|no space left on device'; then


What was the testing strategy here? Could you present some of the outputs for cases that you managed to simulate? For sure some errors are tricky to simulate, but let's share the output for the easy cases at least.

ivnovakov · 2026-05-18

Here is the testing strategy I used.

Sources used to compose the regex

Docker-pull common-errors documentation + errors I found on the internet (manifest unknown, pull access denied, EOF, DNS errors, no space left on device, etc.).

OCI Distribution Spec error codes (MANIFEST_UNKNOWN, DENIED, NAME_UNKNOWN).

Two real production CI flakes:

#10296 ([flaky e2e] suites failing on kind cluster creation with "memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused"): filesystem layer verification failed on quay.io.

#10257 ([Flaky E2E] failed to create cluster: failed to init node with kubeadm): context deadline exceeded on ghcr.io's token endpoint.

How the function was tested

1. Test script that fakes docker pull

Replaces docker with a stand-in we control.
Walks through every path: happy case, retry-then-success, all retries exhausted, and three "don't retry" patterns (manifest unknown, pull access denied, no space left on device).

→ Temporary errors are attempted up to 5 times, with 1/2/4/8s backoff between attempts; "don't retry" patterns fail after one attempt; the happy path is unchanged.
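
One way to build such a stand-in is sketched below. It is an illustration under assumptions, not the script the author used: the helper name `make_fake_docker`, the state-file layout, and the success message are all hypothetical. The generated fake fails a configurable number of times with a chosen error message and then pretends the pull succeeded, which is enough to cover the retry-then-success, exhausted-retries, and non-retriable paths.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes a fake `docker` executable into "$1" that fails
# its first "$2" invocations with message "$3" and succeeds afterwards.
make_fake_docker() {
    local dir="$1" failures="$2" message="$3"
    mkdir -p "$dir"
    printf '0\n' > "$dir/calls"
    printf '%s\n' "$failures" > "$dir/failures"
    printf '%s\n' "$message" > "$dir/message"
    # Quoted heredoc delimiter: nothing below is expanded while writing the file.
    cat > "$dir/docker" <<'EOF'
#!/usr/bin/env bash
# Fake docker: state lives in files next to this script.
state_dir="$(cd "$(dirname "$0")" && pwd)"
calls=$(( $(cat "$state_dir/calls") + 1 ))
echo "$calls" > "$state_dir/calls"
if [ "$calls" -le "$(cat "$state_dir/failures")" ]; then
    cat "$state_dir/message" >&2
    exit 1
fi
echo "fake pull of $2 succeeded"
EOF
    chmod +x "$dir/docker"
}
```

Prepending the directory to `PATH` (for example `PATH="$dir:$PATH"`) lets the function under test pick up the stand-in instead of the real CLI, and the `calls` file records how many attempts were made.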

2. make test-e2e, forced to fail two ways

KUBERAY_VERSION=does-not-exist — non-retriable (tag doesn't exist).

KUBERAY_IMAGE pointed at an invalid hostname — retriable, exhausts all 5 attempts.

→ Both runs behaved as expected. Outputs are presented in the PR description.

3. Regex match against the gathered error patterns

Every error string from the "Sources" section above was run directly through grep -qiE '<regex>' — both the real CI flake outputs and the docs / OCI patterns that can't easily be reproduced locally.

→ Every non-retriable pattern matched; every retriable pattern did not. Both CI flakes (#10296 layer verification, #10257 ghcr.io token timeout) correctly fall through to retry.
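
The check in step 3 can be reproduced with a few lines of shell. The sketch below is an illustration only: the helper name `classify_pull_error` is hypothetical, the pattern is the one quoted in the review diff, and the sample messages are error strings that appear elsewhere in this thread.

```shell
#!/usr/bin/env bash
# Pattern as quoted in the review diff of this PR.
NON_RETRIABLE_RE='manifest (unknown|for .* not found)|repository does not exist|not found|pull access denied|unauthorized|denied: requested access|no space left on device'

# Hypothetical helper: prints "non-retriable" when the message matches the
# pattern (case-insensitively) and "retriable" otherwise.
classify_pull_error() {
    if printf '%s\n' "$1" | grep -qiE "$NON_RETRIABLE_RE"; then
        echo "non-retriable"
    else
        echo "retriable"
    fi
}

classify_pull_error 'received unexpected HTTP status: 504 Gateway Time-out'   # retriable
classify_pull_error 'filesystem layer verification failed for digest sha256:c65bb0c25578bf8f2a8b87d1996dfd93a5330c03195c8f0401cf88b2e0de9210'   # retriable
classify_pull_error 'quay.io/kuberay/operator:does-not-exist: not found'      # non-retriable
classify_pull_error 'no space left on device'                                 # non-retriable
```

Note that the bare `not found` alternative already covers `manifest for .* not found`, and it would classify any other message containing that phrase as non-retriable, which may be worth keeping in mind if new error strings show up.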

mimowo · 2026-05-18

Sgtm, let's give it a try.

mimowo · 2026-05-15T14:49:50Z

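@ivnovakov please take a look at these two cases. Do you think they have the same root cause, which could be mitigated by this approach?

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/11192/pull-kueue-test-e2e-multikueue-main/2055288635766345728

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/11192/pull-kueue-test-e2e-main-1-34/2055288635565019136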

mimowo · 2026-05-18T12:09:16Z

/lgtm
/approve
/cherrypick release-0.17
/cherrypick release-0.16
Thank you! I think the code looks reasonable, and there was an effort to test it manually by simulating some cases. We will iterate on the approach if some issues remain. I hope this will solve some common CI failures.

k8s-infra-cherrypick-robot · 2026-05-18T12:09:19Z

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.16, release-0.17 in new PRs and assign them to you.

k8s-ci-robot · 2026-05-18T12:09:26Z

LGTM label has been added.

Details

Git tree hash: 1423f0e6a03d07c73451f10b24aedd59e3011a6a

k8s-ci-robot · 2026-05-18T12:09:28Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ivnovakov, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~hack/testing/OWNERS~~ [mimowo]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ivnovakov · 2026-05-18T12:10:17Z

@ivnovakov please take a look at these two cases, do you think they also have the same root cause which could be mitigated by this approach:

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/11192/pull-kueue-test-e2e-multikueue-main/2055288635766345728

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/11192/pull-kueue-test-e2e-main-1-34/2055288635565019136

@mimowo, in both cases a docker pull error caused the job to fail.

received unexpected HTTP status: 504 Gateway Time-out
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/11192/pull-kueue-test-e2e-multikueue-main/2055288635766345728#1:build-log.txt%3A979

filesystem layer verification failed for digest sha256:c65bb0c25578bf8f2a8b87d1996dfd93a5330c03195c8f0401cf88b2e0de9210
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/11192/pull-kueue-test-e2e-main-1-34/2055288635565019136#1:build-log.txt%3A1944

mimowo · 2026-05-18T12:31:48Z

@mimowo, in both cases a docker pull error caused the job to fail.

Cool, will both cases be retried with the new PR?

If so, then I would close the issues along with this PR merging.

k8s-infra-cherrypick-robot · 2026-05-18T13:06:53Z

@mimowo: new pull request created: #11289

k8s-infra-cherrypick-robot · 2026-05-18T13:07:32Z

@mimowo: new pull request created: #11290

ivnovakov · 2026-05-18T15:48:54Z

@mimowo, in both cases a docker pull error caused the job to fail.

Cool, will both cases be retried with the new PR?

If so, then I would close the issues along with this PR merging.

@mimowo, yes, both would have been retried.

Added retry for docker pull in e2e-common.sh

10da08a

k8s-ci-robot added the labels release-note-none (denotes a PR that doesn't merit a release note), kind/flake (categorizes issue or PR as related to a flaky test), and area/testing (testing-related stuff) May 15, 2026

k8s-ci-robot requested review from pajakd and sohankunkerkar May 15, 2026 13:44

k8s-ci-robot added the labels cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) and needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) May 15, 2026

k8s-ci-robot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) May 15, 2026

k8s-ci-robot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label May 15, 2026

ivnovakov mentioned this pull request May 15, 2026

[Flaky E2E] failed to create cluster: failed to init node with kubeadm #10257

Closed

mimowo reviewed May 15, 2026

mimowo mentioned this pull request May 15, 2026

Exclude kube-system from webhooks targeting common resources #11192

Open

k8s-ci-robot assigned mimowo May 18, 2026

k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) May 18, 2026

k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) May 18, 2026

k8s-ci-robot merged commit 2e91dfe into kubernetes-sigs:main May 18, 2026
40 checks passed

k8s-ci-robot added this to the v0.18 milestone May 18, 2026

k8s-infra-cherrypick-robot mentioned this pull request May 18, 2026

[release-0.17] Added retry for docker pull in e2e-common.sh #11289

Merged

k8s-infra-cherrypick-robot mentioned this pull request May 18, 2026

[release-0.16] Added retry for docker pull in e2e-common.sh #11290

Merged

ivnovakov deleted the fix/10296-docker-pull-retry branch May 18, 2026 15:48
