Merged
28 changes: 27 additions & 1 deletion hack/testing/e2e-common.sh
@@ -57,7 +57,33 @@ function e2e_docker_pull_if_needed {
         echo "Image '$image' already cached locally; skipping pull (E2E_MODE=dev, E2E_SKIP_IMAGE_RELOAD=true)"
         return 0
     fi
-    docker pull "$image"
+
+    local max_retries=5
+    local retry_delay=1
+    local attempt output
+    for attempt in $(seq 1 "$max_retries"); do
+        if output=$(docker pull "$image" 2>&1); then
+            echo "$output"
+            return 0
+        fi
+        echo "$output"
+
+        if echo "$output" | grep -qiE 'manifest (unknown|for .* not found)|repository does not exist|not found|pull access denied|unauthorized|denied: requested access|no space left on device'; then
Contributor

What was the testing strategy here? Could you present some of the outputs for cases that you managed to simulate? For sure some errors are tricky to simulate, but let's share the output for the easy cases at least.

Contributor Author

Here is the testing strategy I used.

Sources used to compose the regex

How the function was tested

1. Test script that fakes docker pull

Replaces docker with a stand-in we control.
Walks through every path: happy case, retry-then-success, all retries exhausted, and three "don't retry" patterns (manifest unknown, pull access denied, no space left on device).

→ Temporary errors are attempted up to 5 times in total, with 1/2/4/8s backoff between attempts; "don't retry" patterns fail after one attempt; happy path unchanged.
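
For readers who want to reproduce step 1, a fake-`docker` harness could look roughly like the sketch below. It is not the script used for this PR: `pull_with_retry` is a trimmed copy of the loop in the diff (cache check omitted), and `run_case`, `FAKE_FAILS`, `FAKE_ERROR`, and the sample error strings are hypothetical names and values chosen for illustration. `sleep` is stubbed so the backoff schedule is recorded instead of waited for.

```shell
#!/usr/bin/env bash
# Sketch only: stand-ins for `docker` and `sleep`, plus a trimmed copy of the
# retry loop from the diff. FAKE_* variables and run_case are hypothetical.

calls_file=$(mktemp)     # call counter; a file survives the $(...) subshell
sleeps_file=$(mktemp)    # records requested delays instead of waiting
trap 'rm -f "$calls_file" "$sleeps_file"' EXIT

# Stub: fails the first FAKE_FAILS calls with FAKE_ERROR, then succeeds.
docker() {
    local n
    n=$(( $(cat "$calls_file") + 1 ))
    echo "$n" > "$calls_file"
    if [ "$n" -le "$FAKE_FAILS" ]; then
        echo "$FAKE_ERROR"
        return 1
    fi
    echo "pulled $2"
}

# Stub: record the delay, do not actually sleep.
sleep() { echo "$1" >> "$sleeps_file"; }

# Trimmed copy of the retry logic under review (log messages dropped).
pull_with_retry() {
    local image=$1 max_retries=5 retry_delay=1 attempt output
    for attempt in $(seq 1 "$max_retries"); do
        if output=$(docker pull "$image" 2>&1); then
            echo "$output"
            return 0
        fi
        echo "$output"
        if echo "$output" | grep -qiE 'manifest (unknown|for .* not found)|repository does not exist|not found|pull access denied|unauthorized|denied: requested access|no space left on device'; then
            return 1
        fi
        if [ "$attempt" -eq "$max_retries" ]; then
            break
        fi
        sleep "$retry_delay"
        retry_delay=$((retry_delay * 2))
    done
    return 1
}

# run_case <failures> <error text>: sets RC, CALLS and SLEEPS for inspection.
run_case() {
    FAKE_FAILS=$1
    FAKE_ERROR=$2
    echo 0 > "$calls_file"
    : > "$sleeps_file"
    RC=0
    pull_with_retry "example/image:tag" > /dev/null || RC=$?
    CALLS=$(cat "$calls_file")
    SLEEPS=$(paste -sd, "$sleeps_file")
}

run_case 0 "";                                 echo "happy path:         rc=$RC calls=$CALLS sleeps=$SLEEPS"
run_case 2 "net/http: TLS handshake timeout";  echo "retry then success: rc=$RC calls=$CALLS sleeps=$SLEEPS"
run_case 99 "net/http: TLS handshake timeout"; echo "retries exhausted:  rc=$RC calls=$CALLS sleeps=$SLEEPS"
run_case 99 "manifest unknown";                echo "non-retriable:      rc=$RC calls=$CALLS sleeps=$SLEEPS"
```

With these stubs, the exhausted case should report 5 calls and the delays 1,2,4,8, while the non-retriable case should stop after a single call with no delay recorded.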

2. make test-e2e, forced to fail two ways

  • KUBERAY_VERSION=does-not-exist: non-retriable (tag doesn't exist).
  • KUBERAY_IMAGE pointed at an invalid hostname: retriable, exhausts all 5 attempts.

→ Both runs behaved as expected. Outputs are presented in the PR description.

3. Regex match against the gathered error patterns

Every error string from the "Sources" section above was run directly through grep -qiE '<regex>' — both the real CI flake outputs and the docs / OCI patterns that can't easily be reproduced locally.

→ Every non-retriable pattern matched; every retriable pattern did not. Both CI flakes (#10296 layer verification, #10257 ghcr.io token timeout) correctly fall through to retry.
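
The regex check in step 3 can be sketched as a small classifier. The regex is copied from the diff; the `classify` helper and the sample messages are assumptions for illustration and are not the strings gathered from the CI flakes or docs for this PR.

```shell
#!/usr/bin/env bash
# Sketch: classify sample error strings with the regex from the diff.
# classify and the sample messages below are hypothetical.

NON_RETRIABLE_RE='manifest (unknown|for .* not found)|repository does not exist|not found|pull access denied|unauthorized|denied: requested access|no space left on device'

# classify <message>: prints "non-retriable" if the regex matches
# (case-insensitively), otherwise "retriable".
classify() {
    if echo "$1" | grep -qiE "$NON_RETRIABLE_RE"; then
        echo "non-retriable"
    else
        echo "retriable"
    fi
}

# Illustrative samples only: one message per line.
while IFS= read -r msg; do
    printf '%-14s %s\n' "$(classify "$msg")" "$msg"
done <<'EOF'
manifest unknown
pull access denied for example/image
no space left on device
net/http: TLS handshake timeout
connection reset by peer
EOF
```

One property worth keeping in mind when extending the list: the bare `not found` alternative matches any message containing that phrase, so it already covers `manifest for .* not found` and would also classify a transient error as non-retriable if its text happened to include "not found".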

Contributor

Sgtm, let's give it a try

+            echo "ERROR: docker pull '$image' failed with a non-retriable error."
+            return 1
+        fi
+
+        if [ "$attempt" -eq "$max_retries" ]; then
+            break
+        fi
+
+        echo "WARNING: docker pull '$image' failed (attempt $attempt/$max_retries). Retrying in ${retry_delay}s..."
+        sleep "$retry_delay"
+        retry_delay=$((retry_delay * 2))
+    done
+
+    echo "ERROR: Failed to pull '$image' after $max_retries attempts."
+    return 1
 }

 function e2e_deployment_exists {