MGMT-22370: Add exponential backoff to agent image pull #10337
yoavsc0302 wants to merge 1 commit into
When the agent's container image pull fails, systemd retries every 3 seconds forever (RestartSec=3, StartLimitInterval=0), flooding the registry. In staging, one host made 1836 attempts in 1.5 hours.
Add a wrapper script (agent-pull-image) that handles image pulling with exponential backoff (5s → 10s → 20s → 40s → 80s → 160s, capped at 300s). The script skips the pull entirely if the image is already cached locally, and it never stops retrying.
This reduces registry load from ~1200 to ~48 attempts/hour per host (~96% reduction).
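For readers who want a picture of the behavior described above, here is a minimal sketch of a pull-with-backoff wrapper. It is not the agent-pull-image script from this PR: the use of podman as the runtime, the function names `next_delay` and `pull_with_backoff`, and the variables `INITIAL_DELAY` and `MAX_DELAY` are all assumptions made for illustration.

```shell
#!/bin/sh
# Illustrative sketch only; not the script added by this PR.
# Assumptions: podman is the container runtime, and all function and
# variable names below are hypothetical.

INITIAL_DELAY=5   # seconds to wait after the first failed pull
MAX_DELAY=300     # upper bound on the wait between attempts

# Print the delay that follows the given one: double it, capped at MAX_DELAY.
next_delay() {
    doubled=$(( $1 * 2 ))
    if [ "$doubled" -gt "$MAX_DELAY" ]; then
        echo "$MAX_DELAY"
    else
        echo "$doubled"
    fi
}

# Pull the image unless it is already cached; retry forever with backoff.
pull_with_backoff() {
    image=$1
    delay=$INITIAL_DELAY

    # Skip the pull entirely when the image is already in local storage.
    if podman image exists "$image"; then
        return 0
    fi

    # Keep trying until a pull succeeds; there is no attempt limit.
    until podman pull "$image"; do
        echo "pull of $image failed; retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$(next_delay "$delay")
    done
}
```

With these values the waits between attempts are 5, 10, 20, 40, 80, and 160 seconds, then 300 seconds for every later attempt. A caller would invoke something like `pull_with_backoff "$IMAGE"` before starting the agent container, where `IMAGE` is a placeholder for the agent image reference.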
@yoavsc0302: This pull request references MGMT-22370, which is a valid Jira issue. Warning: The referenced Jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.
No actionable comments were generated in the recent review. 🎉
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml | Review profile: CHILL | Plan: Enterprise
Files selected for processing: 3
Walkthrough
This PR adds a container image pull mechanism with retry logic to the agent discovery and startup process. A new shell script with exponential backoff is defined, passed through the ignition template system, provisioned to disk, and invoked as a systemd unit pre-start step.
Changes
Agent image pull and service startup
🎯 2 (Simple) | ⏱️ ~8 minutes
Pre-merge checks: 11 passed, 1 failed (1 warning)
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: yoavsc0302
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##           master   #10337      +/-   ##
==========================================
- Coverage   44.32%   44.32%   -0.01%
==========================================
  Files         417      417
  Lines       72762    72763       +1
==========================================
- Hits        32253    32252       -1
- Misses      37589    37591       +2
  Partials     2920     2920
@yoavsc0302: The following tests failed, say
List all the issues related to this PR
What environments does this code impact?
How was this code tested?
Checklist
Reviewers Checklist
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes