MGMT-23738: Flaky test: "Move Agent to another infraenv" fails with Kubernetes conflict error#10319
Conversation
|
@shay23bra: This pull request references MGMT-23738 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
WalkthroughAdds optimistic-concurrency conflict handling when updating an existing Agent CR: the update is retried with a fresh read to obtain the latest ChangesAgent Update Conflict Resilience
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: shay23bra The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
internal/controller/controllers/crd_utils_test.go (1)
216-243: ⚡ Quick winAdd an exhaustion-path assertion to lock in DB consistency behavior.
Consider a companion case where
Updatealways returns conflict and assertUpdateKubeKeyNSis never called. That protects the new ordering guarantee from regressions.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controller/controllers/crd_utils_test.go` around lines 216 - 243, Add a new test that simulates a permanent conflict by building a fake client (similar to conflictClient) whose Update interceptor always returns a k8serrors.NewConflict; call NewCRDUtils(...).CreateAgentCR(...) with that client and assert it returns an error and that mockHostApi.EXPECT().UpdateKubeKeyNS(...) is never invoked (Times(0)) to lock in the exhaustion-path DB-consistency behavior. Ensure you reference the same setup symbols (CreateAgentCR, NewCRDUtils, mockHostApi, Update interceptor) so the test mirrors the existing success-via-retry case but enforces the never-succeeds conflict path.internal/controller/controllers/crd_utils.go (1)
179-180: ⚡ Quick winWrap retry failure with host/namespace context.
At Line 180, returning the raw error makes production triage harder when retries are exhausted.
Proposed patch
- }); err != nil { - return err + }); err != nil { + return errors.Wrapf(err, "failed to update Agent CR %s/%s after conflict retries", infraEnvCR.Namespace, hostId) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controller/controllers/crd_utils.go` around lines 179 - 180, The retry callback is returning the raw error when retries are exhausted; update the return to wrap that error with host and namespace context (e.g., return fmt.Errorf("operation failed for host %s namespace %s: %w", host, namespace, err)) so callers get actionable context; import fmt if needed and update the closure in crd_utils.go where the code currently does `}); err != nil { return err }` to wrap `err` with the host/namespace identifiers used in that scope.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@internal/controller/controllers/crd_utils_test.go`:
- Around line 216-243: Add a new test that simulates a permanent conflict by
building a fake client (similar to conflictClient) whose Update interceptor
always returns a k8serrors.NewConflict; call NewCRDUtils(...).CreateAgentCR(...)
with that client and assert it returns an error and that
mockHostApi.EXPECT().UpdateKubeKeyNS(...) is never invoked (Times(0)) to lock in
the exhaustion-path DB-consistency behavior. Ensure you reference the same setup
symbols (CreateAgentCR, NewCRDUtils, mockHostApi, Update interceptor) so the
test mirrors the existing success-via-retry case but enforces the never-succeeds
conflict path.
In `@internal/controller/controllers/crd_utils.go`:
- Around line 179-180: The retry callback is returning the raw error when
retries are exhausted; update the return to wrap that error with host and
namespace context (e.g., return fmt.Errorf("operation failed for host %s
namespace %s: %w", host, namespace, err)) so callers get actionable context;
import fmt if needed and update the closure in crd_utils.go where the code
currently does `}); err != nil { return err }` to wrap `err` with the
host/namespace identifiers used in that scope.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1c91dfdc-ce53-4041-97e0-58dd0b83e2e6
📒 Files selected for processing (2)
internal/controller/controllers/crd_utils.gointernal/controller/controllers/crd_utils_test.go
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #10319 +/- ##
=======================================
Coverage 44.33% 44.34%
=======================================
Files 417 417
Lines 72837 72844 +7
=======================================
+ Hits 32294 32301 +7
+ Misses 37609 37608 -1
- Partials 2934 2935 +1
🚀 New features to boost your workflow:
|
|
@shay23bra: No Jira issue is referenced in the title of this pull request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@shay23bra: This pull request references MGMT-23738 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
…ove between infraenvs
a71c41e to
74bac7e
Compare
|
/retest-required |
|
@shay23bra: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
retry.RetryOnConflictaround the Agent CR update inCreateAgentCR()to handle Kubernetes optimistic concurrency conflicts when moving an agent between infraenvsUpdateKubeKeyNSand log statement to after the successful K8s update to avoid DB inconsistency if retries are exhaustederrors.Wrapfwrapped the wrong variable in theUpdateKubeKeyNSerror pathTest plan
go test ./internal/controller/controllers/ --ginkgo.focus="create agent CR"— 9/9 pass)go build ./...andgo vetpasspull-ci-openshift-assisted-service-master-edge-subsystem-kubeapi-awsjob should validate the "Move Agent to another infraenv" subsystem test no longer flakesSummary by CodeRabbit
Bug Fixes
Tests