
MGMT-23738: Flaky test: "Move Agent to another infraenv" fails with Kubernetes conflict error#10319

Open
shay23bra wants to merge 1 commit into openshift:master from shay23bra:MGMT-23738-Flaky-test-Move-Agent-to-another-infraenv-fails-with-Kubernetes-conflict-error

Conversation

@shay23bra
Contributor

@shay23bra shay23bra commented May 12, 2026

Summary

  • Added retry.RetryOnConflict around the Agent CR update in CreateAgentCR() to handle Kubernetes optimistic concurrency conflicts when moving an agent between infraenvs
  • Moved UpdateKubeKeyNS and log statement to after the successful K8s update to avoid DB inconsistency if retries are exhausted
  • Fixed pre-existing bug where errors.Wrapf wrapped the wrong variable in the UpdateKubeKeyNS error path
  • Added unit test verifying retry behavior on conflict using a fake client with interceptor

Test plan

  • Unit tests pass locally (go test ./internal/controller/controllers/ --ginkgo.focus="create agent CR" — 9/9 pass)
  • New test case "Already existing agent update retries on conflict" validates the retry logic
  • go build ./... and go vet pass
  • CI: pull-ci-openshift-assisted-service-master-edge-subsystem-kubeapi-aws job should validate the "Move Agent to another infraenv" subsystem test no longer flakes

Summary by CodeRabbit

  • Bug Fixes

    • Improved robustness of Agent resource updates by adding automatic retry handling for concurrent modification conflicts, reducing failures during simultaneous updates and ensuring metadata and related host updates proceed reliably.
  • Tests

    • Added test coverage that simulates update conflicts and verifies retries succeed, preventing regressions in conflict-handling behavior.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 12, 2026
@openshift-ci-robot

openshift-ci-robot commented May 12, 2026

@shay23bra: This pull request references MGMT-23738 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c16a2f21-7737-495e-ae3e-f34988775829

📥 Commits

Reviewing files that changed from the base of the PR and between a71c41e and 74bac7e.

📒 Files selected for processing (2)
  • internal/controller/controllers/crd_utils.go
  • internal/controller/controllers/crd_utils_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/controller/controllers/crd_utils_test.go
  • internal/controller/controllers/crd_utils.go

Walkthrough

Adds optimistic-concurrency conflict handling when updating an existing Agent CR: the update is retried with a fresh read to obtain the latest ResourceVersion. A unit test simulates a conflict on the first update and verifies the retry succeeds.

Changes

Agent Update Conflict Resilience

| Layer / File(s) | Summary |
| --- | --- |
| Retry-enabled Agent update entry: internal/controller/controllers/crd_utils.go | Introduces updateAgentCR helper and wraps the Agent Update in retry.RetryOnConflict. Each retry re-fetches the current Agent to get the latest ResourceVersion, rebuilds ObjectMeta (labels, owner refs, ResourceVersion), clears/reset spec fields (Approved=false, host/role fields), conditionally sets Spec.ClusterDeploymentName, and attempts the update again. |
| Imports supporting retry: internal/controller/controllers/crd_utils.go | Adds k8s.io/client-go/util/retry import to enable conflict-retry behavior. |
| Conflict-retry unit test: internal/controller/controllers/crd_utils_test.go | Adds a Ginkgo test that pre-creates Cluster/InfraEnv/Agent, wraps the fake controller-runtime client with an interceptor.Funcs Update hook that returns a Conflict error on the first Update and delegates thereafter, invokes CreateAgentCR, asserts no error, and verifies the update was attempted twice. New imports (e.g., k8serrors, schema, types, client/interceptor, fmt) support conflict simulation. |
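
The interception pattern that the unit test relies on can be illustrated without any Kubernetes dependencies. The sketch below is an assumption-laden stand-in, not the test from this PR: `updater`, `okClient`, `conflictOnce`, and `updateWithRetry` are hypothetical names that mirror the idea of an Update hook failing once with a conflict and delegating afterwards.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for a Kubernetes Conflict error (hypothetical sentinel).
var errConflict = errors.New("conflict")

// updater is a hypothetical minimal client interface.
type updater interface {
	Update(name string) error
}

// okClient always accepts the update, like a fake client with no contention.
type okClient struct{}

func (okClient) Update(string) error { return nil }

// conflictOnce wraps an updater and fails the first Update with errConflict,
// then delegates every later call to the wrapped client.
type conflictOnce struct {
	inner updater
	calls int
}

func (c *conflictOnce) Update(name string) error {
	c.calls++
	if c.calls == 1 {
		return errConflict
	}
	return c.inner.Update(name)
}

// updateWithRetry retries Update while it reports a conflict.
func updateWithRetry(c updater, name string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = c.Update(name); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

func main() {
	c := &conflictOnce{inner: okClient{}}
	err := updateWithRetry(c, "agent-1", 3)
	fmt.Println(err, c.calls) // prints "<nil> 2"
}
```

Counting calls in the wrapper is what lets a test assert both outcomes at once: the operation succeeded, and it took exactly two attempts to get there.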

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Structure And Quality | ⚠️ Warning | Test lacks meaningful assertion messages. Line 241 and 242 assertions provide no diagnostic context on failure, violating requirement #4 of the custom check. | Add failure messages to assertions: Expect(err).NotTo(HaveOccurred(), "CreateAgentCR should succeed with conflict retry") and Expect(updateCount).To(Equal(2), "update should be retried exactly once on conflict"). |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly references the issue being fixed (MGMT-23738) and clearly identifies the problem: a flaky test caused by Kubernetes conflict errors when moving agents between infraenvs. |
| Description check | ✅ Passed | The PR description is comprehensive and covers all critical sections: summary of changes, test plan with specific commands, local testing results, and CI validation status. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names are stable and deterministic with no dynamic values. The new test follows proper conventions with dynamic values confined to test bodies. |
| Microshift Test Compatibility | ✅ Passed | This is a unit test using fakeclient in internal/controller/, not an e2e test. The check only applies to e2e tests, so it is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | The test added is a unit test using a fake Kubernetes client, not an e2e test. The custom check applies only to e2e tests. The test does not make multi-node cluster assumptions. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies controller utility functions to add retry logic for Agent CR updates. No deployment manifests, pod scheduling constraints, affinity rules, or topology assumptions introduced. |
| Ote Binary Stdout Contract | ✅ Passed | No OTE Binary Stdout Contract violations found. The only fmt.Errorf call (line 231) is inside an It() test block, which is explicitly exempt from the check. Suite logging is properly configured. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This is a unit test, not an e2e test. The custom check targets e2e tests for IPv6/disconnected network compatibility. Unit tests using fake clients are not subject to these constraints. |


@openshift-ci openshift-ci Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 12, 2026
@openshift-ci openshift-ci Bot requested review from gamli75 and javipolo May 12, 2026 10:23
@openshift-ci

openshift-ci Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shay23bra

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 12, 2026

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
internal/controller/controllers/crd_utils_test.go (1)

216-243: ⚡ Quick win

Add an exhaustion-path assertion to lock in DB consistency behavior.

Consider a companion case where Update always returns conflict and assert UpdateKubeKeyNS is never called. That protects the new ordering guarantee from regressions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/controllers/crd_utils_test.go` around lines 216 - 243,
Add a new test that simulates a permanent conflict by building a fake client
(similar to conflictClient) whose Update interceptor always returns a
k8serrors.NewConflict; call NewCRDUtils(...).CreateAgentCR(...) with that client
and assert it returns an error and that
mockHostApi.EXPECT().UpdateKubeKeyNS(...) is never invoked (Times(0)) to lock in
the exhaustion-path DB-consistency behavior. Ensure you reference the same setup
symbols (CreateAgentCR, NewCRDUtils, mockHostApi, Update interceptor) so the
test mirrors the existing success-via-retry case but enforces the never-succeeds
conflict path.
internal/controller/controllers/crd_utils.go (1)

179-180: ⚡ Quick win

Wrap retry failure with host/namespace context.

At Line 180, returning the raw error makes production triage harder when retries are exhausted.

Proposed patch
-		}); err != nil {
-			return err
+		}); err != nil {
+			return errors.Wrapf(err, "failed to update Agent CR %s/%s after conflict retries", infraEnvCR.Namespace, hostId)
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/controllers/crd_utils.go` around lines 179 - 180, The
retry callback is returning the raw error when retries are exhausted; update the
return to wrap that error with host and namespace context (e.g., return
fmt.Errorf("operation failed for host %s namespace %s: %w", host, namespace,
err)) so callers get actionable context; import fmt if needed and update the
closure in crd_utils.go where the code currently does `}); err != nil { return
err }` to wrap `err` with the host/namespace identifiers used in that scope.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1c91dfdc-ce53-4041-97e0-58dd0b83e2e6

📥 Commits

Reviewing files that changed from the base of the PR and between babc11d and a71c41e.

📒 Files selected for processing (2)
  • internal/controller/controllers/crd_utils.go
  • internal/controller/controllers/crd_utils_test.go

@codecov

codecov Bot commented May 12, 2026

Codecov Report

❌ Patch coverage is 57.14286% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.34%. Comparing base (babc11d) to head (74bac7e).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/controller/controllers/crd_utils.go | 57.14% | 3 Missing and 3 partials ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #10319   +/-   ##
=======================================
  Coverage   44.33%   44.34%           
=======================================
  Files         417      417           
  Lines       72837    72844    +7     
=======================================
+ Hits        32294    32301    +7     
+ Misses      37609    37608    -1     
- Partials     2934     2935    +1     

| Files with missing lines | Coverage Δ |
| --- | --- |
| internal/controller/controllers/crd_utils.go | 72.86% <57.14%> (-0.91%) ⬇️ |

... and 3 files with indirect coverage changes


@shay23bra shay23bra changed the title May 12, 2026
From: MGMT-23738: Add RetryOnConflict to CreateAgentCR to fix flaky agent move between infraenvs
To: MGMT-23738 Flaky test: "Move Agent to another infraenv" fails with Kubernetes conflict error
@openshift-ci-robot

@shay23bra: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.


@openshift-ci-robot openshift-ci-robot removed the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 12, 2026
@shay23bra shay23bra changed the title May 12, 2026
From: MGMT-23738 Flaky test: "Move Agent to another infraenv" fails with Kubernetes conflict error
To: MGMT-23738: Flaky test: "Move Agent to another infraenv" fails with Kubernetes conflict error
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 12, 2026
@openshift-ci-robot

openshift-ci-robot commented May 12, 2026

@shay23bra: This pull request references MGMT-23738 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.


@shay23bra shay23bra force-pushed the MGMT-23738-Flaky-test-Move-Agent-to-another-infraenv-fails-with-Kubernetes-conflict-error branch from a71c41e to 74bac7e Compare May 12, 2026 12:34
@shay23bra
Contributor Author

/retest-required

@openshift-ci

openshift-ci Bot commented May 13, 2026

@shay23bra: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/edge-e2e-ai-operator-ztp | 74bac7e | link | true | /test edge-e2e-ai-operator-ztp |
| ci/prow/verify-generated-code | 74bac7e | link | true | /test verify-generated-code |
| ci/prow/edge-e2e-ai-operator-ztp-capi | 74bac7e | link | true | /test edge-e2e-ai-operator-ztp-capi |
| ci/prow/edge-e2e-ai-operator-disconnected-capi | 74bac7e | link | true | /test edge-e2e-ai-operator-disconnected-capi |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
