ARO-26557: Fix etcd bootstrap race when replicas join after quorum #8425

Open
patriksuba wants to merge 1 commit into openshift:main from patriksuba:aro-26557-etcd-bootstrap-race

Conversation

@patriksuba

@patriksuba patriksuba commented May 5, 2026

What this PR does / why we need it:

Parallel etcd pods and delayed PVCs allow a majority to bootstrap before every replica starts. A late replica with an empty data dir must then join through member add and start with initial-cluster-state set to existing instead of new; otherwise it can form a cluster with a divergent identity.

Change the reset-member init container to detect that case (member list shows at least quorum-min members but does not include this hostname) and add the member before starting etcd.

Which issue(s) this PR fixes:

Fixes ARO-26557

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Improvements

    • Init logic for etcd membership now exposes expected-member and quorum values to startup; pods will attempt dynamic joins only when observed members meet the quorum and are below the expected size, refuse to grow past expected size, and log or skip attempts when quorum/env vars are insufficient.
  • Tests

    • Added unit tests validating quorum-size calculation and quorum-check behavior for dynamic join decisions.
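
For illustration, the quorum-size calculation that these tests cover can be sketched in shell as a simple majority, mirroring the EXPECTED / 2 + 1 expression used in the init script. The function name quorum_min_members is hypothetical and is not taken from the PR; this is a sketch under that assumption, not the PR's Go helper.

```shell
# Hypothetical helper (name not from the PR): majority quorum for a cluster
# of the given expected size, using integer division.
quorum_min_members() {
  echo $(( $1 / 2 + 1 ))
}

# Print the quorum for a few common etcd cluster sizes.
for size in 1 3 5; do
  echo "expected=${size} quorum=$(quorum_min_members "${size}")"
done
```

With integer division, a single-member cluster needs 1 member, a 3-member cluster needs 2, and a 5-member cluster needs 3.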

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 5, 2026
@openshift-ci-robot

openshift-ci-robot commented May 5, 2026

@patriksuba: This pull request references ARO-26557 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added do-not-merge/needs-area area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release labels May 5, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: patriksuba
Once this PR has been reviewed and has the lgtm label, please assign jparrill for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Contributor

coderabbitai Bot commented May 5, 2026

Walkthrough

This change adds quorum-size helper functions and tests, surfaces ETCD_EXPECTED_MEMBER_COUNT and ETCD_QUORUM_MIN_MEMBERS from adaptStatefulSet into the reset-member initContainer, and modifies the reset-member script to gather MEMBER_LIST/MEMBER_ID and conditionally run etcdctl member add only when current members meet the quorum and remain below the expected cluster size; otherwise it logs refusal, uses static bootstrap, or skips the straggler join.

Sequence Diagram(s)

sequenceDiagram
  participant HostedControlPlaneController
  participant reset-member_initContainer
  participant EtcdCluster
  HostedControlPlaneController->>reset-member_initContainer: set env ETCD_EXPECTED_MEMBER_COUNT, ETCD_QUORUM_MIN_MEMBERS
  reset-member_initContainer->>EtcdCluster: run etcdctl member list -> MEMBER_LIST
  reset-member_initContainer->>reset-member_initContainer: compute MEMBER_ID, MEMBER_COUNT
  alt MEMBER_COUNT >= ETCD_QUORUM_MIN_MEMBERS AND MEMBER_COUNT < ETCD_EXPECTED_MEMBER_COUNT
    reset-member_initContainer->>EtcdCluster: etcdctl member add <peerURL>
  else MEMBER_COUNT >= ETCD_EXPECTED_MEMBER_COUNT
    reset-member_initContainer->>reset-member_initContainer: log "refusing to grow cluster"
  else MEMBER_COUNT < ETCD_QUORUM_MIN_MEMBERS
    reset-member_initContainer->>reset-member_initContainer: log "static bootstrap (new) path"
  end
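
The three branches of the diagram can be sketched as a small shell function. This is a minimal illustration under stated assumptions: the function name straggler_join_decision and the returned labels are hypothetical, and the real init script logs and runs etcdctl instead of printing a label.

```shell
# Hypothetical function (not from the PR): print the action the init script
# would take for a replica that is not yet in the member list.
# $1 = observed member count, $2 = quorum minimum, $3 = expected cluster size
straggler_join_decision() {
  if [ "$1" -ge "$3" ]; then
    echo "refuse"        # cluster is already at or above the expected size
  elif [ "$1" -ge "$2" ]; then
    echo "member-add"    # quorum exists and there is room for this member
  else
    echo "static-new"    # below quorum: static bootstrap (new) path
  fi
}

straggler_join_decision 2 2 3   # two of three members are up
straggler_join_decision 3 2 3   # cluster is already full
straggler_join_decision 1 2 3   # no quorum yet
```

Checking the upper bound first keeps the member-add branch equivalent to MEMBER_COUNT >= ETCD_QUORUM_MIN_MEMBERS AND MEMBER_COUNT < ETCD_EXPECTED_MEMBER_COUNT.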
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: fixing an etcd bootstrap race condition when replicas join after quorum is established. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR adds table-driven Go tests with entirely static, deterministic test case names. No dynamic values, timestamps, UUIDs, or generated identifiers in test titles. |
| Test Structure And Quality | ✅ Passed | Test file follows Go standard testing (not Ginkgo). Each test has single responsibility, meaningful assertion messages, and consistent patterns with existing codebase tests. |
| Microshift Test Compatibility | ✅ Passed | PR adds only standard Go unit tests (not Ginkgo e2e tests) with no MicroShift-incompatible APIs or assumptions. No Ginkgo e2e tests present, so check is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added in this PR. The added test file (membership_test.go) contains only standard Go unit tests using func TestXxx(t *testing.T) pattern, not e2e Ginkgo tests. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Changes use topology-aware DefaultReplicas() for etcd (returns 1 for SNO, 3 otherwise). No affinity, nodeSelector, or problematic PDB constraints introduced. |
| Ote Binary Stdout Contract | ✅ Passed | PR modifies control-plane-operator controller code and standard Go unit tests; not OTE infrastructure. No stdout writes in process-level code or init functions detected. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR contains no Ginkgo e2e tests. Only standard Go unit tests (membership_test.go) added, which use testing.T and Gomega as assertion library, not Ginkgo framework. |

@codecov

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 43.47826% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.40%. Comparing base (36dfb1b) to head (efbc1d6).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...trollers/hostedcontrolplane/v2/etcd/statefulset.go | 0.00% | 13 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8425   +/-   ##
=======================================
  Coverage   40.40%   40.40%           
=======================================
  Files         755      756    +1     
  Lines       93235    93257   +22     
=======================================
+ Hits        37675    37685   +10     
- Misses      52858    52870   +12     
  Partials     2702     2702           
| Files with missing lines | Coverage Δ |
|---|---|
| ...ntrollers/hostedcontrolplane/v2/etcd/membership.go | 100.00% <100.00%> (ø) |
| ...trollers/hostedcontrolplane/v2/etcd/statefulset.go | 37.12% <0.00%> (-3.72%) ⬇️ |

| Flag | Coverage Δ |
|---|---|
| cmd-support | 34.44% <ø> (ø) |
| cpo-hostedcontrolplane | 41.77% <43.47%> (+<0.01%) ⬆️ |
| cpo-other | 40.31% <ø> (ø) |
| hypershift-operator | 50.72% <ø> (ø) |
| other | 31.54% <ø> (ø) |

Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml (1)

228-248: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid re-querying membership after the initial successful probe.

At Line 240, a second etcdctl member list failure is masked (|| true), which can set MEMBER_COUNT=0 and force the Line 247 static "new" path even after Line 228 succeeded. That can reintroduce divergent bootstrap behavior during transient API/network blips.

Use one captured member list snapshot from the successful call and derive both MEMBER_ID and MEMBER_COUNT from it; if parsing fails, retry/fail init instead of falling back to "new".

Proposed patch
-            if etcdctl member list; then
+            if MEMBER_LIST="$(etcdctl member list -w simple)"; then
               echo "Cluster is functional"
-              MEMBER_ID=$(etcdctl member list -w simple | grep "${HOSTNAME}" | awk -F, '{ print $1 }')
+              MEMBER_ID=$(printf '%s\n' "${MEMBER_LIST}" | awk -F', ' -v host="${HOSTNAME}" '$3 == host { print $1 }')
               if [[ -n "${MEMBER_ID}" ]]; then
                 echo "A member with this name (${HOSTNAME}) already exists, removing"
                 etcdctl member remove "${MEMBER_ID}"
                 echo "Adding new member"
                 etcdctl member add ${HOSTNAME} --peer-urls https://${HOSTNAME}.etcd-discovery.${NAMESPACE}.svc:2380
                 echo "existing" > /etc/etcd/clusterstate/state
               else
                 echo "A member does not exist with name (${HOSTNAME}), evaluating straggler join"
                 EXPECTED="${ETCD_EXPECTED_MEMBER_COUNT:-3}"
-                MEMBER_COUNT=$(etcdctl member list -w simple | grep -c . || true)
+                MEMBER_COUNT=$(printf '%s\n' "${MEMBER_LIST}" | grep -c .)
                 QUORUM="${ETCD_QUORUM_MIN_MEMBERS:-$(( EXPECTED / 2 + 1 ))}"
                 if [[ "${MEMBER_COUNT}" -ge "${QUORUM}" ]]; then
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml`
around lines 228 - 248, Capture the output of the successful etcdctl member list
probe into a single variable and reuse it for both MEMBER_ID and MEMBER_COUNT
instead of re-running etcdctl (replace the separate calls that set MEMBER_ID
with a parse of that snapshot and replace MEMBER_COUNT=$(etcdctl member list -w
simple | grep -c . || true) with a count derived from the same snapshot); if
parsing the snapshot to obtain MEMBER_ID or MEMBER_COUNT fails, retry the probe
a bounded number of times or fail initialization instead of falling back to the
static "new" path; update the code paths that reference MEMBER_ID, MEMBER_COUNT,
HOSTNAME, NAMESPACE, ETCD_EXPECTED_MEMBER_COUNT and ETCD_QUORUM_MIN_MEMBERS to
read from the parsed snapshot and add appropriate retry/fail logic.
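
The bounded retry mentioned in this prompt could look roughly like the sketch below. The helper name retry_bounded and its argument order are hypothetical; nothing here is taken from the PR, and the etcdctl call is shown only in a comment because it needs a live cluster.

```shell
# Hypothetical helper (not from the PR): run a command up to $1 times,
# sleeping $2 seconds between attempts. Returns 0 on the first success and
# 1 if every attempt fails. Uses the plain variables attempts, delay and i.
retry_bounded() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$(( i + 1 ))
    # Sleep only when another attempt will follow.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Possible use in an init script (assumption, untested against a cluster):
#   MEMBER_LIST="$(retry_bounded 5 2 etcdctl member list -w simple)" || exit 1

retry_bounded 3 0 true && echo "probe succeeded"
retry_bounded 2 0 false || echo "gave up after 2 attempts"
```

Failing the init container after the last attempt, rather than falling through to the static "new" path, lets Kubernetes restart it and probe again.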

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 17a637b3-3f78-48e6-910c-3710a4c6befe

📥 Commits

Reviewing files that changed from the base of the PR and between e09cc2d and 60e67fe.

⛔ Files ignored due to path filters (5)
  • control-plane-operator/controllers/hostedcontrolplane/testdata/etcd/AROSwift/zz_fixture_TestControlPlaneComponents_etcd_statefulset.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/etcd/GCP/zz_fixture_TestControlPlaneComponents_etcd_statefulset.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/etcd/IBMCloud/zz_fixture_TestControlPlaneComponents_etcd_statefulset.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/etcd/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_etcd_statefulset.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/etcd/zz_fixture_TestControlPlaneComponents_etcd_statefulset.yaml is excluded by !**/testdata/**
📒 Files selected for processing (4)
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/membership.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/membership_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go

@patriksuba patriksuba force-pushed the aro-26557-etcd-bootstrap-race branch from 60e67fe to 26b1ba3 Compare May 15, 2026 14:13
@patriksuba
Author

/test all

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml`:
- Around line 240-247: The current logic calls etcdctl member add when
MEMBER_COUNT >= ETCD_QUORUM_MIN_MEMBERS, but it doesn't ensure the live cluster
is still smaller than the desired ETCD_EXPECTED_MEMBER_COUNT; update the
conditional that decides to run etcdctl member add (the block referencing
MEMBER_COUNT, ETCD_QUORUM_MIN_MEMBERS, ETCD_EXPECTED_MEMBER_COUNT and etcdctl
member add ${HOSTNAME} --peer-urls ...) to require both that MEMBER_COUNT >=
ETCD_QUORUM_MIN_MEMBERS and MEMBER_COUNT < ETCD_EXPECTED_MEMBER_COUNT before
adding this member, otherwise fall back to the static bootstrap path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5dcf96e9-c079-427e-a61f-f8e24e391aa2

📥 Commits

Reviewing files that changed from the base of the PR and between 60e67fe and 26b1ba3.

🚧 Files skipped from review as they are similar to previous changes (3)
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/membership_test.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/membership.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go

@patriksuba patriksuba closed this May 15, 2026
@patriksuba patriksuba force-pushed the aro-26557-etcd-bootstrap-race branch from 26b1ba3 to eaeb257 Compare May 15, 2026 15:53
@patriksuba patriksuba reopened this May 15, 2026
@patriksuba patriksuba marked this pull request as ready for review May 18, 2026 09:19
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 18, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and sjenning May 18, 2026 09:20
@patriksuba
Author

/test e2e-azure-v2-self-managed

@openshift-ci
Contributor

openshift-ci Bot commented May 20, 2026

@patriksuba: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-azure-v2-self-managed | eaeb257 | link | false | /test e2e-azure-v2-self-managed |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

Good — PR #8425 only touches etcd-related files. It has nothing to do with the missing create-guests binary. Now I have the complete picture for both failures. Let me write the final report.

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-azure-v2-self-managed
  • Build ID: 2057066384487092224
  • Target: e2e-azure-v2-self-managed
  • Job Type: presubmit (optional)
  • PR: #8425 (ARO-26557: Fix etcd bootstrap race when replicas join after quorum)
  • Start: 2026-05-20T11:50:09Z
  • Duration: ~57 minutes
  • Konflux Job: hypershift-operator-main-on-pull-request; failed due to an unrelated image pull error (quay.io/konflux-ci/hermeto:0.47.0, ImagePullBackOff)

Test Failure Analysis

Error

/bin/bash: line 11: /hypershift/bin/create-guests: No such file or directory
exit status 127

Summary

Neither CI failure is caused by PR #8425's code changes. The Prow job (e2e-azure-v2-self-managed) fails because the CI step hypershift-azure-create-selfmanaged-guests invokes /hypershift/bin/create-guests, a binary that does not exist in the hypershift-tests container image and has never been built on the main branch. The CI workflow (openshift/release PR #77674, merged May 8) was deployed before the companion code PR (#8527, CNTRLPLANE-3222) that introduces the create-guests, destroy-guests, and dump-guests Go source code and the corresponding Dockerfile.e2e / Makefile changes. The Konflux failure is a transient infrastructure issue (image pull backoff for quay.io/konflux-ci/hermeto:0.47.0).

Root Cause

Prow job (e2e-azure-v2-self-managed): The workflow hypershift-azure-e2e-v2-self-managed was introduced in openshift/release via two PRs (#77674 on May 8, #79347 on May 18). These PRs created CI step definitions that invoke three binaries from the hypershift-tests image:

  • /hypershift/bin/create-guests (pre phase — hypershift-azure-create-selfmanaged-guests)
  • /hypershift/bin/dump-guests (post phase — dump step)
  • /hypershift/bin/destroy-guests (post phase — destroy step)

However, the source code for these binaries (test/e2e/v2/cmd/create-guests/main.go, test/e2e/v2/cmd/destroy-guests/main.go, test/e2e/v2/cmd/dump-guests/main.go) and the Dockerfile.e2e / Makefile changes to build and package them exist only in the still-open PR openshift/hypershift#8527 (CNTRLPLANE-3222: Port v1 lifecycle tests to v2 Ginkgo framework), which has not been merged into main.

This means the e2e-azure-v2-self-managed job is broken on main for every PR — it is not a regression introduced by PR #8425. The job is marked optional: true and always_run: false, so it does not gate merges.

Konflux job (hypershift-operator-main-on-pull-request): The prefetch-dependencies task failed with TaskRunImagePullFailed because the image quay.io/konflux-ci/hermeto:0.47.0 could not be pulled (back-off). This is a transient Konflux infrastructure issue completely unrelated to the PR's code changes.

Recommendations
  1. Both failures can be safely ignored for PR #8425 (ARO-26557: Fix etcd bootstrap race when replicas join after quorum): neither is caused by the etcd bootstrap race fix. The PR only modifies etcd statefulset assets and membership logic.

  2. To fix the Prow job: Merge PR #8527 (CNTRLPLANE-3222), which adds the missing create-guests, destroy-guests, and dump-guests source code along with the Dockerfile.e2e and Makefile changes needed to build them into the hypershift-tests image.

  3. For the Konflux failure: No action needed — retry will likely succeed once the transient image pull issue resolves. This is an infrastructure issue with the quay.io/konflux-ci/hermeto image.

  4. Consider disabling the broken job: Until PR #8527 (CNTRLPLANE-3222: Port v1 lifecycle tests to v2 Ginkgo framework) merges, the e2e-azure-v2-self-managed job will fail on every PR it runs against. Since it's already optional: true, this is low impact, but it adds noise to PR CI results.

Evidence
| Evidence | Detail |
|---|---|
| Failing step | hypershift-azure-create-selfmanaged-guests (pre phase) |
| Error | /bin/bash: line 11: /hypershift/bin/create-guests: No such file or directory (exit code 127) |
| Step image | hypershift-tests (built from Dockerfile.e2e) |
| Missing binaries | create-guests, dump-guests, destroy-guests; not in Dockerfile.e2e on main |
| Source code location | test/e2e/v2/cmd/create-guests/main.go; exists only in unmerged PR #8527 |
| CI workflow PRs | openshift/release #77674 (May 8) and #79347 (May 18) |
| Job config | optional: true, always_run: false; does not gate merges |
| PR #8425 files changed | Only etcd-related: v2/etcd/membership.go, v2/etcd/statefulset.go, v2/assets/etcd/statefulset.yaml, test fixtures |
| Konflux failure | TaskRunImagePullFailed: quay.io/konflux-ci/hermeto:0.47.0 image pull backoff (infrastructure issue) |
| Also failing | post steps dump and destroy-guests; same missing binary issue (dump-guests, destroy-guests) |

Parallel etcd pods and delayed PVCs allow a majority to bootstrap before every replica starts. A late replica with an empty data dir must
use member add and existing initial-cluster-state instead of new, or it can form a divergent cluster identity.

Change the reset-member init to detect that case (member list shows at least quorum-min members but not this hostname) and add the member
before starting etcd. Prevent adding etcd cluster members beyond the expected limit.
@patriksuba patriksuba force-pushed the aro-26557-etcd-bootstrap-race branch from eaeb257 to efbc1d6 Compare May 21, 2026 13:33
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/v2/etcd/membership.go (1)

16-21: ⚡ Quick win

resetMemberStragglerJoinQuorumMet has no production caller — logic is duplicated in bash.

This helper exists only to be exercised by membership_test.go. The equivalent decision (MEMBER_COUNT >= ETCD_QUORUM_MIN_MEMBERS && MEMBER_COUNT < ETCD_EXPECTED_MEMBER_COUNT) is reimplemented in assets/etcd/statefulset.yaml. Testing a helper that no production code path uses gives false confidence: the bash branch can drift without any test failure.

Two reasonable options:

  • Drop this helper (and its tests) and add a bash-level test for the script instead, or
  • Use it from adaptStatefulSet to project the boolean (or use it to document/derive the env var contract), so the Go logic and bash logic stay tied to a single source of truth.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@control-plane-operator/controllers/hostedcontrolplane/v2/etcd/membership.go`
around lines 16 - 21, The helper resetMemberStragglerJoinQuorumMet is
unreferenced in production and duplicates logic in the bash manifest; either
remove the helper and its membership_test.go tests and add a test that validates
the bash script/manifest behavior, or (preferred) call
resetMemberStragglerJoinQuorumMet from adaptStatefulSet so the code computes the
same boolean used to set the ETCD env var/manifest value and membership_test.go
stays as unit coverage for the real logic—update adaptStatefulSet to invoke
resetMemberStragglerJoinQuorumMet(expectedClusterSize, memberCount) (or adjust
parameter order to match) and use its return to derive/set the environment
variable or field written to the StatefulSet, and remove the duplicated
hard-coded expression from assets/etcd/statefulset.yaml generation so there is a
single source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml`:
- Around line 230-231: The current MEMBER_ID extraction uses grep "${HOSTNAME}"
which can match substrings/regex and return wrong rows; replace that grep-based
extraction by selecting the exact member-name column from MEMBER_LIST using awk
to compare field 3 to HOSTNAME (i.e., use MEMBER_LIST assignment and then
compute MEMBER_ID with awk -F', ' -v name="${HOSTNAME}" '$3 == name { print $1 }');
update the lines referencing MEMBER_LIST and MEMBER_ID (the two lines that set
MEMBER_LIST and MEMBER_ID) so MEMBER_ID is derived with the exact-match awk
approach to avoid partial/regex matches.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3dc7c258-03b8-4d2a-be4e-84061ec222e5

📥 Commits

Reviewing files that changed from the base of the PR and between eaeb257 and efbc1d6.

Comment on lines +230 to +231
MEMBER_LIST=$(etcdctl member list -w simple)
MEMBER_ID=$(echo "${MEMBER_LIST}" | grep "${HOSTNAME}" | awk -F, '{ print $1 }')
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

grep "${HOSTNAME}" is a substring/regex match across the member list.

grep "${HOSTNAME}" matches any line containing HOSTNAME, and treats it as a regex. For etcd member names sharing prefixes (e.g. etcd-1 would also match a line for etcd-10/etcd-11), this can pick up the wrong row and yield a bogus MEMBER_ID (or a multi-line value that breaks the subsequent etcdctl member remove). Today HyperShift caps etcd at a small replica count so the collision is unlikely in practice, but since this line is being modified it’s cheap to harden.

Two safer options:

  • Anchor the match on the member-list column (with -w simple, fields are separated by ", " and the member name is field 3): awk -F', ' -v name="${HOSTNAME}" '$3 == name { print $1 }'
  • Or use grep -F -w -- "${HOSTNAME}" to disable regex and require a whole-word match (note: -w still treats - as a word boundary, so prefer the awk form for full safety).
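
The difference between the two approaches can be seen with a small sketch against sample data. The member IDs, URLs, and the `HOSTNAME_UNDER_TEST` variable are made up for illustration. The sketch assumes the `-w simple` output separates fields with a comma followed by a space, which is why the awk field separator is `', '` rather than a bare comma.

```shell
#!/bin/sh
# Made-up sample of `etcdctl member list -w simple` output (assumed ", " separator).
MEMBER_LIST='1111111111111111, started, etcd-1, https://etcd-1.example:2380, https://etcd-1.example:2379, false
aaaaaaaaaaaaaaaa, started, etcd-10, https://etcd-10.example:2380, https://etcd-10.example:2379, false'
HOSTNAME_UNDER_TEST='etcd-1'

# Substring/regex match: "etcd-1" also matches the etcd-10 row, so two IDs come back.
GREP_IDS=$(printf '%s\n' "${MEMBER_LIST}" | grep "${HOSTNAME_UNDER_TEST}" | awk -F, '{ print $1 }')

# Exact comparison on the name column: only the etcd-1 row is selected.
AWK_IDS=$(printf '%s\n' "${MEMBER_LIST}" | awk -F', ' -v name="${HOSTNAME_UNDER_TEST}" '$3 == name { print $1 }')

echo "grep matched $(printf '%s\n' "${GREP_IDS}" | wc -l | tr -d ' ') row(s)"
echo "awk matched: ${AWK_IDS}"
```

With a bare comma as the separator, the third field would carry a leading space and the equality test would never match, so the separator choice matters as much as the exact comparison.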
♻️ Proposed change
-              MEMBER_LIST=$(etcdctl member list -w simple)
-              MEMBER_ID=$(echo "${MEMBER_LIST}" | grep "${HOSTNAME}" | awk -F, '{ print $1 }')
+              MEMBER_LIST=$(etcdctl member list -w simple)
+              MEMBER_ID=$(echo "${MEMBER_LIST}" | awk -F', ' -v name="${HOSTNAME}" '$3 == name { print $1 }')
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
MEMBER_LIST=$(etcdctl member list -w simple)
MEMBER_ID=$(echo "${MEMBER_LIST}" | grep "${HOSTNAME}" | awk -F, '{ print $1 }')
MEMBER_LIST=$(etcdctl member list -w simple)
MEMBER_ID=$(echo "${MEMBER_LIST}" | awk -F', ' -v name="${HOSTNAME}" '$3 == name { print $1 }')
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml`
around lines 230 - 231, The current MEMBER_ID extraction uses grep "${HOSTNAME}"
which can match substrings/regex and return wrong rows; replace that grep-based
extraction by selecting the exact member-name column from MEMBER_LIST using awk
to compare field 3 to HOSTNAME (i.e., use MEMBER_LIST assignment and then
compute MEMBER_ID with awk -F', ' -v name="${HOSTNAME}" '$3 == name { print $1 }');
update the lines referencing MEMBER_LIST and MEMBER_ID (the two lines that set
MEMBER_LIST and MEMBER_ID) so MEMBER_ID is derived with the exact-match awk
approach to avoid partial/regex matches.

@patriksuba
Author

/assign @jparrill

Labels

area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

3 participants