Skip to content

MGMT-24352: Reset finalizing timeout on transition from installing-pending-user-action#10293

Open
carbonin wants to merge 1 commit into
openshift:masterfrom
carbonin:pending-user-action-finalizing-reset-timeout
Open

MGMT-24352: Reset finalizing timeout on transition from installing-pending-user-action#10293
carbonin wants to merge 1 commit into
openshift:masterfrom
carbonin:pending-user-action-finalizing-reset-timeout

Conversation

@carbonin
Copy link
Copy Markdown
Member

@carbonin carbonin commented May 8, 2026

When a non-essential worker gets stuck in installing-pending-user-action the rest of the cluster can still progress to finalizing. The cluster-level Done finalization timeout (70 min) can then expire before the per-host timeout chain completes (40 min Rebooting + 60 min pending-user-action = 100 min). In this situation, when the stuck host moves from installing-pending-user-action to either error or installing-in-progress the entire cluster will fail instead of evicting the stuck host and completing with the remaining nodes.

This commit changesthe cluster state transition so it will reset progress_finalizing_stage_started_at and progress_finalizing_stage_timed_out when transitioning FROM InstallingPendingUserAction TO Finalizing. This gives the cluster a fresh 70-minute timeout window, allowing sufficient time for the host to either finish installation (if the boot issue was fixed) or error (in which case the cluster will succeed without it).

List all the issues related to this PR

Resolves https://redhat.atlassian.net/browse/MGMT-24352

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

WIP - will test manually

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see [CONTRIBUTING] guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Fixed cluster finalization stage timeout tracking to properly reset when transitioning from a pending user action state. This ensures clusters can complete the finalization process without prematurely timing out.

…ction

When a non-essential worker gets stuck in installing-pending-user-action
the rest of the cluster can still progress to finalizing. The cluster-level Done
finalization timeout (70 min) can then expire before the per-host timeout chain
completes (40 min Rebooting + 60 min pending-user-action = 100 min).
In this situation, when the stuck host moves from `installing-pending-user-action`
to either `error` or `installing-in-progress` the entire cluster will fail
instead of evicting the stuck host and completing with the remaining nodes.

This commit changesthe cluster state transition so it will reset
progress_finalizing_stage_started_at and progress_finalizing_stage_timed_out
when transitioning FROM InstallingPendingUserAction TO Finalizing. This gives
the cluster a fresh 70-minute timeout window, allowing sufficient time for
the host to either finish installation (if the boot issue was fixed) or
error (in which case the cluster will succeed without it).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Resolves https://redhat.atlassian.net/browse/MGMT-24352
@carbonin carbonin added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 8, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 8, 2026

Walkthrough

The PR adds logic to reset finalizing-stage timeout tracking when a cluster transitions from InstallingPendingUserAction to Finalizing status. The implementation sets the started-at timestamp to now and clears the timed-out flag. Corresponding test cases verify this behavior occurs only for the specified transition path.

Changes

Finalizing-Stage Timeout Reset on Transition

Layer / File(s) Summary
Core Logic Implementation
internal/cluster/transition.go
When transitioning to Finalizing from InstallingPendingUserAction, progress_finalizing_stage_started_at is reset to time.Now() and progress_finalizing_stage_timed_out is set to false.
Test Coverage
internal/cluster/transition_test.go
New test suite Finalizing stage timeout reset validates that the timeout fields are reset for InstallingPendingUserActionFinalizing transitions and confirms they are not reset for InstallingFinalizing transitions.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Tests lack meaningful assertion messages. All assertions in both test cases lack failure messages required by the check (e.g., missing descriptive text). Criteria 1-3 and 5 pass. Add assertion messages to all Expect() calls with descriptions of what went wrong. Example: Expect(err).ShouldNot(HaveOccurred(), "RefreshStatus failed"). Note: Consistent with existing codebase patterns.
✅ Passed checks (10 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: resetting finalizing timeout when transitioning from installing-pending-user-action, which matches the primary objective of the PR.
Description check ✅ Passed The description is comprehensive and addresses all required template sections: clear problem statement, proposed solution, issue linking, appropriate classification as bug fix, environment impact, testing approach, and completed checklist items.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names use stable, descriptive static strings without dynamic values (timestamps, UUIDs, pod/node names). Test titles clearly indicate what each validates.
Microshift Test Compatibility ✅ Passed PR adds unit tests using local DB and mocks, not e2e tests. MicroShift check targets e2e tests against actual clusters. Check not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed New tests are unit tests in internal/cluster/transition_test.go, not e2e tests. The SNO compatibility check applies only to e2e tests, not unit tests with mocked APIs.
Topology-Aware Scheduling Compatibility ✅ Passed No topology-aware scheduling concerns. Changes are internal cluster state transition logic without any pod scheduling, deployment manifests, or operator code that would affect cluster topology.
Ote Binary Stdout Contract ✅ Passed This PR modifies the OpenShift Assisted Service (a REST API), not an OTE test binary. Changes are in internal cluster transition logic with no process-level stdout writes.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Not applicable. This PR adds unit tests only (internal/cluster/transition_test.go using gomock/gorm), not Ginkgo e2e tests. Check applies only to e2e tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@carbonin
Copy link
Copy Markdown
Member Author

carbonin commented May 8, 2026

Adding hold as I haven't tested this in a live environment yet.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/cluster/transition_test.go (1)

6183-6234: ⚡ Quick win

Strengthen the negative-path test by asserting both fields remain unchanged.

Right now this case starts with NULL progress fields and only conditionally checks progress_finalizing_stage_started_at, so regressions on progress_finalizing_stage_timed_out (or field clearing/rewrite behavior) can slip through.

💡 Suggested test hardening
 It("should NOT reset finalizing stage fields on transition from Installing to Finalizing", func() {
+	oldTimestamp := time.Now().Add(-2 * time.Hour)
+	oldTimedOut := true
 	cluster := common.Cluster{
 		Cluster: models.Cluster{
 			ID:               &clusterId,
 			Status:           swag.String(models.ClusterStatusInstalling),
@@
 	}
 	Expect(db.Create(&cluster).Error).ShouldNot(HaveOccurred())
+	Expect(db.Model(&common.Cluster{}).Where("id = ?", clusterId.String()).Updates(map[string]interface{}{
+		"progress_finalizing_stage_started_at": oldTimestamp,
+		"progress_finalizing_stage_timed_out":  oldTimedOut,
+	}).Error).ShouldNot(HaveOccurred())

@@
-	var progressStartedAt *time.Time
-	var progressTimedOut *bool
+	var progressStartedAt time.Time
+	var progressTimedOut bool
 	row := db.Raw("SELECT progress_finalizing_stage_started_at, progress_finalizing_stage_timed_out FROM clusters WHERE id = ?", clusterId.String()).Row()
 	err = row.Scan(&progressStartedAt, &progressTimedOut)
 	Expect(err).ShouldNot(HaveOccurred())

-	if progressStartedAt != nil {
-		Expect(*progressStartedAt).NotTo(BeTemporally("~", time.Now(), 5*time.Second))
-	}
+	Expect(progressStartedAt).To(BeTemporally("~", oldTimestamp, time.Second))
+	Expect(progressTimedOut).To(BeTrue())
 })
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/cluster/transition_test.go` around lines 6183 - 6234, The test only
verifies progress_finalizing_stage_started_at conditionally; capture the initial
values of both progress_finalizing_stage_started_at and
progress_finalizing_stage_timed_out from the clusters table before calling
clusterApi.RefreshStatus (use the same DB query pattern used later), then after
calling getClusterFromDB and clusterApi.RefreshStatus re-query those two fields
and assert they are equal to the initial values (i.e. both remain nil or
unchanged). Reference progress_finalizing_stage_started_at,
progress_finalizing_stage_timed_out, getClusterFromDB, and
clusterApi.RefreshStatus when applying this fix.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/cluster/transition_test.go`:
- Around line 6183-6234: The test only verifies
progress_finalizing_stage_started_at conditionally; capture the initial values
of both progress_finalizing_stage_started_at and
progress_finalizing_stage_timed_out from the clusters table before calling
clusterApi.RefreshStatus (use the same DB query pattern used later), then after
calling getClusterFromDB and clusterApi.RefreshStatus re-query those two fields
and assert they are equal to the initial values (i.e. both remain nil or
unchanged). Reference progress_finalizing_stage_started_at,
progress_finalizing_stage_timed_out, getClusterFromDB, and
clusterApi.RefreshStatus when applying this fix.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f632f8b5-15d3-4cb7-942e-1190bd4ba620

📥 Commits

Reviewing files that changed from the base of the PR and between eb2dc70 and 9685baf.

📒 Files selected for processing (2)
  • internal/cluster/transition.go
  • internal/cluster/transition_test.go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant