MGMT-24352: Reset finalizing timeout on transition from installing-pending-user-action #10293
…ction

When a non-essential worker gets stuck in `installing-pending-user-action`, the rest of the cluster can still progress to finalizing. The cluster-level Done finalization timeout (70 min) can then expire before the per-host timeout chain completes (40 min Rebooting + 60 min pending-user-action = 100 min). In this situation, when the stuck host moves from `installing-pending-user-action` to either `error` or `installing-in-progress`, the entire cluster fails instead of evicting the stuck host and completing with the remaining nodes.

This commit changes the cluster state transition so that it resets `progress_finalizing_stage_started_at` and `progress_finalizing_stage_timed_out` when transitioning FROM InstallingPendingUserAction TO Finalizing. This gives the cluster a fresh 70-minute timeout window, allowing sufficient time for the host to either finish installation (if the boot issue was fixed) or error out (in which case the cluster will succeed without it).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Resolves https://redhat.atlassian.net/browse/MGMT-24352
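The timing gap described above can be sketched in a few lines of Go. This is an illustrative model only, not the code from `internal/cluster/transition.go`: the `clusterProgress` struct, the helper names, and the simplification that the finalizing window and the host timeout chain start at the same instant are all assumptions. It also assumes that "reset" means stamping the start time with the current time and clearing the timed-out flag.

```go
package main

import (
	"fmt"
	"time"
)

// Timeout values quoted in the PR description.
const (
	finalizingTimeout        = 70 * time.Minute // cluster-level Done finalization timeout
	rebootingTimeout         = 40 * time.Minute // per-host Rebooting stage
	pendingUserActionTimeout = 60 * time.Minute // per-host installing-pending-user-action
)

// clusterProgress is a hypothetical stand-in for the two tracked fields
// (progress_finalizing_stage_started_at / progress_finalizing_stage_timed_out).
type clusterProgress struct {
	FinalizingStageStartedAt time.Time
	FinalizingStageTimedOut  bool
}

// hostChainTimeout is the worst-case time before a stuck host leaves
// installing-pending-user-action: 40 min + 60 min = 100 min.
func hostChainTimeout() time.Duration {
	return rebootingTimeout + pendingUserActionTimeout
}

// gapMinutes is how long the host chain can outlast the finalizing window.
func gapMinutes() int {
	return int((hostChainTimeout() - finalizingTimeout).Minutes())
}

// resetFinalizing models the fix (assumed semantics): start a fresh window.
func resetFinalizing(p *clusterProgress, now time.Time) {
	p.FinalizingStageStartedAt = now
	p.FinalizingStageTimedOut = false
}

// finalizingExpired reports whether the finalizing window has run out.
func finalizingExpired(p clusterProgress, now time.Time) bool {
	return now.Sub(p.FinalizingStageStartedAt) > finalizingTimeout
}

// expiredAfterHostChain checks the window at the moment the host chain ends,
// with or without the reset applied on the transition.
func expiredAfterHostChain(withReset bool) bool {
	start := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC) // arbitrary fixed instant
	p := clusterProgress{FinalizingStageStartedAt: start}
	now := start.Add(hostChainTimeout())
	if withReset {
		resetFinalizing(&p, now)
	}
	return finalizingExpired(p, now)
}

func main() {
	fmt.Println("gap in minutes:", gapMinutes())
	fmt.Println("expired without reset:", expiredAfterHostChain(false))
	fmt.Println("expired with reset:", expiredAfterHostChain(true))
}
```

Under these assumptions the host chain outlasts the finalizing window by 30 minutes, so the window has already expired when the host finally changes state unless it is reset on the transition.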
Walkthrough
The PR adds logic to reset finalizing-stage timeout tracking when a cluster transitions from
Changes
Finalizing-Stage Timeout Reset on Transition
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks | ✅ 10 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
Adding hold as I haven't tested this in a live environment yet.
🧹 Nitpick comments (1)
internal/cluster/transition_test.go (1)
6183-6234: ⚡ Quick win
Strengthen the negative-path test by asserting both fields remain unchanged.
Right now this case starts with NULL progress fields and only conditionally checks progress_finalizing_stage_started_at, so regressions on progress_finalizing_stage_timed_out (or field clearing/rewrite behavior) can slip through.
💡 Suggested test hardening
 It("should NOT reset finalizing stage fields on transition from Installing to Finalizing", func() {
+    oldTimestamp := time.Now().Add(-2 * time.Hour)
+    oldTimedOut := true
     cluster := common.Cluster{
         Cluster: models.Cluster{
             ID:     &clusterId,
             Status: swag.String(models.ClusterStatusInstalling),
@@
     }
     Expect(db.Create(&cluster).Error).ShouldNot(HaveOccurred())
+    Expect(db.Model(&common.Cluster{}).Where("id = ?", clusterId.String()).Updates(map[string]interface{}{
+        "progress_finalizing_stage_started_at": oldTimestamp,
+        "progress_finalizing_stage_timed_out":  oldTimedOut,
+    }).Error).ShouldNot(HaveOccurred())
@@
-    var progressStartedAt *time.Time
-    var progressTimedOut *bool
+    var progressStartedAt time.Time
+    var progressTimedOut bool
     row := db.Raw("SELECT progress_finalizing_stage_started_at, progress_finalizing_stage_timed_out FROM clusters WHERE id = ?", clusterId.String()).Row()
     err = row.Scan(&progressStartedAt, &progressTimedOut)
     Expect(err).ShouldNot(HaveOccurred())
-    if progressStartedAt != nil {
-        Expect(*progressStartedAt).NotTo(BeTemporally("~", time.Now(), 5*time.Second))
-    }
+    Expect(progressStartedAt).To(BeTemporally("~", oldTimestamp, time.Second))
+    Expect(progressTimedOut).To(BeTrue())
 })
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/cluster/transition_test.go` around lines 6183 - 6234, The test only verifies progress_finalizing_stage_started_at conditionally; capture the initial values of both progress_finalizing_stage_started_at and progress_finalizing_stage_timed_out from the clusters table before calling clusterApi.RefreshStatus (use the same DB query pattern used later), then after calling getClusterFromDB and clusterApi.RefreshStatus re-query those two fields and assert they are equal to the initial values (i.e. both remain nil or unchanged). Reference progress_finalizing_stage_started_at, progress_finalizing_stage_timed_out, getClusterFromDB, and clusterApi.RefreshStatus when applying this fix.
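The capture-before, compare-after check described in this prompt can be sketched without a database. This is a minimal illustration, not code from `transition_test.go`: the `finalizingFields` struct and the helpers `sameTime`, `sameBool`, `unchanged`, and `boolPtr` are hypothetical names. Pointer fields stand in for nullable columns, so that "both remain nil" and "both keep their old values" are covered by the same comparison.

```go
package main

import (
	"fmt"
	"time"
)

// finalizingFields is a hypothetical snapshot of the two nullable columns
// progress_finalizing_stage_started_at and progress_finalizing_stage_timed_out.
type finalizingFields struct {
	StartedAt *time.Time
	TimedOut  *bool
}

// sameTime treats two nils as equal and otherwise compares the instants.
func sameTime(a, b *time.Time) bool {
	if a == nil || b == nil {
		return a == b
	}
	return a.Equal(*b)
}

// sameBool treats two nils as equal and otherwise compares the values.
func sameBool(a, b *bool) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

// unchanged reports whether neither field was rewritten or cleared.
func unchanged(before, after finalizingFields) bool {
	return sameTime(before.StartedAt, after.StartedAt) &&
		sameBool(before.TimedOut, after.TimedOut)
}

// boolPtr is a small helper for building snapshots.
func boolPtr(b bool) *bool { return &b }

func main() {
	old := time.Now().Add(-2 * time.Hour)
	before := finalizingFields{StartedAt: &old, TimedOut: boolPtr(true)}

	// A status refresh that must not touch the fields leaves the snapshot equal.
	fmt.Println(unchanged(before, before))

	// A refresh that wrongly resets the fields is detected.
	now := time.Now()
	after := finalizingFields{StartedAt: &now, TimedOut: boolPtr(false)}
	fmt.Println(unchanged(before, after))
}
```

In the real test the two snapshots would come from querying the clusters table before and after `clusterApi.RefreshStatus`. Comparing both fields unconditionally is what closes the gap the review comment points out.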
📒 Files selected for processing (2)
internal/cluster/transition.go
internal/cluster/transition_test.go
When a non-essential worker gets stuck in installing-pending-user-action, the rest of the cluster can still progress to finalizing. The cluster-level Done finalization timeout (70 min) can then expire before the per-host timeout chain completes (40 min Rebooting + 60 min pending-user-action = 100 min). In this situation, when the stuck host moves from installing-pending-user-action to either error or installing-in-progress, the entire cluster fails instead of evicting the stuck host and completing with the remaining nodes.

This commit changes the cluster state transition so that it resets progress_finalizing_stage_started_at and progress_finalizing_stage_timed_out when transitioning FROM InstallingPendingUserAction TO Finalizing. This gives the cluster a fresh 70-minute timeout window, allowing sufficient time for the host to either finish installation (if the boot issue was fixed) or error out (in which case the cluster will succeed without it).
List all the issues related to this PR
Resolves https://redhat.atlassian.net/browse/MGMT-24352
What environments does this code impact?
How was this code tested?
WIP - will test manually