Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/HostStatus.dot
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,7 @@ digraph HostStateMachine {
"installing-in-progress" -> "installing-pending-user-action" [label = "wrong boot\norder", color=cornsilk4, fontcolor=cornsilk4];

"installing-pending-user-action" -> installed [label = "boot from\ninstall disk", color=chartreuse4, fontcolor=chartreuse4];
"installing-pending-user-action" -> error [label = "timeout", color=crimson, fontcolor=crimson];

cancelled -> resetting [label = "reset", color=aquamarine4, fontcolor=aquamarine4];

Expand Down
Binary file modified docs/HostStatus.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
2 changes: 1 addition & 1 deletion docs/architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ Each cluster and each host being installed moves through their respective state
* Preparing-for-installation: The service runs openshift-install create ignition-configs and uploads all files to S3. If the user chose to use route53 for DNS, the service creates those record sets.
* Installing: The service is ready to begin the cluster installation. Next time the agent asks for instructions, the service will instruct it to begin the installation, and then moves the state to installing-in-progress.
* Installing-in-progress: The host is currently installing.
* Installing-pending-user-action: If the service expected the host to reboot and boot from disk, but the agent came up again and contacted the service, the host enters this state to notify the user to fix the server’s boot order.
* Installing-pending-user-action: If the service expected the host to reboot and boot from disk, but the agent came up again and contacted the service, the host enters this state to notify the user to fix the server’s boot order. The host will time out and move to error if the cluster would be successful without the host installing.
* Resetting: If the user requested to reset the installation, the host enters this transient state while the service resets.
* Resetting-pending-user-action: To reset the installation, the host needs to be booted from the live image. If the host already booted from disk in a previous installation, the host enters this state to notify the user to boot from the live image.
* Installed: The installation has successfully completed on the host.
Expand Down
316 changes: 316 additions & 0 deletions docs/installing-pending-user-action-timeout-behavior.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,316 @@
# Timeout Behavior for Hosts in "installing-pending-user-action" State

## Overview

This document explains how hosts reach the "installing-pending-user-action" state and describes the timeout behavior for both hosts and clusters when hosts are in this state.

## How Hosts Reach "installing-pending-user-action"

Hosts can transition to the `installing-pending-user-action` state in two scenarios:

### 1. Wrong Boot Order Detection (Most Common)

**Trigger:** A host boots from the discovery ISO after it should have booted from the installed operating system.

**When it happens:**
- The host has already started installation and reached the rebooting stage
- After the reboot, the host boots from the discovery ISO instead of the installed OS disk
- The discovery ISO immediately tries to register with the service
- The service detects this is wrong and moves the host to `installing-pending-user-action`

**State machine transitions:**
- **Day-1 hosts:** From `installing-in-progress` or `installing` → `installing-pending-user-action`
- Condition: Host is in rebooting stage AND registers (indicating wrong boot device)
- Code: `internal/host/statemachine.go:189-201`

- **Day-2 hosts:** From `installing-in-progress` or `added-to-existing-cluster` → `installing-pending-user-action`
- Condition: Host is in done stage AND registers
- Code: `internal/host/statemachine.go:189-201`

**User-facing message:**
```
Expected the host to boot from disk, but it booted the installation image - please
reboot and fix boot order to boot from disk [Model] [Serial] (disk-name, disk-id)
```

Where `[Model]` and `[Serial]` are the installation disk's model and serial number (if available), and `disk-name` and `disk-id` identify the specific disk.

**Root causes:**
- BIOS/UEFI boot order configured incorrectly
- Installation disk not set as first boot device
- Virtual media still mounted with higher priority than installation disk
- User manually selected discovery ISO during boot

### 2. Reboot Timeout

**Trigger:** A host takes too long to reboot and doesn't make progress in the rebooting stage.

**When it happens:**
- Host is in `installing-in-progress` state
- Host is in the `rebooting` stage
- The stage timeout expires before the host completes the reboot

**Timeouts:**
- **Multi-node clusters:** 40 minutes (default)
- Configurable via `HOST_STAGE_REBOOTING_TIMEOUT` environment variable
- **Single Node OpenShift (SNO):** 80 minutes (hardcoded)
- Extended timeout due to additional bootstrap responsibilities

**State machine transition:**
- From: `installing-in-progress` (while in rebooting stage)
- To: `installing-pending-user-action`
- Condition: Stage timeout expired AND host is in rebooting stage
- Code: `internal/host/statemachine.go:921-933`
Comment thread
rccrdpccl marked this conversation as resolved.

**User-facing message:**
```
Host timed out when pulling the configuration files. Verify in the host console
that the host boots from the OpenShift installation disk (disk-name, disk-id) and
has network access to the cluster API. The installation will resume after the host
successfully boots and can access the cluster API
```

Where `disk-name` and `disk-id` identify the installation disk (e.g., `(sda, /dev/disk/by-id/...)`)

**Root causes:**
- Host hardware taking unusually long to reboot
- Host stuck in BIOS/firmware
- Network connectivity issues preventing host from reporting progress
- Actual wrong boot order (boots to ISO but doesn't register immediately)

### Key Differences Between the Two Scenarios

| Aspect | Wrong Boot Order Detection | Reboot Timeout |
|--------|----------------------------|----------------|
| **Trigger** | Host actively registers while in rebooting stage | Rebooting stage timeout expires |
| **Detection** | Immediate (on registration) | After 40min (multi-node) or 80min (SNO) |
| **Message keyword** | "booted the installation image" | "timed out when pulling configuration" |
| **Certainty** | Definitely wrong boot order | Could be timeout OR wrong boot order |
| **Code path** | `PostRegisterDuringReboot` | `PostRefreshHost` with `statusRebootTimeout` |
| **Event handler** | `TransitionTypeRegisterHost` | `TransitionTypeRefresh` |

**Important:** Both scenarios result in the same state (`installing-pending-user-action`), but the messages differ to help users understand what happened. The "installation image" message is more specific because the service detected the host actually booted from the discovery ISO, while the timeout message is more ambiguous since the service doesn't know for certain what happened.

## Host-Level Timeout Behavior

### While in "installing-pending-user-action"

**Key point:** Hosts in `installing-pending-user-action` state will now timeout after a configurable duration.

**Timeout Configuration:**
- **Default:** 60 minutes
- **Configurable via:** `HOST_INSTALLING_PENDING_USER_ACTION_TIMEOUT` environment variable
- **Measured from:** When the host entered `installing-pending-user-action` state (using `status_updated_at` field)

**Behavior:**
- The host remains in `installing-pending-user-action` waiting for user intervention
- After the timeout expires, the host transitions to `error` state
- The host can recover before timeout by booting from the correct disk

**Recovery path (before timeout):**
1. User fixes the boot order or manually boots from installation disk
2. Host boots into the installed OS
3. Host begins pulling configuration and continues installation
4. Host reports progress via its stage
5. Host transitions back to `installing-in-progress` or progresses to `installed`

**Timeout behavior:**
- After timeout expires: host transitions to `error` state **only if cluster can succeed without it**
- Cluster viability check ensures critical hosts (e.g., masters in 3-node cluster) don't timeout
- If cluster needs this host to succeed, host stays in `installing-pending-user-action` (cluster 24h timeout applies)
- Error message: "Host failed to boot from the installation disk within the expected time. The host had wrong boot order and did not recover. To include this host, reset the cluster and fix the boot order, or continue installation with remaining hosts"
- Code: `internal/host/statemachine.go` (transition with cluster viability check)

**State machine behavior:**
- Refresh transitions keep the host in `installing-pending-user-action` until timeout expires
- After timeout: transition to `error` via `HasPendingUserActionTimedOut` condition
- HostProgress transitions can move the host out of this state when progress is detected (before timeout)

## Cluster-Level Timeout Behavior

### Transition INTO "installing-pending-user-action"

**Trigger:** At least one host in the cluster enters `installing-pending-user-action`

**Conditions:**
- Cluster is in `installing` or `finalizing` state
- At least one host has status `installing-pending-user-action`
- The cluster still has enough hosts in installing/installed states to potentially succeed

**State machine transition:**
- From: `installing` or `finalizing`
- To: `installing-pending-user-action`
- Condition: `IsInstallingPendingUserAction` returns true (checks if any host has this status)
- Code: `internal/cluster/statemachine.go:457-471`

**Status message:** "Cluster has hosts pending user action"

### Timeout While Cluster is in "installing-pending-user-action"

**Installation Timeout:**
- **Default:** 24 hours
- **Configurable via:** `INSTALLATION_TIMEOUT` environment variable
- **Start time:** When installation began (not when cluster entered pending state)

**When timeout expires:**
- Cluster transitions to `error` state
- Error message: "cluster installation timed out while pending user action (a manual booting from installation disk)"
- Code: `internal/cluster/statemachine.go:320-332`

**Important:** The timeout is measured from the original installation start time, NOT from when the cluster entered `installing-pending-user-action`. This means:
- If a cluster enters pending state after 23 hours of installation, it only has 1 hour to recover
- The timeout is checking overall installation duration, not time spent in pending state

### Other Timeout Scenarios

While in `installing-pending-user-action`, the cluster can also transition to error if:

1. **Not enough hosts to proceed:**
- Condition: Cluster no longer has enough hosts in installing/finalizing states
- Result: Move to `error` with message "cluster has hosts in error"
- Code: `internal/cluster/statemachine.go:301-318`

### Recovery from "installing-pending-user-action"

The cluster can transition OUT of `installing-pending-user-action` when hosts recover:

**Back to Installing:**
- Condition: No hosts are pending user action AND enough hosts are still installing (but not yet finalizing)
- Transition: `installing-pending-user-action` → `installing`
- Message: "Installation in progress"
- Code: `internal/cluster/statemachine.go:421-437`

**To Finalizing:**
- Condition: No hosts are pending user action AND enough masters and workers are installed
- Transition: `installing-pending-user-action` → `finalizing`
- Message: "Finalizing cluster installation"
- Code: `internal/cluster/statemachine.go:439-454`

**Stay in Pending (no-op):**
- Condition: Still have hosts pending user action, but enough hosts are installing or finalizing
- Transition: `installing-pending-user-action` → `installing-pending-user-action`
- Code: `internal/cluster/statemachine.go:404-419`

### Cluster Success with Timed-Out Hosts

**New Behavior:** When hosts timeout from `installing-pending-user-action`, the system intelligently decides whether to move them to `error` based on cluster viability.

**How It Works:**

1. **Host Timeout with Viability Check:**
- Host in `installing-pending-user-action` exceeds `HOST_INSTALLING_PENDING_USER_ACTION_TIMEOUT` (default: 60 minutes)
- System checks: "Would the cluster still succeed if this host goes to error?"
- **If YES** (cluster has enough remaining hosts): Host transitions to `error` state
- **If NO** (cluster needs this host): Host stays in `installing-pending-user-action`, giving user more time to fix
- This prevents timing out critical hosts (e.g., masters in 3-node cluster)

2. **Cluster Re-evaluation:**
- Cluster checks if any hosts are still in `installing-pending-user-action`
- If no hosts pending: cluster exits `installing-pending-user-action` state
- Cluster checks if it has enough hosts to succeed (via `IsInstalling` or `IsFinalizing`)

3. **Sufficient Hosts Check:**
- Hosts in `error` state are NOT counted toward minimum requirements
- **For Workers:** Cluster needs minimum 2 workers in good states (installing, installed, etc.)
- Example: 5-worker cluster with 2 workers in error → succeeds (3 healthy ≥ 2 minimum)
- **For Masters:** Cluster needs ALL expected masters in good states
- Example: 3-master cluster with 1 master in error → FAILS (2 healthy < 3 required)

4. **Cluster Progression:**
- If sufficient hosts: cluster moves to `installing` or `finalizing` → `installed`
- If insufficient hosts: cluster moves to `error`

**Practical Examples:**

**Scenario 1: Worker failure - Cluster succeeds**
- 3-master + 5-worker cluster
- 2 workers stuck in `installing-pending-user-action`
- After 60 minutes: 2 workers → `error`
- Cluster has 3 masters + 3 workers (healthy)
- Result: Cluster succeeds (3 workers ≥ 2 minimum)

**Scenario 2: Master failure - Smart timeout**
- 3-master cluster (requires all 3 masters)
- 1 master stuck in `installing-pending-user-action`
- After 60 minutes: **Master stays in `installing-pending-user-action`** (viability check prevents timeout)
- Reason: Cluster needs all 3 masters; timing out would immediately fail cluster
- User has until 24-hour cluster timeout to fix boot order
- Result: Better UX - user gets more time to recover critical host

**Scenario 3: Extra workers, some fail**
- 3-master + 3-worker cluster
- 1 worker stuck in `installing-pending-user-action`
- After 60 minutes: 1 worker → `error` (cluster still has 2 workers ≥ minimum)
- Cluster has 3 masters + 2 workers (healthy)
- Result: Cluster succeeds (2 workers ≥ 2 minimum)

**Scenario 4: Insufficient workers - Smart timeout**
- 3-master + 2-worker cluster (minimum workers)
- 1 worker stuck in `installing-pending-user-action`
- After 60 minutes: **Worker stays in `installing-pending-user-action`** (viability check prevents timeout)
- Reason: Only 2 workers total; timing out would leave only 1 worker (< 2 minimum)
- Result: User gets more time to recover the critical worker

**Code References:**
- Host timeout transition: `internal/host/statemachine.go` (HasPendingUserActionTimedOut condition)
- Cluster validation: `internal/cluster/transition.go:517-554` (enoughMastersAndWorkers function)
- Host counting: `internal/cluster/common.go:192-206` (HostsInStatus function)
Comment on lines +254 to +257
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify the locations of viability check functions

# Search for enoughMastersAndWorkers function
echo "=== Searching for enoughMastersAndWorkers ==="
rg -n 'func.*enoughMastersAndWorkers' internal/

# Search for HostsInStatus function  
echo "=== Searching for HostsInStatus ==="
rg -n 'func.*HostsInStatus' internal/

# Search for HasEnoughMastersAndWorkers function
echo "=== Searching for HasEnoughMastersAndWorkers ==="
rg -n 'func.*HasEnoughMastersAndWorkers' internal/

Repository: openshift/assisted-service

Length of output: 416


🏁 Script executed:

# Check if enoughMastersAndWorkers exists in transition.go
echo "=== Checking transition.go for enoughMastersAndWorkers ==="
rg -A 5 'enoughMastersAndWorkers' internal/cluster/transition.go

# Check what's actually at lines 517-554 in transition.go
echo "=== Lines 517-554 in transition.go ==="
sed -n '517,554p' internal/cluster/transition.go | cat -n

# Check the context around HostsInStatus in common.go
echo "=== Lines 910-925 in common.go (around HostsInStatus) ==="
sed -n '910,925p' internal/common/common.go | cat -n

# Verify if transition.go calls HasEnoughMastersAndWorkers or similar
echo "=== Searching transition.go for HasEnoughMastersAndWorkers usage ==="
rg -B 2 -A 2 'HasEnoughMastersAndWorkers' internal/cluster/transition.go

Repository: openshift/assisted-service

Length of output: 3041


Correct the code reference paths for cluster viability checks.

The documentation references incorrect file paths and function names:

  • enoughMastersAndWorkers function does not exist; should reference HasEnoughMastersAndWorkers
  • HostsInStatus is located at internal/common/common.go:916, not internal/cluster/common.go:192-206
  • HasEnoughMastersAndWorkers is located at internal/common/common.go:939, not internal/cluster/transition.go:517-554

Update lines 254-257 to reference:

  • Host counting: internal/common/common.go:916 (HostsInStatus function)
  • Cluster validation: internal/common/common.go:939 (HasEnoughMastersAndWorkers function)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/installing-pending-user-action-timeout-behavior.md` around lines 254 -
257, Update the documentation references to point at the correct locations and
function names: replace the incorrect cluster validation reference to
`enoughMastersAndWorkers` with `HasEnoughMastersAndWorkers` and update its path
to `internal/common/common.go:939`, and change the host counting reference for
`HostsInStatus` to `internal/common/common.go:916`; ensure lines 254-257 mention
`HostsInStatus` at internal/common/common.go:916 and
`HasEnoughMastersAndWorkers` at internal/common/common.go:939 so the doc links
match the codebase.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rabbit has a point, the function moved :)

Code references in general are very hard to maintain, I would discourage their use, unless we find some method that would avoid the staleness issue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rccrdpccl, totally agree — line-number references in docs are notoriously fragile. A few approaches that can reduce (though not eliminate) the staleness problem:

  1. GitHub permalink to a specific commit SHA — Instead of internal/foo/bar.go:123, link to https://github.com/openshift/assisted-service/blob/<SHA>/internal/foo/bar.go#L123. The line stays frozen at that commit, so it's never silently wrong — though it does go stale in the sense that it won't track future refactors.

  2. Reference symbols/function names only, no line numbers — Prose like "see HasEnoughMastersAndWorkers in internal/common/common.go" is far more resilient than line numbers because symbol renames are usually grep-able, whereas line shifts are invisible.

  3. Automated link-rot check — A CI step (e.g., a small script) that greps for every symbol name mentioned in the doc and fails if any are missing from the codebase. This catches renames at PR time rather than months later.

For this PR, given the docs-only nature, the simplest fix would be to drop line numbers entirely and just reference function/file names. That way the references remain useful without the high maintenance cost.


📝 Learnings were identified and not saved (knowledge base disabled). Enable


## Summary Table

| Aspect | Behavior | Timeout Value | Configurable |
|--------|----------|---------------|--------------|
| **Host entering state** | Wrong boot order OR reboot timeout | N/A (event-driven) | No |
| **Host reboot timeout (multi-node)** | Triggers transition to pending state | 40 minutes | Yes (`HOST_STAGE_REBOOTING_TIMEOUT`) |
| **Host reboot timeout (SNO)** | Triggers transition to pending state | 80 minutes | No (hardcoded) |
| **Host while in pending state** | Timeout to error if cluster viable | 60 minutes | Yes (`HOST_INSTALLING_PENDING_USER_ACTION_TIMEOUT`) |
| **Host after timeout** | Transitions to error (only if cluster can succeed) | N/A (triggered by viability check) | N/A |
| **Cluster timeout** | Entire installation must complete | 24 hours (from install start) | Yes (`INSTALLATION_TIMEOUT`) |
| **Cluster recovery** | Returns to installing/finalizing when hosts recover or timeout | N/A | N/A |
| **Cluster success with failures** | Workers can fail if minimum met; masters cannot | Workers: ≥2, Masters: all required | N/A |

## Critical Implementation Details

### Host Stage Timeouts
- Defined in: `internal/host/config.go`
- Default values in `hostStageTimeoutDefaults` map
- Rebooting stage: 40 minutes (except SNO which is 80 minutes)
- All stage timeouts are configurable via `HOST_STAGE_<STAGE_NAME>_TIMEOUT` environment variables

### Cluster Installation Timeout
- Defined in: `internal/cluster/cluster.go`
- Field: `InstallationTimeout` in `Config` struct
- Default: 24 hours
- Environment variable: `INSTALLATION_TIMEOUT`
- **Critical:** Timeout is measured from `cluster.InstallStartedAt`, not from when cluster enters pending state

### Wrong Boot Order Stages
The following stages are considered "wrong boot order" stages where timeout should be ignored if the cluster is in pending state:
- `waiting-for-control-plane`
- `waiting-for-controller`
- `waiting-for-bootkube`
- `rebooting`

Defined in: `internal/host/host.go:41-46` as `WrongBootOrderIgnoreTimeoutStages`

## Related Code References

### State Machines
- Host state machine: `internal/host/statemachine.go`
- Host transitions: `internal/host/transition.go`
- Cluster state machine: `internal/cluster/statemachine.go`
- Cluster transitions: `internal/cluster/transition.go`

### Status Messages
- Host status constants: `internal/host/common.go:49` (defines `statusRebootTimeout`)
- Wrong boot order message: `internal/host/transition.go` (in `PostRegisterDuringReboot` function)
- Line ~234: "Expected the host to boot from disk, but it booted the installation image..."
- Reboot timeout message: `internal/host/common.go:49`
- Constant: `statusRebootTimeout`
- Message macro replacement: `internal/host/transition.go` (in `replaceMacros` function)
- Replaces `$INSTALLATION_DISK` with actual disk information

### Configurations
- Host timeout configurations: `internal/host/config.go`
- Cluster timeout configurations: `internal/cluster/cluster.go`
- Cluster status constants: `internal/cluster/common.go`