feat: warmpool support for OpenShell sandboxes #1447

## Problem Statement

OpenShell sandboxes have a 5–15s cold start time due to sequential pod scheduling, workspace PVC seeding, supervisor initialization, and gateway registration. This makes OpenShell unsuitable for latency-sensitive workloads where users expect near-instant sandbox availability. Pre-warming sandboxes before a user request arrives would eliminate the dominant cold start costs, but the current architecture tightly couples sandbox identity to pod creation time, making external warmpool implementations impossible without gateway changes.

## Technical Context

OpenShell already uses `agents.x-k8s.io/v1alpha1 Sandbox` as its Kubernetes backend — the higher-level `SandboxWarmPool`/`SandboxTemplate`/`SandboxClaim` resources from the same API group are a natural external interface, but OpenShell has no awareness of them today.

Two structural properties of the current design prevent external warmpool integration:

1. **Static identity binding**: `OPENSHELL_SANDBOX_ID` is injected as a pod env var at creation time (`driver.rs:1413-1415`) and cannot be changed in a running pod. A warm pod's `sandbox_id` is therefore fixed when the pod is created, which means the gateway must have allocated that ID before the pod starts.

2. **Gateway store gate**: on boot, the supervisor calls `ConnectSupervisor` with its `sandbox_id`. The gateway rejects any stream whose ID is not in its persistent store (`supervisor_session.rs:572`). Without a pre-existing store record, the supervisor is permanently stuck in `PROVISIONING`.

Together, these blockers mean that an external controller creating `SandboxWarmPool`/`SandboxTemplate` objects cannot produce a usable warm sandbox without calling OpenShell's `CreateSandbox` API, which immediately schedules pod creation, defeating the purpose.
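
The store gate can be illustrated with a minimal sketch. This is not the code in `supervisor_session.rs`: `SandboxStore`, `insert_provisioning`, and `connect_supervisor` are hypothetical names, and the assumption that a successful connect moves the record to a ready phase is ours. It only shows why a store record has to exist before the supervisor connects.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the gateway's persistent sandbox store.
#[derive(Default)]
pub struct SandboxStore {
    records: HashMap<String, SandboxRecord>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Provisioning,
    Ready,
}

#[derive(Debug)]
pub struct SandboxRecord {
    pub phase: Phase,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ConnectError {
    /// No record exists for this ID, so the stream is rejected.
    UnknownSandbox,
}

impl SandboxStore {
    /// Pre-allocation step: the record must exist before the pod boots.
    pub fn insert_provisioning(&mut self, sandbox_id: &str) {
        self.records.insert(
            sandbox_id.to_owned(),
            SandboxRecord { phase: Phase::Provisioning },
        );
    }

    /// Models the gate applied to `ConnectSupervisor` (assumed behavior:
    /// a known ID is accepted and its record advances to `Ready`).
    pub fn connect_supervisor(&mut self, sandbox_id: &str) -> Result<Phase, ConnectError> {
        let record = self
            .records
            .get_mut(sandbox_id)
            .ok_or(ConnectError::UnknownSandbox)?;
        record.phase = Phase::Ready;
        Ok(record.phase)
    }
}
```

In this model a supervisor whose ID was never inserted is rejected on every attempt, which corresponds to the stuck `PROVISIONING` behavior described above.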

The good news: the per-user dynamic work (policy and credentials) is already hot-reloadable. The supervisor polls for config updates (`lib.rs` poll loop); provider credentials are live-updated. Only the pre-allocation and claim mechanism is missing.

## Affected Components

| Component | Key Files | Role |
|-----------|-----------|------|
| Gateway compute layer | `crates/openshell-server/src/compute/mod.rs` | Manages sandbox lifecycle, phases, supervisor sessions |
| Kubernetes driver | `crates/openshell-driver-kubernetes/src/driver.rs` | Creates `agents.x-k8s.io` Sandbox objects with static env |
| Supervisor session handler | `crates/openshell-server/src/supervisor_session.rs` | Validates and registers supervisor connections |
| Sandbox supervisor | `crates/openshell-sandbox/src/lib.rs` | Startup sequence, policy poll loop, ConnectSupervisor |
| Persistence layer | `crates/openshell-server/src/persistence/` | Source of truth for sandbox records |

## Technical Investigation

### Cold start breakdown

| Phase | Estimate | Pre-warmable? |
|-------|----------|---------------|
| Pod scheduling | 1–3s | Yes |
| Image pull (no cache) | 0–60s | Yes (cached images) |
| Init container: `copy-self` | 0.5–2s | Yes |
| Init container: workspace PVC seed | 1–10s | Yes — biggest win |
| Supervisor boot: netns + proxy + SSH | 1–4s | Yes |
| `ConnectSupervisor` handshake | 0.2–1s | Yes |
| Policy hot-reload (per user) | ~0.5s | No — user-specific |
| Credential injection | ~0s | No — already live |
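
To make the split between pre-warmable and per-user cost concrete, the table can be summed. The sketch below encodes the estimates above in milliseconds (treating `~0.5s` and `~0s` as point values, which is an assumption) and totals each group. `PhaseEstimate` and `total_ms` are illustrative names only.

```rust
/// One row of the cold start table, with estimates in milliseconds.
pub struct PhaseEstimate {
    pub name: &'static str,
    pub low_ms: u32,
    pub high_ms: u32,
    pub pre_warmable: bool,
}

/// The table above, transcribed row by row.
pub const PHASES: [PhaseEstimate; 8] = [
    PhaseEstimate { name: "Pod scheduling", low_ms: 1000, high_ms: 3000, pre_warmable: true },
    PhaseEstimate { name: "Image pull (no cache)", low_ms: 0, high_ms: 60000, pre_warmable: true },
    PhaseEstimate { name: "Init container: copy-self", low_ms: 500, high_ms: 2000, pre_warmable: true },
    PhaseEstimate { name: "Init container: workspace PVC seed", low_ms: 1000, high_ms: 10000, pre_warmable: true },
    PhaseEstimate { name: "Supervisor boot", low_ms: 1000, high_ms: 4000, pre_warmable: true },
    PhaseEstimate { name: "ConnectSupervisor handshake", low_ms: 200, high_ms: 1000, pre_warmable: true },
    PhaseEstimate { name: "Policy hot-reload (per user)", low_ms: 500, high_ms: 500, pre_warmable: false },
    PhaseEstimate { name: "Credential injection", low_ms: 0, high_ms: 0, pre_warmable: false },
];

/// Sums the (low, high) estimates over rows whose `pre_warmable` flag matches.
pub fn total_ms(phases: &[PhaseEstimate], pre_warmable: bool) -> (u32, u32) {
    phases
        .iter()
        .filter(|p| p.pre_warmable == pre_warmable)
        .fold((0, 0), |(lo, hi), p| (lo + p.low_ms, hi + p.high_ms))
}
```

Under these estimates the pre-warmable phases sum to 3.7–80s (3.7–20s when the image is already cached), while the per-user remainder on the claim path is about 0.5s.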

### What would need to change

**Gateway:** introduce a `warm_idle` sandbox state for pre-allocated records that are not visible to users via `ListSandboxes`. A pool reconciler maintains N warm slots per template. On `CreateSandbox`, the gateway claims the oldest ready warm slot instead of creating a new pod.

**Kubernetes driver:** add a pre-creation path that allocates a `sandbox_id`, inserts the store record, and creates the pod without a user request triggering it.

**Supervisor:** no changes needed. It already supports hot-reload of policy and live credential updates.

**New RPC (optional):** `ClaimSandbox(warm_sandbox_id, user_metadata, policy)` — or reuse `UpdateConfig` on an existing READY sandbox.
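
A rough sketch of the visibility rule for `warm_idle` records follows. It assumes `warm_idle` is modeled as a phase on the sandbox record and that ownership is an optional user name; neither is confirmed by the current code, and `SandboxPhase`, `Sandbox`, and `list_sandboxes` are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SandboxPhase {
    Provisioning,
    /// Pre-allocated and booted, but not yet claimed by any user.
    WarmIdle,
    Ready,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Sandbox {
    pub id: String,
    /// `None` while the slot sits unclaimed in the pool.
    pub owner: Option<String>,
    pub phase: SandboxPhase,
}

/// Returns the sandboxes a user may see: owned by that user and not warm_idle.
pub fn list_sandboxes<'a>(all: &'a [Sandbox], user: &str) -> Vec<&'a Sandbox> {
    all.iter()
        .filter(|s| s.phase != SandboxPhase::WarmIdle)
        .filter(|s| s.owner.as_deref() == Some(user))
        .collect()
}
```

Filtering on both phase and owner is deliberately redundant here: an unclaimed slot stays hidden even if one of the two fields is updated before the other during a claim.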

### Relationship to `agents.x-k8s.io` SandboxWarmPool

`SandboxWarmPool` + `SandboxTemplate` + `SandboxClaim` could serve as the external Kubernetes API surface for this feature: the pool controller would live outside OpenShell and drive pre-allocation via `CreateSandbox` with a `warm` flag. Alternatively, OpenShell could implement the pool natively. This is a key design decision for human review.

## Proposed Approach

Introduce a `WarmPool` configuration object in the gateway store (alongside the existing TOML config from RFC 0003). Each pool entry specifies a target replica count, a sandbox template reference (image, resource profile, base policy), and an update strategy (`Recreate` or `OnReplenish`). This allows operators to resize pools at runtime without a gateway restart.
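
One possible shape for such an object is sketched below. The field names (`replicas`, `template`, `update_strategy`) and the `resize` helper are assumptions for discussion, not an agreed schema; only the two strategy names come from the `SandboxWarmPool` spec mentioned in this issue.

```rust
/// Update strategies named after the `SandboxWarmPool` spec.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpdateStrategy {
    Recreate,
    OnReplenish,
}

/// Hypothetical template reference: what every slot in a pool has in common.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TemplateRef {
    pub image: String,
    pub resource_profile: String,
    pub base_policy: String,
}

/// Hypothetical pool entry stored in the gateway store.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WarmPool {
    pub name: String,
    /// Target number of warm slots the reconciler should maintain.
    pub replicas: u32,
    pub template: TemplateRef,
    pub update_strategy: UpdateStrategy,
}

impl WarmPool {
    /// Changes the target at runtime and returns the signed difference;
    /// converging on the new target is left to the reconciler.
    pub fn resize(&mut self, replicas: u32) -> i64 {
        let delta = i64::from(replicas) - i64::from(self.replicas);
        self.replicas = replicas;
        delta
    }
}
```

Keeping `resize` a pure update of the stored target means a resize needs no gateway restart and does not create or delete pods itself.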

The gateway runs a pool reconciler loop that maintains the target number of `warm_idle` sandbox records and their corresponding pods. Warm slots are pre-created with a gateway-assigned `sandbox_id` and a base policy; the supervisor boots and registers normally via `ConnectSupervisor`, but the slot is not visible to users via `ListSandboxes` until claimed.
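
The core decision of one reconciler pass could look like the following sketch. `PoolAction` and `reconcile` are hypothetical, and counting slots that are still booting toward the target is an assumption made here to avoid over-creating pods while earlier ones are starting.

```rust
use std::cmp::Ordering;

/// What a single reconcile pass decides to do for one pool.
#[derive(Debug, PartialEq, Eq)]
pub enum PoolAction {
    CreateSlots(usize),
    DeleteSlots(usize),
    Nothing,
}

/// Compares the target against slots that exist or are already on their way.
/// `warm_idle` counts registered, unclaimed slots; `booting` counts slots
/// whose record and pod were created but whose supervisor has not connected.
pub fn reconcile(target: usize, warm_idle: usize, booting: usize) -> PoolAction {
    let current = warm_idle + booting;
    match current.cmp(&target) {
        Ordering::Less => PoolAction::CreateSlots(target - current),
        Ordering::Greater => PoolAction::DeleteSlots(current - target),
        Ordering::Equal => PoolAction::Nothing,
    }
}
```

For example, with a target of 3, one idle slot, and one booting slot, a pass would request exactly one new slot.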

On `CreateSandbox`, the gateway first attempts to claim a matching warm slot — matching on template reference (image, resource profile). If a slot is available, it hot-reloads the user-specific policy onto the already-running supervisor before flipping visibility to the requesting user. If no warm slot is available, it falls back to a cold create. The pool reconciler replenishes the claimed slot asynchronously.
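
The claim-or-fallback selection might be sketched as below. All names are hypothetical, timestamps are plain seconds for simplicity, and the policy hot-reload and visibility flip are left out; the sketch covers only picking the oldest ready slot with a matching template and falling back to a cold create.

```rust
/// Hypothetical matching key: the template fields a claim must agree on.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TemplateKey {
    pub image: String,
    pub resource_profile: String,
}

#[derive(Debug, Clone)]
pub struct WarmSlot {
    pub sandbox_id: String,
    pub template: TemplateKey,
    /// True once the supervisor has registered via `ConnectSupervisor`.
    pub ready: bool,
    pub created_at_secs: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CreateOutcome {
    Claimed { sandbox_id: String },
    ColdCreate,
}

/// Claims the oldest ready slot matching `wanted`, removing it from the pool,
/// or reports that the caller has to fall back to a cold create.
pub fn create_sandbox(pool: &mut Vec<WarmSlot>, wanted: &TemplateKey) -> CreateOutcome {
    let oldest = pool
        .iter()
        .enumerate()
        .filter(|(_, s)| s.ready && &s.template == wanted)
        .min_by_key(|(_, s)| s.created_at_secs)
        .map(|(i, _)| i);

    match oldest {
        Some(i) => {
            let slot = pool.remove(i);
            CreateOutcome::Claimed { sandbox_id: slot.sandbox_id }
        }
        None => CreateOutcome::ColdCreate,
    }
}
```

Removing the slot from the pool at claim time is what lets the reconciler notice the shortfall and replenish asynchronously.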

Warm slots count against a configurable pool-level resource budget, separate from per-user quota. Unclaimed slots expire after a configurable TTL to prevent policy drift — on expiry the slot is deleted and the reconciler creates a fresh replacement.
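
TTL expiry can be sketched as a simple partition over unclaimed slots. The `WarmSlot` shape and `partition_expired` are hypothetical; the age is passed in directly so the logic does not depend on a clock.

```rust
use std::time::Duration;

#[derive(Debug)]
pub struct WarmSlot {
    pub sandbox_id: String,
    /// Time since the slot was created.
    pub age: Duration,
}

/// Splits unclaimed slots into `(kept, expired)` for a given TTL.
/// Expired slots would be deleted; the reconciler then creates replacements.
pub fn partition_expired(slots: Vec<WarmSlot>, ttl: Duration) -> (Vec<WarmSlot>, Vec<WarmSlot>) {
    slots.into_iter().partition(|s| s.age < ttl)
}
```

A slot whose age equals the TTL is treated as expired in this sketch; whether the boundary is inclusive is a detail the design would need to fix.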

The `agents.x-k8s.io` `SandboxWarmPool`/`SandboxClaim` CRDs are a natural external API surface for this feature on Kubernetes deployments, but whether OpenShell implements native pool management or exposes hooks for an external controller remains an open design decision.

## Scope Assessment

- **Complexity:** High
- **Confidence:** Medium — the hot-reload path is proven; the pre-allocation and claim mechanism needs design
- **Estimated files to change:** 8–12
- **Issue type:** `feat`

## Configuration Surface — Open Questions

Warmpool is only operationally useful if it is configurable without redeploying the gateway. The following questions need answers before implementation:

- **Where does pool configuration live?** The gateway already has a TOML config file (RFC 0003). Is pool size / strategy a static gateway-level setting in TOML, a dynamic API resource (CRD or gateway object), or a parameter on `CreateSandbox`? Static TOML is simpler but requires a gateway restart to resize the pool.

- **What is the configuration granularity?** Global per-gateway, per-namespace, per-sandbox-template, or per-provider? A global pool cannot serve heterogeneous workloads (different images, policies, resource profiles) without over-provisioning.

- **What update strategy is needed?** The `agents.x-k8s.io` `SandboxWarmPool` spec defines `Recreate` (stale pods deleted immediately) and `OnReplenish` (replaced only when adopted or manually deleted). Does OpenShell need both, or is one sufficient for the initial implementation?

- **How are warm slots counted against quota?** Warm sandboxes consume real cluster resources. Do they count against a namespace or user quota before they are claimed? Who "owns" an unclaimed warm slot for billing/chargeback purposes?

- **What is the TTL for a warm slot?** A sandbox pre-warmed against a policy version N becomes stale when the default policy is updated. Should unclaimed slots expire after a configurable TTL, be evicted on policy change, or be lazily refreshed at claim time via hot-reload?

- **How is target pool size controlled at runtime?** Static config requires a gateway restart. A dynamic API (e.g., a `WarmPool` object in the gateway store) would allow operators to resize the pool without downtime — but adds API surface and reconciliation logic.
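
To help compare the two update strategies, here is a small sketch of how each might treat slots built from an outdated template. It follows the descriptions given above (`Recreate` deletes stale pods immediately, `OnReplenish` leaves them until they are adopted or manually deleted); the `template_revision` counter and `slots_to_delete` are assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpdateStrategy {
    Recreate,
    OnReplenish,
}

#[derive(Debug)]
pub struct WarmSlot {
    pub sandbox_id: String,
    /// Hypothetical revision of the template the slot was built from.
    pub template_revision: u64,
}

/// Returns the IDs of slots to delete right after a template change.
pub fn slots_to_delete<'a>(
    strategy: UpdateStrategy,
    slots: &'a [WarmSlot],
    current_revision: u64,
) -> Vec<&'a str> {
    match strategy {
        // Stale slots are removed immediately and rebuilt by the reconciler.
        UpdateStrategy::Recreate => slots
            .iter()
            .filter(|s| s.template_revision != current_revision)
            .map(|s| s.sandbox_id.as_str())
            .collect(),
        // Stale slots stay until claimed or manually deleted; only their
        // replacements pick up the new revision.
        UpdateStrategy::OnReplenish => Vec::new(),
    }
}
```

Under `OnReplenish`, a user can still be handed a slot built from the old template, so the claim path would have to tolerate or hot-reload that difference.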

## Risks & Open Questions

- **Identity isolation between warm slots** — a warm sandbox has run init containers and may have stale workspace state. Does claiming a warm slot give a user a clean environment, or can PVC contents leak between users?
- **Policy at boot vs. at claim** — the supervisor loads policy on boot from `GetSandboxConfig`. A warm pod has a policy before it knows its future user. The claim step must hot-reload policy atomically before the user can connect. Is the window between READY and policy-reload safe?
- **External controller vs. native pool** — should warmpool be an OpenShell-native feature or driven externally via `SandboxWarmPool`/`SandboxClaim` CRDs? Native is simpler operationally; external is more composable with the broader `agents.x-k8s.io` ecosystem.
- **Non-Kubernetes drivers** — the VM driver and Docker driver have no pool concept. Does warmpool apply only to K8s deployments?
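
One way to reason about the policy-at-claim window is an explicit intermediate state between idle and claimed. The sketch below is an assumption-heavy model, not a proposal for the actual gateway types: a slot accepts user connections only after the user policy has been applied, and every other ordering is rejected.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SlotState {
    WarmIdle,
    /// Reserved for a user, but still running the base policy.
    Claiming { user: String },
    /// User policy applied; the slot is visible to and usable by `user`.
    Claimed { user: String },
}

#[derive(Debug, PartialEq, Eq)]
pub enum TransitionError {
    NotIdle,
    NotClaiming,
}

#[derive(Debug)]
pub struct Slot {
    pub state: SlotState,
    pub policy: String,
}

impl Slot {
    pub fn new(base_policy: &str) -> Self {
        Slot { state: SlotState::WarmIdle, policy: base_policy.to_owned() }
    }

    /// Step 1: reserve the slot so no other claim can take it.
    pub fn begin_claim(&mut self, user: &str) -> Result<(), TransitionError> {
        if self.state != SlotState::WarmIdle {
            return Err(TransitionError::NotIdle);
        }
        self.state = SlotState::Claiming { user: user.to_owned() };
        Ok(())
    }

    /// Step 2: apply the user policy, then flip visibility.
    pub fn finish_claim(&mut self, user_policy: &str) -> Result<(), TransitionError> {
        let user = match &self.state {
            SlotState::Claiming { user } => user.clone(),
            _ => return Err(TransitionError::NotClaiming),
        };
        self.policy = user_policy.to_owned();
        self.state = SlotState::Claimed { user };
        Ok(())
    }

    /// Connections are accepted only from the owner of a fully claimed slot.
    pub fn can_connect(&self, user: &str) -> bool {
        matches!(&self.state, SlotState::Claimed { user: u } if u.as_str() == user)
    }
}
```

In this model the unsafe window is closed by construction: while the slot is in `Claiming`, `can_connect` returns false for everyone, including the claiming user.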

## Test Considerations

- E2E test: measure the time from the `CreateSandbox` call to the first SSH byte on the warm path vs. the cold path
- Unit test: gateway pool reconciler, covering claim selection, slot replenishment, and pool exhaustion fallback to cold create
- Integration test: policy hot-reload correctness; verify that user A's policy is not visible after a warm slot is claimed by user B
- Existing pattern: `mise run e2e` covers the sandbox lifecycle; warmpool tests should extend that harness

---
*Created by spike investigation. Use `build-from-issue` to plan and implement.*
