feat: warmpool support for OpenShell sandboxes #1447

@pPrecel

Problem Statement

OpenShell sandboxes have a 5–15s cold start time due to sequential pod scheduling, workspace PVC seeding, supervisor initialization, and gateway registration. This makes OpenShell unsuitable for latency-sensitive workloads where users expect near-instant sandbox availability. Pre-warming sandboxes before a user request arrives would eliminate the dominant cold start costs, but the current architecture tightly couples sandbox identity to pod creation time, making external warmpool implementations impossible without gateway changes.

Technical Context

OpenShell already uses agents.x-k8s.io/v1alpha1 Sandbox as its Kubernetes backend — the higher-level SandboxWarmPool/SandboxTemplate/SandboxClaim resources from the same API group are a natural external interface, but OpenShell has no awareness of them today.

Two structural properties of the current design prevent external warmpool integration:

  1. Static identity binding: OPENSHELL_SANDBOX_ID is injected as a pod env var at creation time (driver.rs:1413-1415) and cannot be changed in a running pod. A warm pod must therefore have its sandbox_id assigned at boot, which means the gateway must know about it before the pod starts.

  2. Gateway store gate: on boot, the supervisor calls ConnectSupervisor with its sandbox_id. The gateway rejects any stream whose ID is not in its persistent store (supervisor_session.rs:572). Without a pre-existing store record, the supervisor is permanently stuck in PROVISIONING.

Both blockers mean an external controller creating SandboxWarmPool/SandboxTemplate objects cannot produce a usable warm sandbox without calling OpenShell's CreateSandbox API — which immediately schedules pod creation, defeating the purpose.

The good news: the per-user dynamic work (policy and credentials) is already hot-reloadable. The supervisor polls for config updates (lib.rs poll loop); provider credentials are live-updated. Only the pre-allocation and claim mechanism is missing.

Affected Components

| Component | Key Files | Role |
| --- | --- | --- |
| Gateway compute layer | crates/openshell-server/src/compute/mod.rs | Manages sandbox lifecycle, phases, supervisor sessions |
| Kubernetes driver | crates/openshell-driver-kubernetes/src/driver.rs | Creates agents.x-k8s.io Sandbox objects with static env |
| Supervisor session handler | crates/openshell-server/src/supervisor_session.rs | Validates and registers supervisor connections |
| Sandbox supervisor | crates/openshell-sandbox/src/lib.rs | Startup sequence, policy poll loop, ConnectSupervisor |
| Persistence layer | crates/openshell-server/src/persistence/ | Source of truth for sandbox records |

Technical Investigation

Cold start breakdown

| Phase | Estimate | Pre-warmable? |
| --- | --- | --- |
| Pod scheduling | 1–3s | Yes |
| Image pull (no cache) | 0–60s | Yes (cached images) |
| Init container: copy-self | 0.5–2s | Yes |
| Init container: workspace PVC seed | 1–10s | Yes, biggest win |
| Supervisor boot: netns + proxy + SSH | 1–4s | Yes |
| ConnectSupervisor handshake | 0.2–1s | Yes |
| Policy hot-reload (per user) | ~0.5s | No, user-specific |
| Credential injection | ~0s | No, already live |
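
Summing the table's estimates gives a rough sense of what pre-warming could remove. The sketch below is illustrative only: the `Phase` struct and `total_ms` helper are hypothetical, the approximate entries (~0.5s, ~0s) are treated as point values, and phases are assumed to run strictly one after another.

```rust
/// One row of the cold start table, with estimates in milliseconds.
#[allow(dead_code)] // `name` is only there for readability
struct Phase {
    name: &'static str,
    min_ms: u32,
    max_ms: u32,
    pre_warmable: bool,
}

/// Figures copied from the table above; "~0.5s" and "~0s" are taken as exact (assumption).
const PHASES: [Phase; 8] = [
    Phase { name: "Pod scheduling", min_ms: 1_000, max_ms: 3_000, pre_warmable: true },
    Phase { name: "Image pull (no cache)", min_ms: 0, max_ms: 60_000, pre_warmable: true },
    Phase { name: "Init container: copy-self", min_ms: 500, max_ms: 2_000, pre_warmable: true },
    Phase { name: "Init container: workspace PVC seed", min_ms: 1_000, max_ms: 10_000, pre_warmable: true },
    Phase { name: "Supervisor boot", min_ms: 1_000, max_ms: 4_000, pre_warmable: true },
    Phase { name: "ConnectSupervisor handshake", min_ms: 200, max_ms: 1_000, pre_warmable: true },
    Phase { name: "Policy hot-reload (per user)", min_ms: 500, max_ms: 500, pre_warmable: false },
    Phase { name: "Credential injection", min_ms: 0, max_ms: 0, pre_warmable: false },
];

/// Sums (min, max) over the phases whose `pre_warmable` flag matches the argument.
fn total_ms(phases: &[Phase], pre_warmable: bool) -> (u32, u32) {
    phases
        .iter()
        .filter(|p| p.pre_warmable == pre_warmable)
        .fold((0, 0), |(lo, hi), p| (lo + p.min_ms, hi + p.max_ms))
}
```

With these figures the pre-warmable phases total 3.7–80s (3.7–20s when the image is already cached), leaving roughly 0.5s of per-user work on the claim path.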

What would need to change

Gateway: introduce a warm_idle sandbox state for pre-allocated records that are not visible to users via ListSandboxes. A pool reconciler maintains N warm slots per template. On CreateSandbox, the gateway claims the oldest ready warm slot instead of creating a new pod.
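
A minimal sketch of the visibility rule and oldest-first claim described above. All names here (`SandboxPhase`, `SandboxRecord`, `list_sandboxes`, `claim_oldest_warm`) are hypothetical and do not reflect the actual gateway types; a `WarmIdle` record is assumed to mean the supervisor has already connected.

```rust
#[allow(dead_code)] // Provisioning is listed for completeness only
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SandboxPhase {
    Provisioning,
    WarmIdle, // pre-allocated, supervisor connected, no owner yet (assumed semantics)
    Ready,
}

#[derive(Debug, Clone)]
struct SandboxRecord {
    id: String,
    phase: SandboxPhase,
    created_at: u64, // seconds since an arbitrary epoch
    owner: Option<String>,
}

/// ListSandboxes as seen by one user: warm_idle records are never returned.
fn list_sandboxes<'a>(records: &'a [SandboxRecord], user: &str) -> Vec<&'a SandboxRecord> {
    records
        .iter()
        .filter(|r| r.phase != SandboxPhase::WarmIdle && r.owner.as_deref() == Some(user))
        .collect()
}

/// Claims the oldest warm_idle record for `user`, or returns None if the pool is empty.
fn claim_oldest_warm(records: &mut [SandboxRecord], user: &str) -> Option<String> {
    let slot = records
        .iter_mut()
        .filter(|r| r.phase == SandboxPhase::WarmIdle)
        .min_by_key(|r| r.created_at)?;
    slot.phase = SandboxPhase::Ready;
    slot.owner = Some(user.to_string());
    Some(slot.id.clone())
}
```

A `None` result is where the gateway would fall back to the existing cold create path.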

Kubernetes driver: pre-creation path that allocates a sandbox_id, inserts the store record, and creates the pod — without a user request triggering it.
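
The ordering matters because of the store gate: the record has to exist before the supervisor calls ConnectSupervisor. The sketch below models that with a hypothetical in-memory `Gateway`; the real store, ID format, and driver calls will differ.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory stand-in for the gateway store and the Kubernetes driver.
#[derive(Default)]
struct Gateway {
    store: HashSet<String>, // sandbox IDs that have a persistent record
    pods: Vec<String>,      // IDs for which a pod has been requested
    next_id: u64,
}

impl Gateway {
    /// Pre-creation path: no user request is involved.
    fn pre_create_warm_slot(&mut self) -> String {
        // 1. Allocate a sandbox_id (the "warm-N" format is made up for this sketch).
        self.next_id += 1;
        let id = format!("warm-{}", self.next_id);
        // 2. Insert the store record first, so the gate below will accept the supervisor.
        self.store.insert(id.clone());
        // 3. Only then request the pod, which would carry OPENSHELL_SANDBOX_ID = id.
        self.pods.push(id.clone());
        id
    }

    /// Models the store gate: streams with unknown IDs are rejected.
    fn connect_supervisor(&self, sandbox_id: &str) -> Result<(), String> {
        if self.store.contains(sandbox_id) {
            Ok(())
        } else {
            Err(format!("unknown sandbox {}", sandbox_id))
        }
    }
}
```

A pod created by an external controller without step 2 would hit the `Err` branch, which corresponds to the stuck-in-PROVISIONING case described earlier.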

Supervisor: no changes needed. It already supports hot-reload of policy and live credential updates.

New RPC (optional): ClaimSandbox(warm_sandbox_id, user_metadata, policy) — or reuse UpdateConfig on an existing READY sandbox.

Relationship to agents.x-k8s.io SandboxWarmPool

SandboxWarmPool + SandboxTemplate + SandboxClaim could serve as the external Kubernetes API surface for this feature: the pool controller would live outside OpenShell and drive pre-allocation via CreateSandbox with a warm flag. Alternatively, OpenShell could implement the pool natively. This is a key design decision for human review.

Proposed Approach

Introduce a WarmPool configuration object in the gateway store (alongside the existing TOML config from RFC 0003). Each pool entry specifies a target replica count, a sandbox template reference (image, resource profile, base policy), and an update strategy (Recreate or OnReplenish). This allows operators to resize pools at runtime without a gateway restart.
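
One possible shape for such an object, as a sketch only. The field names, the `TemplateRef` type, and the string form of the strategy are assumptions; only the two strategy names and the three template attributes come from the text above.

```rust
use std::str::FromStr;

/// Update strategies named in the proposal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UpdateStrategy {
    Recreate,
    OnReplenish,
}

impl FromStr for UpdateStrategy {
    type Err = String;

    /// Parses the strategy from its config spelling (assumed to match the variant name).
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "Recreate" => Ok(UpdateStrategy::Recreate),
            "OnReplenish" => Ok(UpdateStrategy::OnReplenish),
            other => Err(format!("unknown update strategy: {}", other)),
        }
    }
}

/// Hypothetical template reference: image, resource profile, base policy.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
struct TemplateRef {
    image: String,
    resource_profile: String,
    base_policy: String,
}

/// Hypothetical pool entry stored alongside the gateway config.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct WarmPool {
    name: String,
    target_replicas: u32,
    template: TemplateRef,
    update_strategy: UpdateStrategy,
}

impl WarmPool {
    /// Runtime resize: only the target changes; a reconciler would converge on it later.
    fn resize(&mut self, target_replicas: u32) {
        self.target_replicas = target_replicas;
    }
}
```

Keeping resize as a plain field update is what would let operators change pool size without a gateway restart; the actual pod churn happens in the reconciler.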

The gateway runs a pool reconciler loop that maintains the target number of warm_idle sandbox records and their corresponding pods. Warm slots are pre-created with a gateway-assigned sandbox_id and a base policy; the supervisor boots and registers normally via ConnectSupervisor, but the slot is not visible to users via ListSandboxes until claimed.
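
A single reconciler pass can be reduced to comparing the target with the current warm_idle count. The sketch below uses hypothetical names (`ReconcileAction`, `reconcile`) and ignores pod readiness, budgets, and TTLs.

```rust
use std::cmp::Ordering;

/// What one reconciler pass decides for a single pool.
#[derive(Debug, PartialEq, Eq)]
enum ReconcileAction {
    CreateSlots(u32),
    DeleteSlots(u32),
    InSync,
}

/// Compares the pool's target with the number of warm_idle slots that currently exist.
/// Claimed slots are no longer warm_idle, so a claim shows up here as a deficit of one.
fn reconcile(target_replicas: u32, warm_idle: u32) -> ReconcileAction {
    match warm_idle.cmp(&target_replicas) {
        Ordering::Less => ReconcileAction::CreateSlots(target_replicas - warm_idle),
        Ordering::Greater => ReconcileAction::DeleteSlots(warm_idle - target_replicas),
        Ordering::Equal => ReconcileAction::InSync,
    }
}
```

Because the function is a pure comparison, it can be unit tested without a cluster, and running it on a timer gives asynchronous replenishment for free.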

On CreateSandbox, the gateway first attempts to claim a matching warm slot — matching on template reference (image, resource profile). If a slot is available, it hot-reloads the user-specific policy onto the already-running supervisor before flipping visibility to the requesting user. If no warm slot is available, it falls back to a cold create. The pool reconciler replenishes the claimed slot asynchronously.
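
The claim path can be sketched as follows, under the assumption that matching is plain equality on image and resource profile. `WarmSlot`, `CreateOutcome`, and `create_sandbox` are hypothetical, and the policy hot-reload is modelled as a field write rather than a call to the supervisor.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
struct TemplateRef {
    image: String,
    resource_profile: String,
}

#[allow(dead_code)]
#[derive(Debug)]
struct WarmSlot {
    id: String,
    template: TemplateRef,
    policy: String,        // policy currently loaded by the supervisor
    owner: Option<String>, // None while the slot is warm_idle
}

#[derive(Debug, PartialEq, Eq)]
enum CreateOutcome {
    Warm(String), // ID of the claimed warm slot
    Cold,         // no match: caller falls back to a cold create
}

/// Tries to satisfy a CreateSandbox request from the warm pool.
fn create_sandbox(
    pool: &mut Vec<WarmSlot>,
    claimed: &mut Vec<WarmSlot>,
    want: &TemplateRef,
    user: &str,
    user_policy: &str,
) -> CreateOutcome {
    match pool.iter().position(|s| &s.template == want) {
        Some(i) => {
            let mut slot = pool.remove(i);
            // Step 1: load the user-specific policy while the slot is still invisible.
            slot.policy = user_policy.to_string();
            // Step 2: only then make the slot visible to the requesting user.
            slot.owner = Some(user.to_string());
            let id = slot.id.clone();
            claimed.push(slot);
            CreateOutcome::Warm(id)
        }
        None => CreateOutcome::Cold,
    }
}
```

Removing the slot from `pool` before it gains an owner keeps it from being handed to two requests; a real implementation would need the equivalent guarantee from the store (for example a transactional update).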

Warm slots count against a configurable pool-level resource budget, separate from per-user quota. Unclaimed slots expire after a configurable TTL to prevent policy drift — on expiry the slot is deleted and the reconciler creates a fresh replacement.
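
Both rules are simple to express in isolation. The sketch below assumes the budget is counted in slots and the TTL is measured from slot creation; neither is specified above, and the names are hypothetical.

```rust
#[allow(dead_code)]
#[derive(Debug)]
struct WarmSlot {
    id: String,
    created_at_secs: u64,
}

/// Splits unclaimed slots into (kept, expired) given the current time and a TTL.
/// Expired slots would be deleted and later replaced by the reconciler.
fn expire_slots(slots: Vec<WarmSlot>, now_secs: u64, ttl_secs: u64) -> (Vec<WarmSlot>, Vec<WarmSlot>) {
    slots
        .into_iter()
        .partition(|s| now_secs.saturating_sub(s.created_at_secs) < ttl_secs)
}

/// Number of slots the reconciler may create: the deficit, capped by the remaining pool budget.
fn creatable(target_replicas: u32, current: u32, budget_remaining: u32) -> u32 {
    target_replicas.saturating_sub(current).min(budget_remaining)
}
```

Capping at the budget means a pool can sit below its target indefinitely; whether that should surface as a warning or a status field is left open here.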

The agents.x-k8s.io SandboxWarmPool/SandboxClaim CRDs are a natural external API surface for this feature on Kubernetes deployments, but whether OpenShell implements native pool management or exposes hooks for an external controller remains an open design decision.

Scope Assessment

  • Complexity: High
  • Confidence: Medium — the hot-reload path is proven; the pre-allocation and claim mechanism needs design
  • Estimated files to change: 8–12
  • Issue type: feat

Configuration Surface — Open Questions

Warmpool is only operationally useful if it is configurable without redeploying the gateway. The following questions need answers before implementation:

  • Where does pool configuration live? The gateway already has a TOML config file (RFC 0003). Is pool size / strategy a static gateway-level setting in TOML, a dynamic API resource (CRD or gateway object), or a parameter on CreateSandbox? Static TOML is simpler but requires a gateway restart to resize the pool.

  • What is the configuration granularity? Global per-gateway, per-namespace, per-sandbox-template, or per-provider? A global pool cannot serve heterogeneous workloads (different images, policies, resource profiles) without over-provisioning.

  • What update strategy is needed? The agents.x-k8s.io SandboxWarmPool spec defines Recreate (stale pods deleted immediately) and OnReplenish (replaced only when adopted or manually deleted). Does OpenShell need both, or is one sufficient for the initial implementation?

  • How are warm slots counted against quota? Warm sandboxes consume real cluster resources. Do they count against a namespace or user quota before they are claimed? Who "owns" an unclaimed warm slot for billing/chargeback purposes?

  • What is the TTL for a warm slot? A sandbox pre-warmed against a policy version N becomes stale when the default policy is updated. Should unclaimed slots expire after a configurable TTL, be evicted on policy change, or be lazily refreshed at claim time via hot-reload?

  • How is target pool size controlled at runtime? Static config requires a gateway restart. A dynamic API (e.g., a WarmPool object in the gateway store) would allow operators to resize the pool without downtime — but adds API surface and reconciliation logic.

Risks & Open Questions

  • Identity isolation between warm slots — a warm sandbox has run init containers and may have stale workspace state. Does claiming a warm slot give a user a clean environment, or can PVC contents leak between users?
  • Policy at boot vs. at claim — the supervisor loads policy on boot from GetSandboxConfig. A warm pod has a policy before it knows its future user. The claim step must hot-reload policy atomically before the user can connect. Is the window between READY and policy-reload safe?
  • External controller vs. native pool — should warmpool be an OpenShell-native feature or driven externally via SandboxWarmPool/SandboxClaim CRDs? Native is simpler operationally; external is more composable with the broader agents.x-k8s.io ecosystem.
  • Non-Kubernetes drivers — the VM driver and Docker driver have no pool concept. Does warmpool apply only to K8s deployments?

Test Considerations

  • E2e test: measure time from CreateSandbox call to first SSH byte with warm vs. cold path
  • Unit test: gateway pool reconciler — claim selection, slot replenishment, pool exhaustion fallback to cold create
  • Integration test: policy hot-reload correctness — verify user A's policy is not visible after warm slot is claimed by user B
  • Existing pattern: mise run e2e covers the sandbox lifecycle; warmpool tests should extend that harness

Created by spike investigation. Use build-from-issue to plan and implement.


Labels: state:triage-needed (opened without agent diagnostics and needs triage)
