Xiangyue-Zhang · Xiangyue-Zhang · Apr 19, 2026 · Apr 19, 2026
diff --git a/AI_GUIDE.md b/AI_GUIDE.md
@@ -348,6 +348,33 @@ During training (90%+ of time), the agent does NOT call the LLM. It only does:
 - **Writing Agent**: generates reports (3 tools)
 - Only 1 worker active at a time, others cost $0
 
+### Tool-Use Protocol (provider-agnostic)
+
+Workers do not use each provider's native SDK tool-use protocol. Instead the
+framework injects a plain-text schema into the system prompt and the worker
+emits tool calls as `<tool_call>{...}</tool_call>` blocks. The dispatcher
+parses the blocks, runs each through `ToolRegistry.execute_tool`, and feeds
+results back as `<tool_result name="...">...</tool_result>` in the next user
+turn. The loop runs until the worker produces a response with no tool calls
+(the final answer) or `max_turns` is reached.
+
+Why this design:
+
+- **One protocol, four providers** — the Anthropic and OpenAI SDK paths use
+  the same text protocol as `claude_cli` and `codex_cli`. No per-provider
+  branching in the execution loop.
+- **Authoritative PID / log_file** — the EXECUTE → MONITOR handoff reads
+  `pid` and `log_file` directly from the `launch_experiment` tool's JSON
+  result, not from regex-scraping the model's prose.
+- **Provider-lock-down** — for `claude_cli` the framework passes
+  `--tools ""` so the CLI cannot bypass the protocol with its own built-in
+  tools. `codex_cli` has no equivalent flag and will silently ignore the
+  protocol; a runtime warning is emitted when it is used as a worker, and
+  users should pick one of the other three providers for worker dispatches.
+- **Fence stripping** — tool-call blocks inside triple-backtick code fences
+  are ignored, so a model's illustrative example in its prose is never
+  accidentally executed.
+
 ### Safety
 - Mandatory dry-run before every real training
 - Protected files can't be overwritten

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -396,6 +396,30 @@ During training (90%+ of time), the agent does NOT call the LLM. It only does:
 - **Writing Agent**: generates reports (3 tools)
 - Only 1 worker active at a time, others cost $0
 
+### Tool-Use Protocol (provider-agnostic)
+
+Workers do not use each provider's native SDK tool-use protocol. Instead the
+framework injects a plain-text schema into the system prompt and the worker
+emits tool calls as `<tool_call>{...}</tool_call>` blocks. The dispatcher
+parses the blocks, runs each through `ToolRegistry.execute_tool`, and feeds
+results back as `<tool_result name="...">...</tool_result>` in the next user
+turn. The loop runs until the worker produces a response with no tool calls
+(the final answer) or `max_turns` is reached.
+
+Key properties:
+
+- One text protocol works identically across all four providers — no
+  per-provider branching in the execution loop.
+- `launch_experiment` PID and log_file come authoritatively from the tool
+  result's JSON, not from regex-scraping the model's prose.
+- For `claude_cli` the framework passes `--tools ""` so the CLI cannot
+  bypass the protocol with its own built-in tools. `codex_cli` has no
+  equivalent flag and may silently ignore the protocol; a runtime warning
+  is emitted when it is used as a worker, and users should pick one of the
+  other three providers for worker dispatches.
+- Tool-call blocks inside triple-backtick code fences are ignored, so a
+  model's illustrative example in prose is never accidentally executed.
+
 ### Safety
 - Mandatory dry-run before every real training
 - Protected files can't be overwritten

diff --git a/README.md b/README.md
@@ -37,6 +37,14 @@
 
 ## Recent Updates
 
+**2026-04-19**
+- Workers now execute tools through a real multi-turn tool-use loop. The dispatcher injects the tool schema into the system prompt, parses `<tool_call>` blocks from the LLM response, runs each through `ToolRegistry.execute_tool`, feeds results back as `<tool_result>` in the next turn, and iterates until the worker produces a response with no tool calls or `max_turns` is hit. Previously the `tools` argument was accepted and silently dropped, and worker output was regex-scraped for PIDs — closes the gap raised in issue #13.
+- `launch_experiment` PIDs and log file paths are now surfaced directly from the tool result (authoritative), with the old free-text regex retained only as a fallback for pre-protocol responses.
+- `claude_cli` is forced into pure-text mode via `claude -p --tools ""`, so its responses reliably go through the framework's protocol.
+- `codex_cli` cannot be forced into pure-text mode by any current flag; when used as a worker provider the framework now emits a clear warning (see the updated compatibility table in *Supported LLM Providers*).
+- Tool-call blocks inside triple-backtick code fences are stripped before parsing, so illustrative examples in the LLM's prose are no longer accidentally executed.
+- Dead parameters (`tools`, `max_turns`) removed from `_call_llm`. They were never forwarded to the SDK; this aligns the code with what it actually does.
+
 **2026-04-18**
 - Added two new `provider` modes that reuse existing flat-rate subscriptions instead of per-token API billing: `claude_cli` (via the local `claude -p` CLI) and `codex_cli` (via the local `codex exec` CLI). Much cheaper when running multiple 24/7 agents in parallel. See the updated *Supported LLM Providers* section for the full API-vs-subscription trade-off table.
 - Provider validation added at dispatcher construction; unknown provider values now fail fast with a clear error instead of silently falling through.
@@ -863,18 +871,21 @@ Works with **both Anthropic and OpenAI** out of the box, and can run on a
 
 ### Authentication mode: API key vs. subscription
 
-| Mode | `provider` value | Billing | Requires |
-|------|------------------|---------|----------|
-| API — Anthropic | `anthropic` | Per-token, via `ANTHROPIC_API_KEY` | `pip install anthropic` |
-| API — OpenAI | `openai` | Per-token, via `OPENAI_API_KEY` | `pip install openai` |
-| **Subscription — Claude** | `claude_cli` | Flat-rate, uses your Claude Code / Pro / Max plan | `claude` CLI installed and logged in |
-| **Subscription — ChatGPT** | `codex_cli` | Flat-rate, uses your ChatGPT Plus / Pro plan | `codex` CLI installed and logged in |
-
-The `*_cli` modes shell out to the headless CLI (`claude -p` / `codex exec`) and
-share your existing subscription quota. This is much cheaper than per-token
-billing when you run multiple agents in parallel or do heavy Think/Reflect
-cycles. Trade-off: no native prompt caching or tool-use protocol — the CLI is
-used as a plain text-in / text-out oracle.
+| Mode | `provider` value | Billing | Requires | Tool-use support |
+|------|------------------|---------|----------|------------------|
+| API — Anthropic | `anthropic` | Per-token, via `ANTHROPIC_API_KEY` | `pip install anthropic` | ✅ Full |
+| API — OpenAI | `openai` | Per-token, via `OPENAI_API_KEY` | `pip install openai` | ✅ Full |
+| **Subscription — Claude** | `claude_cli` | Flat-rate, uses your Claude Code / Pro / Max plan | `claude` CLI installed and logged in | ✅ Full |
+| **Subscription — ChatGPT** | `codex_cli` | Flat-rate, uses your ChatGPT Plus / Pro plan | `codex` CLI installed and logged in | ⚠️ Leader only |
+
+Tool execution is driven by a text-based `<tool_call>` protocol injected
+into the worker's system prompt. All three "Full" providers can be forced
+into pure text-oracle mode so they honor the protocol (for `claude_cli`
+the framework passes `--tools ""` to disable built-in CLI tools). The
+`codex` CLI currently offers no equivalent flag — its internal agentic
+loop will bypass the protocol and the framework cannot recover PIDs from
+experiments it launches. Use `codex_cli` only for the leader/think path
+where no tools are needed.
 
 Switch provider in `config.yaml`:
 ```yaml

diff --git a/config.yaml b/config.yaml
@@ -8,14 +8,23 @@ project:
 
 agent:
   # Provider:
-  #   "anthropic"  — Anthropic SDK, per-token API billing (ANTHROPIC_API_KEY)
-  #   "openai"     — OpenAI SDK, per-token API billing (OPENAI_API_KEY)
-  #   "claude_cli" — `claude -p` subprocess, reuses Claude Code / Pro / Max subscription
-  #   "codex_cli"  — `codex exec` subprocess, reuses ChatGPT Plus / Pro subscription
+  #   "anthropic"  — SDK, per-token API billing (ANTHROPIC_API_KEY)
+  #   "openai"     — SDK, per-token API billing (OPENAI_API_KEY)
+  #   "claude_cli" — subprocess, reuses Claude Code / Pro / Max subscription
+  #   "codex_cli"  — subprocess, reuses ChatGPT Plus / Pro subscription
   #
-  # The *_cli options are much cheaper when running many agents in parallel,
-  # because they share your existing subscription quota instead of per-token billing.
-  # They require the corresponding CLI to be installed and logged in once on this host.
+  # The *_cli options are much cheaper when running many agents in parallel
+  # because they share your existing subscription quota instead of per-token
+  # billing. They require the corresponding CLI to be installed and logged
+  # in once on this host.
+  #
+  # Tool-use compatibility: the worker path drives tools through a text-based
+  # <tool_call> protocol. Compatible providers: "anthropic", "openai",
+  # "claude_cli" (we force `--tools ""` so the CLI cannot bypass it).
+  # "codex_cli" cannot be forced into pure-text mode by any current CLI flag,
+  # so it will silently bypass the ToolRegistry and lose PID tracking. Use it
+  # for the leader/think path only; keep one of the three other providers for
+  # worker dispatches.
   provider: "anthropic"
 
   # Model selection per provider: