diff --git a/AI_GUIDE.md b/AI_GUIDE.md
index 6f042ce..79ff6d5 100644
--- a/AI_GUIDE.md
+++ b/AI_GUIDE.md
@@ -348,6 +348,33 @@ During training (90%+ of time), the agent does NOT call the LLM. It only does:
- **Writing Agent**: generates reports (3 tools)
- Only 1 worker active at a time, others cost $0
+### Tool-Use Protocol (provider-agnostic)
+
+Workers do not use each provider's native SDK tool-use protocol. Instead the
+framework injects a plain-text schema into the system prompt and the worker
+emits tool calls as `{...}` blocks. The dispatcher
+parses the blocks, runs each through `ToolRegistry.execute_tool`, and feeds
+results back as `...` in the next user
+turn. The loop runs until the worker produces a response with no tool calls
+(the final answer) or `max_turns` is reached.
+
+Why this design:
+
+- **One protocol, four providers** — the Anthropic and OpenAI SDK paths use
+ the same text protocol as `claude_cli` and `codex_cli`. No per-provider
+ branching in the execution loop.
+- **Authoritative PID / log_file** — the EXECUTE → MONITOR handoff reads
+ `pid` and `log_file` directly from the `launch_experiment` tool's JSON
+ result, not from regex-scraping the model's prose.
+- **Provider-lock-down** — for `claude_cli` the framework passes
+ `--tools ""` so the CLI cannot bypass the protocol with its own built-in
+ tools. `codex_cli` has no equivalent flag and will silently ignore the
+ protocol; a runtime warning is emitted when it is used as a worker, and
+ users should pick one of the other three providers for worker dispatches.
+- **Fence stripping** — tool-call blocks inside triple-backtick code fences
+ are ignored, so a model's illustrative example in its prose is never
+ accidentally executed.
+
### Safety
- Mandatory dry-run before every real training
- Protected files can't be overwritten
diff --git a/CLAUDE.md b/CLAUDE.md
index a5f74f3..8bda4ba 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -396,6 +396,30 @@ During training (90%+ of time), the agent does NOT call the LLM. It only does:
- **Writing Agent**: generates reports (3 tools)
- Only 1 worker active at a time, others cost $0
+### Tool-Use Protocol (provider-agnostic)
+
+Workers do not use each provider's native SDK tool-use protocol. Instead the
+framework injects a plain-text schema into the system prompt and the worker
+emits tool calls as `{...}` blocks. The dispatcher
+parses the blocks, runs each through `ToolRegistry.execute_tool`, and feeds
+results back as `...` in the next user
+turn. The loop runs until the worker produces a response with no tool calls
+(the final answer) or `max_turns` is reached.
+
+Key properties:
+
+- One text protocol works identically across all four providers — no
+ per-provider branching in the execution loop.
+- `launch_experiment` PID and log_file come authoritatively from the tool
+ result's JSON, not from regex-scraping the model's prose.
+- For `claude_cli` the framework passes `--tools ""` so the CLI cannot
+ bypass the protocol with its own built-in tools. `codex_cli` has no
+ equivalent flag and may silently ignore the protocol; a runtime warning
+ is emitted when it is used as a worker, and users should pick one of the
+ other three providers for worker dispatches.
+- Tool-call blocks inside triple-backtick code fences are ignored, so a
+ model's illustrative example in prose is never accidentally executed.
+
### Safety
- Mandatory dry-run before every real training
- Protected files can't be overwritten
diff --git a/README.md b/README.md
index 0b4666d..a3a1a0e 100644
--- a/README.md
+++ b/README.md
@@ -37,6 +37,14 @@
## Recent Updates
+**2026-04-19**
+- Workers now execute tools through a real multi-turn tool-use loop. The dispatcher injects the tool schema into the system prompt, parses `` blocks from the LLM response, runs each through `ToolRegistry.execute_tool`, feeds results back as `` in the next turn, and iterates until the worker produces a response with no tool calls or `max_turns` is hit. Previously the `tools` argument was accepted and silently dropped, and worker output was regex-scraped for PIDs — closes the gap raised in issue #13.
+- `launch_experiment` PIDs and log file paths are now surfaced directly from the tool result (authoritative), with the old free-text regex retained only as a fallback for pre-protocol responses.
+- `claude_cli` is forced into pure-text mode via `claude -p --tools ""`, so its responses reliably go through the framework's protocol.
+- `codex_cli` cannot be forced into pure-text mode by any current flag; when used as a worker provider the framework now emits a clear warning (see the updated compatibility table in *Supported LLM Providers*).
+- Tool-call blocks inside triple-backtick code fences are stripped before parsing, so illustrative examples in the LLM's prose are no longer accidentally executed.
+- Dead parameters (`tools`, `max_turns`) removed from `_call_llm`. They were never forwarded to the SDK; this aligns the code with what it actually does.
+
**2026-04-18**
- Added two new `provider` modes that reuse existing flat-rate subscriptions instead of per-token API billing: `claude_cli` (via the local `claude -p` CLI) and `codex_cli` (via the local `codex exec` CLI). Much cheaper when running multiple 24/7 agents in parallel. See the updated *Supported LLM Providers* section for the full API-vs-subscription trade-off table.
- Provider validation added at dispatcher construction; unknown provider values now fail fast with a clear error instead of silently falling through.
@@ -863,18 +871,21 @@ Works with **both Anthropic and OpenAI** out of the box, and can run on a
### Authentication mode: API key vs. subscription
-| Mode | `provider` value | Billing | Requires |
-|------|------------------|---------|----------|
-| API — Anthropic | `anthropic` | Per-token, via `ANTHROPIC_API_KEY` | `pip install anthropic` |
-| API — OpenAI | `openai` | Per-token, via `OPENAI_API_KEY` | `pip install openai` |
-| **Subscription — Claude** | `claude_cli` | Flat-rate, uses your Claude Code / Pro / Max plan | `claude` CLI installed and logged in |
-| **Subscription — ChatGPT** | `codex_cli` | Flat-rate, uses your ChatGPT Plus / Pro plan | `codex` CLI installed and logged in |
-
-The `*_cli` modes shell out to the headless CLI (`claude -p` / `codex exec`) and
-share your existing subscription quota. This is much cheaper than per-token
-billing when you run multiple agents in parallel or do heavy Think/Reflect
-cycles. Trade-off: no native prompt caching or tool-use protocol — the CLI is
-used as a plain text-in / text-out oracle.
+| Mode | `provider` value | Billing | Requires | Tool-use support |
+|------|------------------|---------|----------|------------------|
+| API — Anthropic | `anthropic` | Per-token, via `ANTHROPIC_API_KEY` | `pip install anthropic` | ✅ Full |
+| API — OpenAI | `openai` | Per-token, via `OPENAI_API_KEY` | `pip install openai` | ✅ Full |
+| **Subscription — Claude** | `claude_cli` | Flat-rate, uses your Claude Code / Pro / Max plan | `claude` CLI installed and logged in | ✅ Full |
+| **Subscription — ChatGPT** | `codex_cli` | Flat-rate, uses your ChatGPT Plus / Pro plan | `codex` CLI installed and logged in | ⚠️ Leader only |
+
+Tool execution is driven by a text-based `` protocol injected
+into the worker's system prompt. All three "Full" providers can be forced
+into pure text-oracle mode so they honor the protocol (for `claude_cli`
+the framework passes `--tools ""` to disable built-in CLI tools). The
+`codex` CLI currently offers no equivalent flag — its internal agentic
+loop will bypass the protocol and the framework cannot recover PIDs from
+experiments it launches. Use `codex_cli` only for the leader/think path
+where no tools are needed.
Switch provider in `config.yaml`:
```yaml
diff --git a/config.yaml b/config.yaml
index 45949de..36e6918 100644
--- a/config.yaml
+++ b/config.yaml
@@ -8,14 +8,23 @@ project:
agent:
# Provider:
- # "anthropic" — Anthropic SDK, per-token API billing (ANTHROPIC_API_KEY)
- # "openai" — OpenAI SDK, per-token API billing (OPENAI_API_KEY)
- # "claude_cli" — `claude -p` subprocess, reuses Claude Code / Pro / Max subscription
- # "codex_cli" — `codex exec` subprocess, reuses ChatGPT Plus / Pro subscription
+ # "anthropic" — SDK, per-token API billing (ANTHROPIC_API_KEY)
+ # "openai" — SDK, per-token API billing (OPENAI_API_KEY)
+ # "claude_cli" — subprocess, reuses Claude Code / Pro / Max subscription
+ # "codex_cli" — subprocess, reuses ChatGPT Plus / Pro subscription
#
- # The *_cli options are much cheaper when running many agents in parallel,
- # because they share your existing subscription quota instead of per-token billing.
- # They require the corresponding CLI to be installed and logged in once on this host.
+ # The *_cli options are much cheaper when running many agents in parallel
+ # because they share your existing subscription quota instead of per-token
+ # billing. They require the corresponding CLI to be installed and logged
+ # in once on this host.
+ #
+ # Tool-use compatibility: the worker path drives tools through a text-based
+ # protocol. Compatible providers: "anthropic", "openai",
+ # "claude_cli" (we force `--tools ""` so the CLI cannot bypass it).
+ # "codex_cli" cannot be forced into pure-text mode by any current CLI flag,
+ # so it will silently bypass the ToolRegistry and lose PID tracking. Use it
+ # for the leader/think path only; keep one of the three other providers for
+ # worker dispatches.
provider: "anthropic"
# Model selection per provider:
diff --git a/core/agents.py b/core/agents.py
index d0fe267..11f38aa 100644
--- a/core/agents.py
+++ b/core/agents.py
@@ -6,10 +6,20 @@
- Workers: Specialized agents (idea/code/writing), spawned on demand
Only ONE worker runs at a time. Others idle at zero token cost.
+
+Tool use is implemented via a provider-agnostic text protocol. The LLM
+emits {...} blocks, the dispatcher executes each
+call through the ToolRegistry, and results are fed back as
+... blocks in the next user turn.
+The loop runs until the worker produces a response with no tool calls
+(the final answer) or max_turns is exceeded. This works uniformly
+across all four providers — the API SDKs don't use their native
+tool-use protocol, and the CLI providers are simply text oracles.
"""
import json
import logging
+import re
from pathlib import Path
from typing import Optional
@@ -20,6 +30,14 @@
AGENTS_DIR = Path(__file__).parent.parent / "agents"
+# Tool-use text protocol
+_TOOL_CALL_RE = re.compile(r"\s*(\{.*?\})\s*", re.DOTALL)
+# Triple-backtick fenced blocks are stripped before parsing so that LLMs can
+# illustrate the protocol inside code fences without triggering real tool
+# execution. Matches ``` with an optional language tag through the next ```.
+_FENCED_BLOCK_RE = re.compile(r"```[^\n]*\n.*?```", re.DOTALL)
+
+
class AgentDispatcher:
"""Dispatches tasks to specialized agents.
@@ -96,47 +114,109 @@ def dispatch_leader(self, task: str, context: dict) -> dict:
"content": self._format_leader_input(task, context),
})
- response = self._call_llm(
- system=system_prompt,
- messages=messages,
- max_turns=10,
- )
+ response = self._call_llm(system=system_prompt, messages=messages)
# Persist conversation for within-cycle coherence
self._leader_history = messages + [{"role": "assistant", "content": response}]
return self._parse_leader_response(response)
- def dispatch_worker(self, agent_type: str, task: str, tools: list) -> dict:
- """Dispatch a task to a worker agent.
+ def dispatch_worker(self, agent_type: str, task: str, tool_registry) -> dict:
+ """Dispatch a task to a worker agent and run its tool-use loop.
- Workers are stateless — each dispatch is independent.
- This keeps token costs predictable.
+ Workers are stateless across dispatches — each call starts with a
+ fresh conversation. Within a single dispatch the conversation is
+ multi-turn: the worker may emit tool calls, receive results, and
+ continue reasoning until it produces a final answer (a response
+ containing no blocks).
Args:
- agent_type: "idea", "code", or "writing"
- task: Task description from the Leader
- tools: Tool definitions to provide
+ agent_type: "idea", "code", or "writing".
+ task: Task description from the Leader.
+ tool_registry: ToolRegistry that provides `get_tools_for` and
+ `execute_tool`. The registry itself is passed in so this
+ module does not have a hard import dependency on tools.py.
Returns:
- Worker's result as a dict
+ Dict with at minimum `agent` and `response`. If the worker
+ called `launch_experiment`, the PID and log_file from that
+ tool result are also surfaced at the top level so the loop's
+ EXECUTE → MONITOR handoff keeps working.
"""
if agent_type not in self.WORKER_CONFIGS:
raise ValueError(f"Unknown agent type: {agent_type}")
+ if tool_registry is None:
+ raise TypeError(
+ "dispatch_worker requires a tool_registry with "
+ "`get_tools_for(agent_type)` and `execute_tool(name, args)` "
+ "methods. Pass a ToolRegistry configured with an empty tool "
+ "list if you want a tool-less worker."
+ )
config = self.WORKER_CONFIGS[agent_type]
- system_prompt = self._load_prompt(config["prompt_file"])
+ base_prompt = self._load_prompt(config["prompt_file"])
+ tool_defs = tool_registry.get_tools_for(agent_type)
+ system_prompt = base_prompt + "\n\n" + self._render_tools_section(tool_defs)
+ max_turns = config["max_turns"]
+
+ # codex_cli hard-codes its own agentic tool loop; it will ignore the
+ # protocol and silently act on its own. That breaks the
+ # EXECUTE → MONITOR handoff (no PID, no log_file from ToolRegistry).
+ # Leader/think dispatches are fine (they do not use tools) but worker
+ # dispatches will likely return a non-authoritative summary. Warn once
+ # per dispatch so users see it in the log without it becoming noise.
+ if self.provider == "codex_cli" and tool_defs:
+ logger.warning(
+ "codex_cli is being used as a worker provider; its CLI does "
+ "not support disabling built-in tools, so it may bypass the "
+ "ToolRegistry and the resulting PID/log_file cannot be "
+ "recovered. For worker dispatches prefer claude_cli, "
+ "anthropic, or openai."
+ )
logger.info(f"Dispatching {agent_type} agent: {task[:100]}...")
- response = self._call_llm(
- system=system_prompt,
- messages=[{"role": "user", "content": task}],
- tools=tools,
- max_turns=config["max_turns"],
- )
+ messages = [{"role": "user", "content": task}]
+ last_response = ""
+ tool_results_log: list[dict] = []
+
+ for turn in range(1, max_turns + 1):
+ last_response = self._call_llm(system=system_prompt, messages=messages)
+
+ tool_calls = self._parse_tool_calls(last_response)
+ if not tool_calls:
+ # No tool calls → worker has produced its final answer.
+ break
+
+ # Echo the assistant turn so the next LLM call sees the history.
+ messages.append({"role": "assistant", "content": last_response})
+
+ # Execute each call and build a single user turn with all results.
+ result_blocks = []
+ for call in tool_calls:
+ name = call.get("name", "")
+ args = call.get("args", {}) or {}
+ if not isinstance(args, dict):
+ tool_output = json.dumps({"error": "`args` must be a JSON object"})
+ else:
+ tool_output = tool_registry.execute_tool(name, args)
+ tool_results_log.append({"name": name, "args": args, "output": tool_output})
+ result_blocks.append(
+ f'\n{tool_output}\n'
+ )
+
+ messages.append({
+ "role": "user",
+ "content": "\n\n".join(result_blocks),
+ })
+ else:
+ # for/else: executed only when the loop exhausts max_turns without break.
+ logger.warning(
+ f"Worker {agent_type} hit max_turns={max_turns} "
+ f"with tool calls still pending; returning last response."
+ )
- result = self._parse_worker_response(response, agent_type)
+ result = self._parse_worker_response(last_response, agent_type, tool_results_log)
logger.info(f"Worker {agent_type} completed: {str(result)[:200]}")
return result
@@ -144,7 +224,88 @@ def reset_leader_history(self):
"""Clear leader conversation history between cycles."""
self._leader_history = []
- def _call_llm(self, system: str, messages: list, tools: list = None, max_turns: int = 10) -> str:
+ @staticmethod
+ def _parse_tool_calls(text: str) -> list[dict]:
+ """Extract {...} blocks from an LLM response.
+
+ Silently skips blocks whose JSON body is malformed. An empty list
+ means the response is a final answer (no tool calls requested).
+
+ Tool-call blocks inside triple-backtick code fences are deliberately
+ ignored: LLMs routinely illustrate the protocol inside fenced blocks
+ when explaining what they are about to do, and executing those
+ illustrations as real side-effectful calls has caused accidental
+ writes in practice.
+ """
+ stripped = _FENCED_BLOCK_RE.sub("", text or "")
+ calls: list[dict] = []
+ for match in _TOOL_CALL_RE.finditer(stripped):
+ body = match.group(1)
+ try:
+ parsed = json.loads(body)
+ except json.JSONDecodeError as exc:
+ logger.warning(f"Skipping malformed tool_call block: {exc}")
+ continue
+ if isinstance(parsed, dict) and parsed.get("name"):
+ calls.append(parsed)
+ else:
+ logger.warning(
+ "Skipping tool_call without a string `name` field: "
+ f"{str(parsed)[:120]}"
+ )
+ return calls
+
+ @staticmethod
+ def _render_tools_section(tool_defs: list[dict]) -> str:
+ """Render tool schemas as a plain-text block appended to the system prompt.
+
+ The worker's own prompt already has a short 'Tools Available' list;
+ this auto-generated section provides the exact machine-readable
+ schemas and protocol instructions so the LLM emits calls in the
+ format the dispatcher can parse.
+ """
+ if not tool_defs:
+ return ""
+
+ lines = [
+ "## Tool-Use Protocol",
+ "",
+ "You have NO direct access to the filesystem, shell, or network.",
+ "To act on the environment you MUST emit `` blocks and",
+ "wait for the framework to return `` blocks in the",
+ "next user turn. Example:",
+ "",
+ " ",
+ ' {"name": "read_file", "args": {"path": "config.yaml"}}',
+ " ",
+ "",
+ "You may emit multiple `` blocks in one message; each",
+ "will be executed and its result returned. When you are finished,",
+ "produce a plain-text message with NO `` blocks — that",
+ "is how the framework knows you are done.",
+ "",
+ "Emit `` blocks at the top level of the message. Do NOT",
+ "wrap them in triple-backtick code fences — fenced blocks are",
+ "treated as illustration, not as real calls.",
+ "",
+ "### Available tools",
+ "",
+ ]
+ for tool in tool_defs:
+ name = tool.get("name", "")
+ desc = tool.get("description", "")
+ schema = tool.get("input_schema", {})
+ lines.append(f"- `{name}` — {desc}")
+ props = schema.get("properties", {}) or {}
+ required = set(schema.get("required", []) or [])
+ for pname, pspec in props.items():
+ ptype = pspec.get("type", "any")
+ pdesc = pspec.get("description", "")
+ flag = "required" if pname in required else "optional"
+ lines.append(f" - `{pname}` ({ptype}, {flag}): {pdesc}")
+ return "\n".join(lines)
+
+ def _call_llm(self, system: str, messages: list) -> str:
"""Call the LLM. Four providers are supported.
- "anthropic": Claude SDK, per-token API billing
@@ -154,8 +315,9 @@ def _call_llm(self, system: str, messages: list, tools: list = None, max_turns:
CLI providers let you reuse existing subscriptions instead of paying per-token,
which is much cheaper when running many agents in parallel or doing heavy
- Think/Reflect cycles. Trade-off: no native prompt caching, no tool-use protocol —
- the CLI is used as a plain text-in / text-out oracle.
+ Think/Reflect cycles. Trade-off: no native prompt caching, no native tool-use
+ protocol — the LLM is driven purely as a text-in / text-out oracle, and tool
+ use is layered on top via the text protocol (see dispatch_worker).
"""
if self.provider == "claude_cli":
return self._call_claude_cli(system, messages)
@@ -289,12 +451,20 @@ def _run_cli(self, argv: list, prompt: str, tool_label: str, install_hint: str,
return (result.stdout or "").strip()
def _call_claude_cli(self, system: str, messages: list) -> str:
- """Headless dispatch via the `claude` CLI, billed against a Pro / Max subscription."""
+ """Headless dispatch via the `claude` CLI, billed against a Pro / Max subscription.
+
+ `--tools ""` disables every built-in tool so the CLI degrades to a
+ pure text oracle. This is required for our protocol
+ to work: the CLI must be unable to act on its own, otherwise it
+ will bypass our ToolRegistry (and the loop loses visibility over
+ what actually happened, especially for launch_experiment PIDs).
+
+ The prompt is piped via stdin to sidestep argv-size limits on
+ large conversation histories.
+ """
prompt = self._flatten_for_cli(system, messages)
- # claude -p reads the prompt from stdin when no argument is given,
- # which avoids argv size limits for long memory contexts.
return self._run_cli(
- argv=["claude", "-p", "--output-format", "text"],
+ argv=["claude", "-p", "--output-format", "text", "--tools", ""],
prompt=prompt,
tool_label="claude",
install_hint="npm i -g @anthropic-ai/claude-code && run `claude` once to sign in",
@@ -302,15 +472,74 @@ def _call_claude_cli(self, system: str, messages: list) -> str:
)
def _call_codex_cli(self, system: str, messages: list) -> str:
- """Headless dispatch via the `codex` CLI, billed against a ChatGPT subscription."""
+ """Headless dispatch via the `codex` CLI, billed against a ChatGPT subscription.
+
+ Unlike `claude -p`, `codex exec` is fully agentic by default — it runs
+ its own internal tool-use loop and there is no CLI flag to disable
+ the built-in tools. That means the framework's protocol
+ is unreliable under this provider: codex will often act on its own
+ and return a final summary. Workers that need to launch experiments
+ (and recover a PID from the ToolRegistry) should therefore prefer
+ claude_cli / anthropic / openai; codex_cli is best kept for the
+ leader/think path where we only need free-text output.
+
+ Flags:
+ - `-o ` captures only the final assistant message
+ instead of the full agentic trace,
+ - `--skip-git-repo-check` allows codex to run in arbitrary dirs
+ (the workspace is typically not a repo).
+ """
+ import subprocess
+ import tempfile
+
prompt = self._flatten_for_cli(system, messages)
- return self._run_cli(
- argv=["codex", "exec"],
- prompt=prompt,
- tool_label="codex",
- install_hint="brew install codex (or see upstream project) then run `codex login`",
- use_stdin=False,
- )
+
+ try:
+ with tempfile.NamedTemporaryFile("w+", suffix=".txt", delete=False) as out:
+ out_path = out.name
+ try:
+ result = subprocess.run(
+ [
+ "codex", "exec",
+ "--skip-git-repo-check",
+ "-o", out_path,
+ prompt,
+ ],
+ capture_output=True,
+ text=True,
+ timeout=600,
+ check=False,
+ )
+ except FileNotFoundError:
+ logger.warning(
+ "codex CLI not found on PATH. "
+ "Install: brew install codex (or see upstream) then `codex login`. "
+ "Falling back to mock response."
+ )
+ return json.dumps({"action": "wait", "reason": "codex CLI missing"})
+ except subprocess.TimeoutExpired:
+ logger.error("codex CLI timed out after 600s")
+ return json.dumps({"action": "wait", "reason": "codex CLI timeout"})
+
+ if result.returncode != 0:
+ stderr_tail = (result.stderr or "").strip().splitlines()[-5:]
+ logger.error(
+ f"codex CLI exited {result.returncode}. "
+ f"Stderr tail: {' | '.join(stderr_tail)}"
+ )
+ return json.dumps({"action": "wait", "reason": "codex CLI error"})
+
+ try:
+ with open(out_path, "r") as f:
+ return f.read().strip()
+ except OSError:
+ # Fall back to stdout if --output-last-message didn't produce a file.
+ return (result.stdout or "").strip()
+ finally:
+ try:
+ Path(out_path).unlink(missing_ok=True)
+ except OSError:
+ pass
def _load_prompt(self, filename: str) -> str:
"""Load agent prompt from agents/ directory."""
@@ -358,16 +587,42 @@ def _parse_leader_response(self, response: str) -> dict:
"task": response,
}
- def _parse_worker_response(self, response: str, agent_type: str) -> dict:
- """Parse worker response into structured result."""
+ def _parse_worker_response(self, response: str, agent_type: str,
+ tool_results: Optional[list] = None) -> dict:
+ """Parse worker response into a structured result dict.
+
+ When the worker used the `launch_experiment` tool, the PID and
+ log_file come directly from that tool's JSON result — this is
+ authoritative. The regex-on-free-text path is retained as a
+ fallback for responses that report an experiment launch purely
+ in prose (or for older prompts that predate the tool-use loop).
+ """
result = {"agent": agent_type, "response": response}
+ if tool_results:
+ result["tool_calls"] = len(tool_results)
- # Check for experiment launch indicators
if agent_type == "code":
- if "PID" in response or "launched" in response.lower():
+ # Prefer authoritative tool-result data over text parsing.
+ launch_result = None
+ if tool_results:
+ for entry in reversed(tool_results):
+ if entry.get("name") == "launch_experiment":
+ launch_result = entry
+ break
+ if launch_result is not None:
+ try:
+ payload = json.loads(launch_result.get("output", "{}"))
+ except (json.JSONDecodeError, TypeError):
+ payload = {}
+ if isinstance(payload, dict) and payload.get("pid") is not None:
+ result["experiment_launched"] = True
+ result["pid"] = int(payload["pid"])
+ if payload.get("log_file"):
+ result["log_file"] = payload["log_file"]
+
+ # Fallback: scrape PID from free-text response.
+ if "pid" not in result and ("PID" in response or "launched" in response.lower()):
result["experiment_launched"] = True
- # Try to extract PID
- import re
pid_match = re.search(r"PID[=:\s]+(\d+)", response)
if pid_match:
result["pid"] = int(pid_match.group(1))
diff --git a/core/loop.py b/core/loop.py
index 7131a57..e0d323c 100644
--- a/core/loop.py
+++ b/core/loop.py
@@ -208,7 +208,7 @@ def _execute(self, plan: dict) -> dict:
result = self.dispatcher.dispatch_worker(
agent_type=agent_type,
task=task_description,
- tools=self.tools.get_tools_for(agent_type),
+ tool_registry=self.tools,
)
return result
diff --git a/docs/architecture.md b/docs/architecture.md
index 4234b81..e5e934e 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -96,7 +96,45 @@ No LLM API calls until training completes.
- 4 tools = 800 extra tokens per call
- Over 100 API calls/day, that's 220K tokens saved
-### 6. GPU Utilities (`gpu/`)
+### 6. Tool-Use Protocol (`core/agents.py::dispatch_worker`)
+
+Workers drive tool calls through a provider-agnostic text protocol rather
+than each SDK's native tool-use API:
+
+1. The dispatcher renders the worker's tool schemas as a plain-text
+ `## Tool-Use Protocol` section and appends it to the system prompt.
+2. The worker emits zero or more `{"name": "...", "args": {...}}`
+ blocks in its response.
+3. For each block, the dispatcher calls `ToolRegistry.execute_tool` and
+ packages the JSON result into a `...`
+ block appended to the next user turn.
+4. The loop iterates until the worker returns a message with no tool calls
+ (the final answer) or `max_turns` is reached.
+
+Design rationale:
+
+- **Uniform behaviour across four providers.** The same protocol works
+ whether the LLM is reached via the Anthropic SDK, the OpenAI SDK, the
+ `claude` CLI, or the `codex` CLI. The execution loop contains no
+ per-provider branching.
+- **Authoritative experiment hand-off.** `pid` and `log_file` flow from
+ the `launch_experiment` tool result (structured JSON) to
+ `_parse_worker_response`, which promotes them onto the top-level result
+ dict read by `loop._monitor_experiment`. Regex-on-prose remains as a
+ fallback only.
+- **CLI lock-down.** `claude_cli` is invoked with `--tools ""` so the
+ Claude Code CLI cannot bypass the protocol using its built-in tools.
+ `codex_cli` has no equivalent flag, so it may silently act on its own;
+ `dispatch_worker` logs a warning when `codex_cli` is used as a worker
+ provider, and the README compatibility table flags it accordingly.
+- **Fence stripping.** Tool-call blocks inside triple-backtick code fences
+ are removed before parsing so that models illustrating the protocol in
+ their prose do not trigger real side-effectful tool execution.
+- **Bounded execution.** `max_turns` is configured per-worker
+ (`idea=12`, `code=40`, `writing=30`); on overflow the loop exits cleanly
+ and the last response is returned with a warning.
+
+### 7. GPU Utilities (`gpu/`)
- **detect.py**: Auto-detect GPUs, check availability, reserve last GPU
- **keeper.py**: Keep cloud instances alive with minimal GPU activity
diff --git a/tests/integration_cli_tool_use.py b/tests/integration_cli_tool_use.py
new file mode 100644
index 0000000..2de0f86
--- /dev/null
+++ b/tests/integration_cli_tool_use.py
@@ -0,0 +1,58 @@
+"""Live integration check: drive a real CLI-provider worker end-to-end.
+
+Run manually: python -m tests.integration_cli_tool_use
+
+Burns one subscription round-trip per provider. Skipped automatically
+if the CLI is not on PATH. Not wired into the normal unittest suite.
+"""
+
+import shutil
+import tempfile
+from pathlib import Path
+
+from core.agents import AgentDispatcher
+from core.tools import ToolRegistry
+
+
+TASK = (
+ "Your one job: create a file named hello.txt in the workspace "
+ "containing exactly the three-word sentence 'integration test ok', "
+ "then confirm by listing the files. Once done, reply with a short "
+ "success message and no further tool calls."
+)
+
+
+def _run(provider: str) -> dict:
+ binary = {"claude_cli": "claude", "codex_cli": "codex"}[provider]
+ if shutil.which(binary) is None:
+ return {"provider": provider, "skipped": f"{binary} not on PATH"}
+
+ dispatcher = AgentDispatcher(provider=provider)
+ with tempfile.TemporaryDirectory() as tmp:
+ workspace = Path(tmp)
+ registry = ToolRegistry(workspace)
+ try:
+ result = dispatcher.dispatch_worker("writing", TASK, registry)
+ except Exception as exc:
+ return {"provider": provider, "error": repr(exc)}
+
+ hello = workspace / "hello.txt"
+ return {
+ "provider": provider,
+ "tool_calls": result.get("tool_calls", 0),
+ "file_created": hello.exists(),
+ "file_content": hello.read_text() if hello.exists() else None,
+ "response_tail": (result.get("response", "") or "")[-200:],
+ }
+
+
+def main():
+ for provider in ("claude_cli", "codex_cli"):
+ print(f"\n=== {provider} ===")
+ outcome = _run(provider)
+ for k, v in outcome.items():
+ print(f" {k}: {v}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/tests/test_tool_use_loop.py b/tests/test_tool_use_loop.py
new file mode 100644
index 0000000..b6c295f
--- /dev/null
+++ b/tests/test_tool_use_loop.py
@@ -0,0 +1,295 @@
+"""Tests for the worker tool-use loop in core/agents.py.
+
+The loop must:
+ - run multiple turns until the model stops emitting blocks,
+ - feed each tool's output back as a in the next user turn,
+ - respect max_turns as a hard ceiling,
+ - surface PID/log_file from a launch_experiment tool call so the EXECUTE
+ → MONITOR handoff in core/loop.py still works,
+ - ignore malformed JSON bodies without crashing.
+"""
+
+import json
+import tempfile
+import unittest
+from pathlib import Path
+from unittest.mock import patch
+
+from core.agents import AgentDispatcher
+from core.tools import ToolRegistry
+
+
+def _make_dispatcher():
+ # Provider choice is irrelevant because we stub _call_llm everywhere.
+ return AgentDispatcher(provider="anthropic")
+
+
+class ParseToolCallsTests(unittest.TestCase):
+ def test_extracts_multiple_blocks_in_order(self):
+ text = """
+ I will take two actions.
+
+ {"name": "read_file", "args": {"path": "a.txt"}}
+
+ Some prose in between.
+
+ {"name": "write_file", "args": {"path": "b.txt", "content": "hi"}}
+
+ """
+ calls = AgentDispatcher._parse_tool_calls(text)
+ self.assertEqual([c["name"] for c in calls], ["read_file", "write_file"])
+ self.assertEqual(calls[1]["args"]["path"], "b.txt")
+
+ def test_empty_when_no_blocks(self):
+ self.assertEqual(AgentDispatcher._parse_tool_calls("final answer"), [])
+
+ def test_skips_malformed_json(self):
+ text = """
+ {"name": "ok", "args": {}}
+ {not valid json
+ "just a string"
+ """
+ calls = AgentDispatcher._parse_tool_calls(text)
+ self.assertEqual(len(calls), 1)
+ self.assertEqual(calls[0]["name"], "ok")
+
+ def test_ignores_tool_calls_inside_code_fences(self):
+ """LLMs often illustrate the protocol inside fenced blocks when
+ explaining themselves; those illustrations must NOT execute."""
+ text = '''
+ Here is how you would call the tool:
+
+ ```
+
+ {"name": "write_file", "args": {"path": "DANGER.txt", "content": "pwned"}}
+
+ ```
+
+ But for this task I do not need any tools.
+ '''
+ self.assertEqual(AgentDispatcher._parse_tool_calls(text), [])
+
+ def test_mix_of_fenced_illustration_and_real_call(self):
+ """A real top-level call must still be picked up even if the message
+ also contains an illustrative fenced example."""
+ text = '''
+ For reference, the general form looks like:
+
+ ```
+
+ {"name": "write_file", "args": {"path": "EXAMPLE.txt", "content": "x"}}
+
+ ```
+
+ Now I will do the real call:
+
+
+ {"name": "read_file", "args": {"path": "actual.txt"}}
+
+ '''
+ calls = AgentDispatcher._parse_tool_calls(text)
+ self.assertEqual(len(calls), 1)
+ self.assertEqual(calls[0]["name"], "read_file")
+ self.assertEqual(calls[0]["args"]["path"], "actual.txt")
+
+ def test_tolerates_missing_args_key(self):
+ """A tool_call without an `args` field is valid; args default to {}."""
+ text = '{"name": "list_files"}'
+ calls = AgentDispatcher._parse_tool_calls(text)
+ self.assertEqual(len(calls), 1)
+ self.assertEqual(calls[0]["name"], "list_files")
+
+ def test_rejects_args_that_is_not_a_dict(self):
+ """If the LLM emits args as a string or list, the dispatcher must
+ surface a structured error rather than crashing on **kwargs."""
+ dispatcher = _make_dispatcher()
+ registry = _FakeRegistry(
+ tools=[{"name": "read_file", "description": "", "input_schema": {}}],
+ outputs={},
+ )
+ turns = [
+ '{"name": "read_file", "args": "not-a-dict"}',
+ "giving up",
+ ]
+ with patch.object(dispatcher, "_call_llm", side_effect=turns):
+ dispatcher.dispatch_worker("writing", "t", registry)
+ # The registry must NOT have been called with a non-dict args payload.
+ self.assertEqual(registry.calls, [])
+
+
+class RenderToolsSectionTests(unittest.TestCase):
+ def test_renders_schema_properties(self):
+ tool = {
+ "name": "search_papers",
+ "description": "Search Semantic Scholar.",
+ "input_schema": {
+ "type": "object",
+ "properties": {
+ "query": {"type": "string", "description": "Search query"},
+ "limit": {"type": "integer", "description": "Max results"},
+ },
+ "required": ["query"],
+ },
+ }
+ rendered = AgentDispatcher._render_tools_section([tool])
+ self.assertIn("search_papers", rendered)
+ self.assertIn("", rendered)
+ self.assertIn("query", rendered)
+ self.assertIn("required", rendered)
+ self.assertIn("Max results", rendered)
+
+ def test_empty_when_no_tools(self):
+ self.assertEqual(AgentDispatcher._render_tools_section([]), "")
+
+
+class _FakeRegistry:
+ """Minimal ToolRegistry stub that records calls and returns canned outputs."""
+
+ def __init__(self, tools, outputs):
+ self._tools = tools
+ self._outputs = outputs # dict: tool_name -> json string
+ self.calls: list[tuple[str, dict]] = []
+
+ def get_tools_for(self, agent_type):
+ return self._tools
+
+ def execute_tool(self, name, args):
+ self.calls.append((name, args))
+ return self._outputs.get(name, json.dumps({"ok": True}))
+
+
+class DispatchWorkerLoopTests(unittest.TestCase):
+ def test_none_registry_raises_clear_typeerror(self):
+ """Passing tool_registry=None must fail at the boundary with a
+ clear TypeError, not with a cryptic AttributeError deep in the
+ loop. External callers of dispatch_worker will hit this edge."""
+ dispatcher = _make_dispatcher()
+ with self.assertRaises(TypeError) as ctx:
+ dispatcher.dispatch_worker("writing", "task", None)
+ self.assertIn("tool_registry", str(ctx.exception))
+ self.assertIn("get_tools_for", str(ctx.exception))
+
+ def test_unknown_agent_type_raises_before_touching_registry(self):
+ """Agent-type validation should happen first so that a bad
+ agent_type produces ValueError regardless of registry state."""
+ dispatcher = _make_dispatcher()
+ with self.assertRaises(ValueError):
+ dispatcher.dispatch_worker("bogus_agent", "task", None)
+
+ def test_terminates_when_response_has_no_tool_calls(self):
+ dispatcher = _make_dispatcher()
+ registry = _FakeRegistry(tools=[], outputs={})
+
+ with patch.object(dispatcher, "_call_llm", return_value="all done, no tools"):
+ result = dispatcher.dispatch_worker("writing", "task", registry)
+
+ self.assertEqual(registry.calls, [])
+ self.assertEqual(result["agent"], "writing")
+ self.assertIn("all done", result["response"])
+
+ def test_executes_tools_and_feeds_results_back(self):
+ dispatcher = _make_dispatcher()
+ fake_tools = [{"name": "read_file", "description": "read",
+ "input_schema": {"type": "object", "properties": {}}}]
+ registry = _FakeRegistry(
+ tools=fake_tools,
+ outputs={"read_file": json.dumps({"content": "file body"})},
+ )
+
+ turns = [
+ '{"name": "read_file", "args": {"path": "a.txt"}}',
+ "Done reading, here is my summary.",
+ ]
+ call_log: list[list] = []
+
+ def fake_call(system, messages):
+ call_log.append(list(messages))
+ return turns.pop(0)
+
+ with patch.object(dispatcher, "_call_llm", side_effect=fake_call):
+ result = dispatcher.dispatch_worker("writing", "task", registry)
+
+ # One tool call was executed with the right args.
+ self.assertEqual(registry.calls, [("read_file", {"path": "a.txt"})])
+ # Second LLM turn saw the assistant's tool_call and a user tool_result.
+ self.assertEqual(len(call_log), 2)
+ second_turn_messages = call_log[1]
+ self.assertEqual(second_turn_messages[0]["role"], "user") # original task
+ self.assertEqual(second_turn_messages[1]["role"], "assistant") # tool call echo
+ self.assertEqual(second_turn_messages[2]["role"], "user") # tool_result block
+ self.assertIn("{"name": "launch_experiment", '
+ '"args": {"command": "python train.py", "log_file": "exp.log"}}'
+ '',
+ # Deliberately give a prose reply that lies about the PID — tool result should win.
+ "Training started, PID=99999 (this number is wrong).",
+ ]
+
+ with patch.object(dispatcher, "_call_llm", side_effect=turns):
+ result = dispatcher.dispatch_worker("code", "launch it", registry)
+
+ self.assertTrue(result["experiment_launched"])
+ self.assertEqual(result["pid"], 4321)
+ self.assertEqual(result["log_file"], "/tmp/exp.log")
+
+ def test_end_to_end_with_real_registry(self):
+ """Smoke test with a real ToolRegistry and a temp workspace."""
+ dispatcher = _make_dispatcher()
+ with tempfile.TemporaryDirectory() as tmp:
+ workspace = Path(tmp)
+ registry = ToolRegistry(workspace)
+
+ turns = [
+ '{"name": "write_file", '
+ '"args": {"path": "note.txt", "content": "hello"}}',
+ '{"name": "read_file", '
+ '"args": {"path": "note.txt"}}',
+ "I wrote and read the file successfully.",
+ ]
+
+ with patch.object(dispatcher, "_call_llm", side_effect=turns):
+ result = dispatcher.dispatch_worker("writing", "do it", registry)
+
+ self.assertEqual(result["tool_calls"], 2)
+ self.assertTrue((workspace / "note.txt").exists())
+ self.assertEqual((workspace / "note.txt").read_text(), "hello")
+
+
+if __name__ == "__main__":
+ unittest.main()