diff --git a/AI_GUIDE.md b/AI_GUIDE.md index 6f042ce..79ff6d5 100644 --- a/AI_GUIDE.md +++ b/AI_GUIDE.md @@ -348,6 +348,33 @@ During training (90%+ of time), the agent does NOT call the LLM. It only does: - **Writing Agent**: generates reports (3 tools) - Only 1 worker active at a time, others cost $0 +### Tool-Use Protocol (provider-agnostic) + +Workers do not use each provider's native SDK tool-use protocol. Instead the +framework injects a plain-text schema into the system prompt and the worker +emits tool calls as `{...}` blocks. The dispatcher +parses the blocks, runs each through `ToolRegistry.execute_tool`, and feeds +results back as `...` in the next user +turn. The loop runs until the worker produces a response with no tool calls +(the final answer) or `max_turns` is reached. + +Why this design: + +- **One protocol, four providers** — the Anthropic and OpenAI SDK paths use + the same text protocol as `claude_cli` and `codex_cli`. No per-provider + branching in the execution loop. +- **Authoritative PID / log_file** — the EXECUTE → MONITOR handoff reads + `pid` and `log_file` directly from the `launch_experiment` tool's JSON + result, not from regex-scraping the model's prose. +- **Provider-lock-down** — for `claude_cli` the framework passes + `--tools ""` so the CLI cannot bypass the protocol with its own built-in + tools. `codex_cli` has no equivalent flag and will silently ignore the + protocol; a runtime warning is emitted when it is used as a worker, and + users should pick one of the other three providers for worker dispatches. +- **Fence stripping** — tool-call blocks inside triple-backtick code fences + are ignored, so a model's illustrative example in its prose is never + accidentally executed. + ### Safety - Mandatory dry-run before every real training - Protected files can't be overwritten diff --git a/CLAUDE.md b/CLAUDE.md index a5f74f3..8bda4ba 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -396,6 +396,30 @@ During training (90%+ of time), the agent does NOT call the LLM. It only does: - **Writing Agent**: generates reports (3 tools) - Only 1 worker active at a time, others cost $0 +### Tool-Use Protocol (provider-agnostic) + +Workers do not use each provider's native SDK tool-use protocol. Instead the +framework injects a plain-text schema into the system prompt and the worker +emits tool calls as `{...}` blocks. The dispatcher +parses the blocks, runs each through `ToolRegistry.execute_tool`, and feeds +results back as `...` in the next user +turn. The loop runs until the worker produces a response with no tool calls +(the final answer) or `max_turns` is reached. + +Key properties: + +- One text protocol works identically across all four providers — no + per-provider branching in the execution loop. +- `launch_experiment` PID and log_file come authoritatively from the tool + result's JSON, not from regex-scraping the model's prose. +- For `claude_cli` the framework passes `--tools ""` so the CLI cannot + bypass the protocol with its own built-in tools. `codex_cli` has no + equivalent flag and may silently ignore the protocol; a runtime warning + is emitted when it is used as a worker, and users should pick one of the + other three providers for worker dispatches. +- Tool-call blocks inside triple-backtick code fences are ignored, so a + model's illustrative example in prose is never accidentally executed. + ### Safety - Mandatory dry-run before every real training - Protected files can't be overwritten diff --git a/README.md b/README.md index 0b4666d..a3a1a0e 100644 --- a/README.md +++ b/README.md @@ -37,6 +37,14 @@ ## Recent Updates +**2026-04-19** +- Workers now execute tools through a real multi-turn tool-use loop. The dispatcher injects the tool schema into the system prompt, parses `` blocks from the LLM response, runs each through `ToolRegistry.execute_tool`, feeds results back as `` in the next turn, and iterates until the worker produces a response with no tool calls or `max_turns` is hit. Previously the `tools` argument was accepted and silently dropped, and worker output was regex-scraped for PIDs — closes the gap raised in issue #13. +- `launch_experiment` PIDs and log file paths are now surfaced directly from the tool result (authoritative), with the old free-text regex retained only as a fallback for pre-protocol responses. +- `claude_cli` is forced into pure-text mode via `claude -p --tools ""`, so its responses reliably go through the framework's protocol. +- `codex_cli` cannot be forced into pure-text mode by any current flag; when used as a worker provider the framework now emits a clear warning (see the updated compatibility table in *Supported LLM Providers*). +- Tool-call blocks inside triple-backtick code fences are stripped before parsing, so illustrative examples in the LLM's prose are no longer accidentally executed. +- Dead parameters (`tools`, `max_turns`) removed from `_call_llm`. They were never forwarded to the SDK; this aligns the code with what it actually does. + **2026-04-18** - Added two new `provider` modes that reuse existing flat-rate subscriptions instead of per-token API billing: `claude_cli` (via the local `claude -p` CLI) and `codex_cli` (via the local `codex exec` CLI). Much cheaper when running multiple 24/7 agents in parallel. See the updated *Supported LLM Providers* section for the full API-vs-subscription trade-off table. - Provider validation added at dispatcher construction; unknown provider values now fail fast with a clear error instead of silently falling through. @@ -863,18 +871,21 @@ Works with **both Anthropic and OpenAI** out of the box, and can run on a ### Authentication mode: API key vs. subscription -| Mode | `provider` value | Billing | Requires | -|------|------------------|---------|----------| -| API — Anthropic | `anthropic` | Per-token, via `ANTHROPIC_API_KEY` | `pip install anthropic` | -| API — OpenAI | `openai` | Per-token, via `OPENAI_API_KEY` | `pip install openai` | -| **Subscription — Claude** | `claude_cli` | Flat-rate, uses your Claude Code / Pro / Max plan | `claude` CLI installed and logged in | -| **Subscription — ChatGPT** | `codex_cli` | Flat-rate, uses your ChatGPT Plus / Pro plan | `codex` CLI installed and logged in | - -The `*_cli` modes shell out to the headless CLI (`claude -p` / `codex exec`) and -share your existing subscription quota. This is much cheaper than per-token -billing when you run multiple agents in parallel or do heavy Think/Reflect -cycles. Trade-off: no native prompt caching or tool-use protocol — the CLI is -used as a plain text-in / text-out oracle. +| Mode | `provider` value | Billing | Requires | Tool-use support | +|------|------------------|---------|----------|------------------| +| API — Anthropic | `anthropic` | Per-token, via `ANTHROPIC_API_KEY` | `pip install anthropic` | ✅ Full | +| API — OpenAI | `openai` | Per-token, via `OPENAI_API_KEY` | `pip install openai` | ✅ Full | +| **Subscription — Claude** | `claude_cli` | Flat-rate, uses your Claude Code / Pro / Max plan | `claude` CLI installed and logged in | ✅ Full | +| **Subscription — ChatGPT** | `codex_cli` | Flat-rate, uses your ChatGPT Plus / Pro plan | `codex` CLI installed and logged in | ⚠️ Leader only | + +Tool execution is driven by a text-based `` protocol injected +into the worker's system prompt. All three "Full" providers can be forced +into pure text-oracle mode so they honor the protocol (for `claude_cli` +the framework passes `--tools ""` to disable built-in CLI tools). The +`codex` CLI currently offers no equivalent flag — its internal agentic +loop will bypass the protocol and the framework cannot recover PIDs from +experiments it launches. Use `codex_cli` only for the leader/think path +where no tools are needed. Switch provider in `config.yaml`: ```yaml diff --git a/config.yaml b/config.yaml index 45949de..36e6918 100644 --- a/config.yaml +++ b/config.yaml @@ -8,14 +8,23 @@ project: agent: # Provider: - # "anthropic" — Anthropic SDK, per-token API billing (ANTHROPIC_API_KEY) - # "openai" — OpenAI SDK, per-token API billing (OPENAI_API_KEY) - # "claude_cli" — `claude -p` subprocess, reuses Claude Code / Pro / Max subscription - # "codex_cli" — `codex exec` subprocess, reuses ChatGPT Plus / Pro subscription + # "anthropic" — SDK, per-token API billing (ANTHROPIC_API_KEY) + # "openai" — SDK, per-token API billing (OPENAI_API_KEY) + # "claude_cli" — subprocess, reuses Claude Code / Pro / Max subscription + # "codex_cli" — subprocess, reuses ChatGPT Plus / Pro subscription # - # The *_cli options are much cheaper when running many agents in parallel, - # because they share your existing subscription quota instead of per-token billing. - # They require the corresponding CLI to be installed and logged in once on this host. + # The *_cli options are much cheaper when running many agents in parallel + # because they share your existing subscription quota instead of per-token + # billing. They require the corresponding CLI to be installed and logged + # in once on this host. + # + # Tool-use compatibility: the worker path drives tools through a text-based + # protocol. Compatible providers: "anthropic", "openai", + # "claude_cli" (we force `--tools ""` so the CLI cannot bypass it). + # "codex_cli" cannot be forced into pure-text mode by any current CLI flag, + # so it will silently bypass the ToolRegistry and lose PID tracking. Use it + # for the leader/think path only; keep one of the three other providers for + # worker dispatches. provider: "anthropic" # Model selection per provider: diff --git a/core/agents.py b/core/agents.py index d0fe267..11f38aa 100644 --- a/core/agents.py +++ b/core/agents.py @@ -6,10 +6,20 @@ - Workers: Specialized agents (idea/code/writing), spawned on demand Only ONE worker runs at a time. Others idle at zero token cost. + +Tool use is implemented via a provider-agnostic text protocol. The LLM +emits {...} blocks, the dispatcher executes each +call through the ToolRegistry, and results are fed back as +... blocks in the next user turn. +The loop runs until the worker produces a response with no tool calls +(the final answer) or max_turns is exceeded. This works uniformly +across all four providers — the API SDKs don't use their native +tool-use protocol, and the CLI providers are simply text oracles. """ import json import logging +import re from pathlib import Path from typing import Optional @@ -20,6 +30,14 @@ AGENTS_DIR = Path(__file__).parent.parent / "agents" +# Tool-use text protocol +_TOOL_CALL_RE = re.compile(r"\s*(\{.*?\})\s*", re.DOTALL) +# Triple-backtick fenced blocks are stripped before parsing so that LLMs can +# illustrate the protocol inside code fences without triggering real tool +# execution. Matches ``` with an optional language tag through the next ```. +_FENCED_BLOCK_RE = re.compile(r"```[^\n]*\n.*?```", re.DOTALL) + + class AgentDispatcher: """Dispatches tasks to specialized agents. @@ -96,47 +114,109 @@ def dispatch_leader(self, task: str, context: dict) -> dict: "content": self._format_leader_input(task, context), }) - response = self._call_llm( - system=system_prompt, - messages=messages, - max_turns=10, - ) + response = self._call_llm(system=system_prompt, messages=messages) # Persist conversation for within-cycle coherence self._leader_history = messages + [{"role": "assistant", "content": response}] return self._parse_leader_response(response) - def dispatch_worker(self, agent_type: str, task: str, tools: list) -> dict: - """Dispatch a task to a worker agent. + def dispatch_worker(self, agent_type: str, task: str, tool_registry) -> dict: + """Dispatch a task to a worker agent and run its tool-use loop. - Workers are stateless — each dispatch is independent. - This keeps token costs predictable. + Workers are stateless across dispatches — each call starts with a + fresh conversation. Within a single dispatch the conversation is + multi-turn: the worker may emit tool calls, receive results, and + continue reasoning until it produces a final answer (a response + containing no blocks). Args: - agent_type: "idea", "code", or "writing" - task: Task description from the Leader - tools: Tool definitions to provide + agent_type: "idea", "code", or "writing". + task: Task description from the Leader. + tool_registry: ToolRegistry that provides `get_tools_for` and + `execute_tool`. The registry itself is passed in so this + module does not have a hard import dependency on tools.py. Returns: - Worker's result as a dict + Dict with at minimum `agent` and `response`. If the worker + called `launch_experiment`, the PID and log_file from that + tool result are also surfaced at the top level so the loop's + EXECUTE → MONITOR handoff keeps working. """ if agent_type not in self.WORKER_CONFIGS: raise ValueError(f"Unknown agent type: {agent_type}") + if tool_registry is None: + raise TypeError( + "dispatch_worker requires a tool_registry with " + "`get_tools_for(agent_type)` and `execute_tool(name, args)` " + "methods. Pass a ToolRegistry configured with an empty tool " + "list if you want a tool-less worker." + ) config = self.WORKER_CONFIGS[agent_type] - system_prompt = self._load_prompt(config["prompt_file"]) + base_prompt = self._load_prompt(config["prompt_file"]) + tool_defs = tool_registry.get_tools_for(agent_type) + system_prompt = base_prompt + "\n\n" + self._render_tools_section(tool_defs) + max_turns = config["max_turns"] + + # codex_cli hard-codes its own agentic tool loop; it will ignore the + # protocol and silently act on its own. That breaks the + # EXECUTE → MONITOR handoff (no PID, no log_file from ToolRegistry). + # Leader/think dispatches are fine (they do not use tools) but worker + # dispatches will likely return a non-authoritative summary. Warn once + # per dispatch so users see it in the log without it becoming noise. + if self.provider == "codex_cli" and tool_defs: + logger.warning( + "codex_cli is being used as a worker provider; its CLI does " + "not support disabling built-in tools, so it may bypass the " + "ToolRegistry and the resulting PID/log_file cannot be " + "recovered. For worker dispatches prefer claude_cli, " + "anthropic, or openai." + ) logger.info(f"Dispatching {agent_type} agent: {task[:100]}...") - response = self._call_llm( - system=system_prompt, - messages=[{"role": "user", "content": task}], - tools=tools, - max_turns=config["max_turns"], - ) + messages = [{"role": "user", "content": task}] + last_response = "" + tool_results_log: list[dict] = [] + + for turn in range(1, max_turns + 1): + last_response = self._call_llm(system=system_prompt, messages=messages) + + tool_calls = self._parse_tool_calls(last_response) + if not tool_calls: + # No tool calls → worker has produced its final answer. + break + + # Echo the assistant turn so the next LLM call sees the history. + messages.append({"role": "assistant", "content": last_response}) + + # Execute each call and build a single user turn with all results. + result_blocks = [] + for call in tool_calls: + name = call.get("name", "") + args = call.get("args", {}) or {} + if not isinstance(args, dict): + tool_output = json.dumps({"error": "`args` must be a JSON object"}) + else: + tool_output = tool_registry.execute_tool(name, args) + tool_results_log.append({"name": name, "args": args, "output": tool_output}) + result_blocks.append( + f'\n{tool_output}\n' + ) + + messages.append({ + "role": "user", + "content": "\n\n".join(result_blocks), + }) + else: + # for/else: executed only when the loop exhausts max_turns without break. + logger.warning( + f"Worker {agent_type} hit max_turns={max_turns} " + f"with tool calls still pending; returning last response." + ) - result = self._parse_worker_response(response, agent_type) + result = self._parse_worker_response(last_response, agent_type, tool_results_log) logger.info(f"Worker {agent_type} completed: {str(result)[:200]}") return result @@ -144,7 +224,88 @@ def reset_leader_history(self): """Clear leader conversation history between cycles.""" self._leader_history = [] - def _call_llm(self, system: str, messages: list, tools: list = None, max_turns: int = 10) -> str: + @staticmethod + def _parse_tool_calls(text: str) -> list[dict]: + """Extract {...} blocks from an LLM response. + + Silently skips blocks whose JSON body is malformed. An empty list + means the response is a final answer (no tool calls requested). + + Tool-call blocks inside triple-backtick code fences are deliberately + ignored: LLMs routinely illustrate the protocol inside fenced blocks + when explaining what they are about to do, and executing those + illustrations as real side-effectful calls has caused accidental + writes in practice. + """ + stripped = _FENCED_BLOCK_RE.sub("", text or "") + calls: list[dict] = [] + for match in _TOOL_CALL_RE.finditer(stripped): + body = match.group(1) + try: + parsed = json.loads(body) + except json.JSONDecodeError as exc: + logger.warning(f"Skipping malformed tool_call block: {exc}") + continue + if isinstance(parsed, dict) and parsed.get("name"): + calls.append(parsed) + else: + logger.warning( + "Skipping tool_call without a string `name` field: " + f"{str(parsed)[:120]}" + ) + return calls + + @staticmethod + def _render_tools_section(tool_defs: list[dict]) -> str: + """Render tool schemas as a plain-text block appended to the system prompt. + + The worker's own prompt already has a short 'Tools Available' list; + this auto-generated section provides the exact machine-readable + schemas and protocol instructions so the LLM emits calls in the + format the dispatcher can parse. + """ + if not tool_defs: + return "" + + lines = [ + "## Tool-Use Protocol", + "", + "You have NO direct access to the filesystem, shell, or network.", + "To act on the environment you MUST emit `` blocks and", + "wait for the framework to return `` blocks in the", + "next user turn. Example:", + "", + " ", + ' {"name": "read_file", "args": {"path": "config.yaml"}}', + " ", + "", + "You may emit multiple `` blocks in one message; each", + "will be executed and its result returned. When you are finished,", + "produce a plain-text message with NO `` blocks — that", + "is how the framework knows you are done.", + "", + "Emit `` blocks at the top level of the message. Do NOT", + "wrap them in triple-backtick code fences — fenced blocks are", + "treated as illustration, not as real calls.", + "", + "### Available tools", + "", + ] + for tool in tool_defs: + name = tool.get("name", "") + desc = tool.get("description", "") + schema = tool.get("input_schema", {}) + lines.append(f"- `{name}` — {desc}") + props = schema.get("properties", {}) or {} + required = set(schema.get("required", []) or []) + for pname, pspec in props.items(): + ptype = pspec.get("type", "any") + pdesc = pspec.get("description", "") + flag = "required" if pname in required else "optional" + lines.append(f" - `{pname}` ({ptype}, {flag}): {pdesc}") + return "\n".join(lines) + + def _call_llm(self, system: str, messages: list) -> str: """Call the LLM. Four providers are supported. - "anthropic": Claude SDK, per-token API billing @@ -154,8 +315,9 @@ def _call_llm(self, system: str, messages: list, tools: list = None, max_turns: CLI providers let you reuse existing subscriptions instead of paying per-token, which is much cheaper when running many agents in parallel or doing heavy - Think/Reflect cycles. Trade-off: no native prompt caching, no tool-use protocol — - the CLI is used as a plain text-in / text-out oracle. + Think/Reflect cycles. Trade-off: no native prompt caching, no native tool-use + protocol — the LLM is driven purely as a text-in / text-out oracle, and tool + use is layered on top via the text protocol (see dispatch_worker). """ if self.provider == "claude_cli": return self._call_claude_cli(system, messages) @@ -289,12 +451,20 @@ def _run_cli(self, argv: list, prompt: str, tool_label: str, install_hint: str, return (result.stdout or "").strip() def _call_claude_cli(self, system: str, messages: list) -> str: - """Headless dispatch via the `claude` CLI, billed against a Pro / Max subscription.""" + """Headless dispatch via the `claude` CLI, billed against a Pro / Max subscription. + + `--tools ""` disables every built-in tool so the CLI degrades to a + pure text oracle. This is required for our protocol + to work: the CLI must be unable to act on its own, otherwise it + will bypass our ToolRegistry (and the loop loses visibility over + what actually happened, especially for launch_experiment PIDs). + + The prompt is piped via stdin to sidestep argv-size limits on + large conversation histories. + """ prompt = self._flatten_for_cli(system, messages) - # claude -p reads the prompt from stdin when no argument is given, - # which avoids argv size limits for long memory contexts. return self._run_cli( - argv=["claude", "-p", "--output-format", "text"], + argv=["claude", "-p", "--output-format", "text", "--tools", ""], prompt=prompt, tool_label="claude", install_hint="npm i -g @anthropic-ai/claude-code && run `claude` once to sign in", @@ -302,15 +472,74 @@ def _call_claude_cli(self, system: str, messages: list) -> str: ) def _call_codex_cli(self, system: str, messages: list) -> str: - """Headless dispatch via the `codex` CLI, billed against a ChatGPT subscription.""" + """Headless dispatch via the `codex` CLI, billed against a ChatGPT subscription. + + Unlike `claude -p`, `codex exec` is fully agentic by default — it runs + its own internal tool-use loop and there is no CLI flag to disable + the built-in tools. That means the framework's protocol + is unreliable under this provider: codex will often act on its own + and return a final summary. Workers that need to launch experiments + (and recover a PID from the ToolRegistry) should therefore prefer + claude_cli / anthropic / openai; codex_cli is best kept for the + leader/think path where we only need free-text output. + + Flags: + - `-o ` captures only the final assistant message + instead of the full agentic trace, + - `--skip-git-repo-check` allows codex to run in arbitrary dirs + (the workspace is typically not a repo). + """ + import subprocess + import tempfile + prompt = self._flatten_for_cli(system, messages) - return self._run_cli( - argv=["codex", "exec"], - prompt=prompt, - tool_label="codex", - install_hint="brew install codex (or see upstream project) then run `codex login`", - use_stdin=False, - ) + + try: + with tempfile.NamedTemporaryFile("w+", suffix=".txt", delete=False) as out: + out_path = out.name + try: + result = subprocess.run( + [ + "codex", "exec", + "--skip-git-repo-check", + "-o", out_path, + prompt, + ], + capture_output=True, + text=True, + timeout=600, + check=False, + ) + except FileNotFoundError: + logger.warning( + "codex CLI not found on PATH. " + "Install: brew install codex (or see upstream) then `codex login`. " + "Falling back to mock response." + ) + return json.dumps({"action": "wait", "reason": "codex CLI missing"}) + except subprocess.TimeoutExpired: + logger.error("codex CLI timed out after 600s") + return json.dumps({"action": "wait", "reason": "codex CLI timeout"}) + + if result.returncode != 0: + stderr_tail = (result.stderr or "").strip().splitlines()[-5:] + logger.error( + f"codex CLI exited {result.returncode}. " + f"Stderr tail: {' | '.join(stderr_tail)}" + ) + return json.dumps({"action": "wait", "reason": "codex CLI error"}) + + try: + with open(out_path, "r") as f: + return f.read().strip() + except OSError: + # Fall back to stdout if --output-last-message didn't produce a file. + return (result.stdout or "").strip() + finally: + try: + Path(out_path).unlink(missing_ok=True) + except OSError: + pass def _load_prompt(self, filename: str) -> str: """Load agent prompt from agents/ directory.""" @@ -358,16 +587,42 @@ def _parse_leader_response(self, response: str) -> dict: "task": response, } - def _parse_worker_response(self, response: str, agent_type: str) -> dict: - """Parse worker response into structured result.""" + def _parse_worker_response(self, response: str, agent_type: str, + tool_results: Optional[list] = None) -> dict: + """Parse worker response into a structured result dict. + + When the worker used the `launch_experiment` tool, the PID and + log_file come directly from that tool's JSON result — this is + authoritative. The regex-on-free-text path is retained as a + fallback for responses that report an experiment launch purely + in prose (or for older prompts that predate the tool-use loop). + """ result = {"agent": agent_type, "response": response} + if tool_results: + result["tool_calls"] = len(tool_results) - # Check for experiment launch indicators if agent_type == "code": - if "PID" in response or "launched" in response.lower(): + # Prefer authoritative tool-result data over text parsing. + launch_result = None + if tool_results: + for entry in reversed(tool_results): + if entry.get("name") == "launch_experiment": + launch_result = entry + break + if launch_result is not None: + try: + payload = json.loads(launch_result.get("output", "{}")) + except (json.JSONDecodeError, TypeError): + payload = {} + if isinstance(payload, dict) and payload.get("pid") is not None: + result["experiment_launched"] = True + result["pid"] = int(payload["pid"]) + if payload.get("log_file"): + result["log_file"] = payload["log_file"] + + # Fallback: scrape PID from free-text response. + if "pid" not in result and ("PID" in response or "launched" in response.lower()): result["experiment_launched"] = True - # Try to extract PID - import re pid_match = re.search(r"PID[=:\s]+(\d+)", response) if pid_match: result["pid"] = int(pid_match.group(1)) diff --git a/core/loop.py b/core/loop.py index 7131a57..e0d323c 100644 --- a/core/loop.py +++ b/core/loop.py @@ -208,7 +208,7 @@ def _execute(self, plan: dict) -> dict: result = self.dispatcher.dispatch_worker( agent_type=agent_type, task=task_description, - tools=self.tools.get_tools_for(agent_type), + tool_registry=self.tools, ) return result diff --git a/docs/architecture.md b/docs/architecture.md index 4234b81..e5e934e 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -96,7 +96,45 @@ No LLM API calls until training completes. - 4 tools = 800 extra tokens per call - Over 100 API calls/day, that's 220K tokens saved -### 6. GPU Utilities (`gpu/`) +### 6. Tool-Use Protocol (`core/agents.py::dispatch_worker`) + +Workers drive tool calls through a provider-agnostic text protocol rather +than each SDK's native tool-use API: + +1. The dispatcher renders the worker's tool schemas as a plain-text + `## Tool-Use Protocol` section and appends it to the system prompt. +2. The worker emits zero or more `{"name": "...", "args": {...}}` + blocks in its response. +3. For each block, the dispatcher calls `ToolRegistry.execute_tool` and + packages the JSON result into a `...` + block appended to the next user turn. +4. The loop iterates until the worker returns a message with no tool calls + (the final answer) or `max_turns` is reached. + +Design rationale: + +- **Uniform behaviour across four providers.** The same protocol works + whether the LLM is reached via the Anthropic SDK, the OpenAI SDK, the + `claude` CLI, or the `codex` CLI. The execution loop contains no + per-provider branching. +- **Authoritative experiment hand-off.** `pid` and `log_file` flow from + the `launch_experiment` tool result (structured JSON) to + `_parse_worker_response`, which promotes them onto the top-level result + dict read by `loop._monitor_experiment`. Regex-on-prose remains as a + fallback only. +- **CLI lock-down.** `claude_cli` is invoked with `--tools ""` so the + Claude Code CLI cannot bypass the protocol using its built-in tools. + `codex_cli` has no equivalent flag, so it may silently act on its own; + `dispatch_worker` logs a warning when `codex_cli` is used as a worker + provider, and the README compatibility table flags it accordingly. +- **Fence stripping.** Tool-call blocks inside triple-backtick code fences + are removed before parsing so that models illustrating the protocol in + their prose do not trigger real side-effectful tool execution. +- **Bounded execution.** `max_turns` is configured per-worker + (`idea=12`, `code=40`, `writing=30`); on overflow the loop exits cleanly + and the last response is returned with a warning. + +### 7. GPU Utilities (`gpu/`) - **detect.py**: Auto-detect GPUs, check availability, reserve last GPU - **keeper.py**: Keep cloud instances alive with minimal GPU activity diff --git a/tests/integration_cli_tool_use.py b/tests/integration_cli_tool_use.py new file mode 100644 index 0000000..2de0f86 --- /dev/null +++ b/tests/integration_cli_tool_use.py @@ -0,0 +1,58 @@ +"""Live integration check: drive a real CLI-provider worker end-to-end. + +Run manually: python -m tests.integration_cli_tool_use + +Burns one subscription round-trip per provider. Skipped automatically +if the CLI is not on PATH. Not wired into the normal unittest suite. +""" + +import shutil +import tempfile +from pathlib import Path + +from core.agents import AgentDispatcher +from core.tools import ToolRegistry + + +TASK = ( + "Your one job: create a file named hello.txt in the workspace " + "containing exactly the three-word sentence 'integration test ok', " + "then confirm by listing the files. Once done, reply with a short " + "success message and no further tool calls." +) + + +def _run(provider: str) -> dict: + binary = {"claude_cli": "claude", "codex_cli": "codex"}[provider] + if shutil.which(binary) is None: + return {"provider": provider, "skipped": f"{binary} not on PATH"} + + dispatcher = AgentDispatcher(provider=provider) + with tempfile.TemporaryDirectory() as tmp: + workspace = Path(tmp) + registry = ToolRegistry(workspace) + try: + result = dispatcher.dispatch_worker("writing", TASK, registry) + except Exception as exc: + return {"provider": provider, "error": repr(exc)} + + hello = workspace / "hello.txt" + return { + "provider": provider, + "tool_calls": result.get("tool_calls", 0), + "file_created": hello.exists(), + "file_content": hello.read_text() if hello.exists() else None, + "response_tail": (result.get("response", "") or "")[-200:], + } + + +def main(): + for provider in ("claude_cli", "codex_cli"): + print(f"\n=== {provider} ===") + outcome = _run(provider) + for k, v in outcome.items(): + print(f" {k}: {v}") + + +if __name__ == "__main__": + main() diff --git a/tests/test_tool_use_loop.py b/tests/test_tool_use_loop.py new file mode 100644 index 0000000..b6c295f --- /dev/null +++ b/tests/test_tool_use_loop.py @@ -0,0 +1,295 @@ +"""Tests for the worker tool-use loop in core/agents.py. + +The loop must: + - run multiple turns until the model stops emitting blocks, + - feed each tool's output back as a in the next user turn, + - respect max_turns as a hard ceiling, + - surface PID/log_file from a launch_experiment tool call so the EXECUTE + → MONITOR handoff in core/loop.py still works, + - ignore malformed JSON bodies without crashing. +""" + +import json +import tempfile +import unittest +from pathlib import Path +from unittest.mock import patch + +from core.agents import AgentDispatcher +from core.tools import ToolRegistry + + +def _make_dispatcher(): + # Provider choice is irrelevant because we stub _call_llm everywhere. + return AgentDispatcher(provider="anthropic") + + +class ParseToolCallsTests(unittest.TestCase): + def test_extracts_multiple_blocks_in_order(self): + text = """ + I will take two actions. + + {"name": "read_file", "args": {"path": "a.txt"}} + + Some prose in between. + + {"name": "write_file", "args": {"path": "b.txt", "content": "hi"}} + + """ + calls = AgentDispatcher._parse_tool_calls(text) + self.assertEqual([c["name"] for c in calls], ["read_file", "write_file"]) + self.assertEqual(calls[1]["args"]["path"], "b.txt") + + def test_empty_when_no_blocks(self): + self.assertEqual(AgentDispatcher._parse_tool_calls("final answer"), []) + + def test_skips_malformed_json(self): + text = """ + {"name": "ok", "args": {}} + {not valid json + "just a string" + """ + calls = AgentDispatcher._parse_tool_calls(text) + self.assertEqual(len(calls), 1) + self.assertEqual(calls[0]["name"], "ok") + + def test_ignores_tool_calls_inside_code_fences(self): + """LLMs often illustrate the protocol inside fenced blocks when + explaining themselves; those illustrations must NOT execute.""" + text = ''' + Here is how you would call the tool: + + ``` + + {"name": "write_file", "args": {"path": "DANGER.txt", "content": "pwned"}} + + ``` + + But for this task I do not need any tools. + ''' + self.assertEqual(AgentDispatcher._parse_tool_calls(text), []) + + def test_mix_of_fenced_illustration_and_real_call(self): + """A real top-level call must still be picked up even if the message + also contains an illustrative fenced example.""" + text = ''' + For reference, the general form looks like: + + ``` + + {"name": "write_file", "args": {"path": "EXAMPLE.txt", "content": "x"}} + + ``` + + Now I will do the real call: + + + {"name": "read_file", "args": {"path": "actual.txt"}} + + ''' + calls = AgentDispatcher._parse_tool_calls(text) + self.assertEqual(len(calls), 1) + self.assertEqual(calls[0]["name"], "read_file") + self.assertEqual(calls[0]["args"]["path"], "actual.txt") + + def test_tolerates_missing_args_key(self): + """A tool_call without an `args` field is valid; args default to {}.""" + text = '{"name": "list_files"}' + calls = AgentDispatcher._parse_tool_calls(text) + self.assertEqual(len(calls), 1) + self.assertEqual(calls[0]["name"], "list_files") + + def test_rejects_args_that_is_not_a_dict(self): + """If the LLM emits args as a string or list, the dispatcher must + surface a structured error rather than crashing on **kwargs.""" + dispatcher = _make_dispatcher() + registry = _FakeRegistry( + tools=[{"name": "read_file", "description": "", "input_schema": {}}], + outputs={}, + ) + turns = [ + '{"name": "read_file", "args": "not-a-dict"}', + "giving up", + ] + with patch.object(dispatcher, "_call_llm", side_effect=turns): + dispatcher.dispatch_worker("writing", "t", registry) + # The registry must NOT have been called with a non-dict args payload. + self.assertEqual(registry.calls, []) + + +class RenderToolsSectionTests(unittest.TestCase): + def test_renders_schema_properties(self): + tool = { + "name": "search_papers", + "description": "Search Semantic Scholar.", + "input_schema": { + "type": "object", + "properties": { + "query": {"type": "string", "description": "Search query"}, + "limit": {"type": "integer", "description": "Max results"}, + }, + "required": ["query"], + }, + } + rendered = AgentDispatcher._render_tools_section([tool]) + self.assertIn("search_papers", rendered) + self.assertIn("", rendered) + self.assertIn("query", rendered) + self.assertIn("required", rendered) + self.assertIn("Max results", rendered) + + def test_empty_when_no_tools(self): + self.assertEqual(AgentDispatcher._render_tools_section([]), "") + + +class _FakeRegistry: + """Minimal ToolRegistry stub that records calls and returns canned outputs.""" + + def __init__(self, tools, outputs): + self._tools = tools + self._outputs = outputs # dict: tool_name -> json string + self.calls: list[tuple[str, dict]] = [] + + def get_tools_for(self, agent_type): + return self._tools + + def execute_tool(self, name, args): + self.calls.append((name, args)) + return self._outputs.get(name, json.dumps({"ok": True})) + + +class DispatchWorkerLoopTests(unittest.TestCase): + def test_none_registry_raises_clear_typeerror(self): + """Passing tool_registry=None must fail at the boundary with a + clear TypeError, not with a cryptic AttributeError deep in the + loop. External callers of dispatch_worker will hit this edge.""" + dispatcher = _make_dispatcher() + with self.assertRaises(TypeError) as ctx: + dispatcher.dispatch_worker("writing", "task", None) + self.assertIn("tool_registry", str(ctx.exception)) + self.assertIn("get_tools_for", str(ctx.exception)) + + def test_unknown_agent_type_raises_before_touching_registry(self): + """Agent-type validation should happen first so that a bad + agent_type produces ValueError regardless of registry state.""" + dispatcher = _make_dispatcher() + with self.assertRaises(ValueError): + dispatcher.dispatch_worker("bogus_agent", "task", None) + + def test_terminates_when_response_has_no_tool_calls(self): + dispatcher = _make_dispatcher() + registry = _FakeRegistry(tools=[], outputs={}) + + with patch.object(dispatcher, "_call_llm", return_value="all done, no tools"): + result = dispatcher.dispatch_worker("writing", "task", registry) + + self.assertEqual(registry.calls, []) + self.assertEqual(result["agent"], "writing") + self.assertIn("all done", result["response"]) + + def test_executes_tools_and_feeds_results_back(self): + dispatcher = _make_dispatcher() + fake_tools = [{"name": "read_file", "description": "read", + "input_schema": {"type": "object", "properties": {}}}] + registry = _FakeRegistry( + tools=fake_tools, + outputs={"read_file": json.dumps({"content": "file body"})}, + ) + + turns = [ + '{"name": "read_file", "args": {"path": "a.txt"}}', + "Done reading, here is my summary.", + ] + call_log: list[list] = [] + + def fake_call(system, messages): + call_log.append(list(messages)) + return turns.pop(0) + + with patch.object(dispatcher, "_call_llm", side_effect=fake_call): + result = dispatcher.dispatch_worker("writing", "task", registry) + + # One tool call was executed with the right args. + self.assertEqual(registry.calls, [("read_file", {"path": "a.txt"})]) + # Second LLM turn saw the assistant's tool_call and a user tool_result. + self.assertEqual(len(call_log), 2) + second_turn_messages = call_log[1] + self.assertEqual(second_turn_messages[0]["role"], "user") # original task + self.assertEqual(second_turn_messages[1]["role"], "assistant") # tool call echo + self.assertEqual(second_turn_messages[2]["role"], "user") # tool_result block + self.assertIn("{"name": "launch_experiment", ' + '"args": {"command": "python train.py", "log_file": "exp.log"}}' + '', + # Deliberately give a prose reply that lies about the PID — tool result should win. + "Training started, PID=99999 (this number is wrong).", + ] + + with patch.object(dispatcher, "_call_llm", side_effect=turns): + result = dispatcher.dispatch_worker("code", "launch it", registry) + + self.assertTrue(result["experiment_launched"]) + self.assertEqual(result["pid"], 4321) + self.assertEqual(result["log_file"], "/tmp/exp.log") + + def test_end_to_end_with_real_registry(self): + """Smoke test with a real ToolRegistry and a temp workspace.""" + dispatcher = _make_dispatcher() + with tempfile.TemporaryDirectory() as tmp: + workspace = Path(tmp) + registry = ToolRegistry(workspace) + + turns = [ + '{"name": "write_file", ' + '"args": {"path": "note.txt", "content": "hello"}}', + '{"name": "read_file", ' + '"args": {"path": "note.txt"}}', + "I wrote and read the file successfully.", + ] + + with patch.object(dispatcher, "_call_llm", side_effect=turns): + result = dispatcher.dispatch_worker("writing", "do it", registry) + + self.assertEqual(result["tool_calls"], 2) + self.assertTrue((workspace / "note.txt").exists()) + self.assertEqual((workspace / "note.txt").read_text(), "hello") + + +if __name__ == "__main__": + unittest.main()