Reproduce the benchmark

By Parsa Khazaeepoul, co-founder of Pane. Tested every agent manager in this comparison set in production. Last reviewed May 2026.

The whole kit is public. If a number on the Q2 run page looks wrong, you should be able to re-run it on your own hardware in an afternoon and either reproduce the number or open a correction issue with your delta. This page walks through how.

what you'll need

→ a modern laptop: any machine from the last three years that can run an Electron app.
→ macOS, Windows, or Linux: the kit ships bash scripts for macOS and Linux and PowerShell scripts for Windows. cmux is macOS-only, so it is recorded as N/A in the Windows and Linux rows.
→ an Anthropic API key: the benchmark pins claude-sonnet-4-6 at temperature 0. Other models will produce different workflow numbers.
→ git, pnpm, and the GitHub CLI: the kit is a 5-package pnpm workspace, and the clone command uses gh. On a fresh machine, run corepack enable first to make pnpm available.
→ about three hours for a full sweep across the seven measured managers on one platform.

clone the kit

gh repo clone dcouple/agent-manager-benchmark
cd agent-manager-benchmark
pnpm install

The repository's packages/ directory holds the five target packages, each made up of small, realistic TypeScript files with console.log calls scattered through them. TASK.md contains the fixed task spec that each manager is given.

run the resource measurements

Open your target manager, spawn four parallel panes or workspaces, and wait for each to be ready. Then, from the kit's root, run the script for your platform with the manager's name as its argument (pane in this example):

# macOS / Linux
./scripts/measure-resource.sh pane

# Windows (PowerShell)
.\scripts\measure-resource.ps1 pane

The script prompts for the launcher PID. capture-process-set.sh then walks that process's descendants via pgrep -P, and the totals are written to runs/<run-id>/<manager>.json. Repeat for each manager installed on your platform.
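The descendant walk can be sketched in TypeScript, the language of the kit's packages. This is an illustration of the idea, not the kit's capture-process-set.sh: the ProcRow shape, the rssKb field, the function names, and the sample rows are all assumptions, and the sketch assumes the launcher itself counts toward the total.

```typescript
// Hypothetical row shape; the kit's real script works on live process data.
interface ProcRow {
  pid: number;
  ppid: number;
  rssKb: number; // resident memory in KiB (assumed unit)
}

// Collect the launcher plus every descendant, breadth-first.
// Each queue step plays the role of one `pgrep -P <pid>` call.
function collectProcessSet(rows: ProcRow[], launcherPid: number): ProcRow[] {
  const childrenOf = new Map<number, ProcRow[]>();
  let launcher: ProcRow | undefined;
  for (const row of rows) {
    if (row.pid === launcherPid) launcher = row;
    const siblings = childrenOf.get(row.ppid) ?? [];
    siblings.push(row);
    childrenOf.set(row.ppid, siblings);
  }
  if (launcher === undefined) return [];

  const seen = new Set<number>([launcherPid]);
  const collected: ProcRow[] = [launcher];
  const queue: number[] = [launcherPid];
  while (queue.length > 0) {
    const pid = queue.shift() as number;
    for (const child of childrenOf.get(pid) ?? []) {
      if (seen.has(child.pid)) continue; // guard against malformed input
      seen.add(child.pid);
      collected.push(child);
      queue.push(child.pid);
    }
  }
  return collected;
}

// Sum resident memory across the collected process set.
function totalRssKb(procs: ProcRow[]): number {
  return procs.reduce((sum, p) => sum + p.rssKb, 0);
}

// Made-up sample: pid 200 is unrelated to the launcher and must be excluded.
const sampleRows: ProcRow[] = [
  { pid: 100, ppid: 1, rssKb: 200000 },
  { pid: 101, ppid: 100, rssKb: 50000 },
  { pid: 102, ppid: 101, rssKb: 25000 },
  { pid: 200, ppid: 1, rssKb: 999999 },
];
const processSet = collectProcessSet(sampleRows, 100);
console.log(processSet.map((p) => p.pid), totalRssKb(processSet));
```

With the sample rows, the set contains pids 100, 101, and 102, and the total is 275000 KiB. The unrelated process 200 is left out, which is the point of walking from the launcher rather than matching on process name.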

run the workflow measurements

Each manager has its own keystroke script in workflow/<manager>.md. Open the file, start your screen recorder, and follow the steps exactly. Every line ends with [click] or [key] so you can count.
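Because every step carries a marker, the count can also be checked mechanically. A minimal sketch, assuming one step per line with the marker as the final token; the countActions name, the ActionCounts shape, and the sample steps are hypothetical, and lines without a marker (headings, notes) are tallied separately rather than guessed at.

```typescript
interface ActionCounts {
  clicks: number;
  keys: number;
  unmarked: number; // non-empty lines that end with neither marker
}

// Tally the [click] and [key] markers at the end of each non-empty line.
function countActions(scriptText: string): ActionCounts {
  const counts: ActionCounts = { clicks: 0, keys: 0, unmarked: 0 };
  for (const rawLine of scriptText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue;
    if (/\[click\]$/.test(line)) counts.clicks += 1;
    else if (/\[key\]$/.test(line)) counts.keys += 1;
    else counts.unmarked += 1;
  }
  return counts;
}

// Made-up steps, not taken from any real workflow/<manager>.md file.
const sampleSteps = [
  "Open the manager window [click]",
  "Focus the first pane [key]",
  "Paste the task spec [key]",
  "",
  "Confirm the run [click]",
].join("\n");

console.log(countActions(sampleSteps));
```

For the sample steps this yields two clicks, two keys, and no unmarked lines. A non-zero unmarked count on a real script would be worth checking by hand before trusting the totals.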

→ screen recording is the receipt. An operator might miscount; the recording proves the number.
→ use a stopwatch for tta-status where the manager has no event hook for "agent paused." The methodology page explains why.
→ five trials per cell. Median, min, and max go to the JSON output.
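The per-cell summary is simple arithmetic, sketched here for clarity. The summarizeTrials name and the TrialSummary field names are assumptions and may not match the kit's actual JSON schema; the trial values are made up.

```typescript
interface TrialSummary {
  median: number;
  min: number;
  max: number;
  trials: number;
}

// Reduce one cell's trial values to median, min, and max.
function summarizeTrials(values: number[]): TrialSummary {
  if (values.length === 0) throw new Error("need at least one trial");
  const sorted = values.slice().sort((a, b) => a - b); // numeric sort on a copy
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
  return {
    median,
    min: sorted[0],
    max: sorted[sorted.length - 1],
    trials: sorted.length,
  };
}

// Five made-up trial values for one cell.
const trialValues = [12, 11, 14, 12, 13];
console.log(JSON.stringify(summarizeTrials(trialValues)));
// {"median":12,"min":11,"max":14,"trials":5}
```

With five trials the median is the third value after sorting, so a single outlier trial moves min or max but not the headline number.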

submitting your results

Open a PR against the kit repo with your runs/<run-id>/ directory. Disagreements with a number on a published run page go through the correction template; disputes about a methodology rule use the methodology-question template.
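This page does not define the delta a correction should report. One plausible reading is the reproduced median minus the published one, alongside the percentage change. The sketch below assumes that reading; the deltaFromPublished name and the example numbers are hypothetical.

```typescript
interface Delta {
  absolute: number; // reproduced minus published
  percent: number; // absolute as a percentage of published; NaN if published is 0
}

// Compare a reproduced median against the published one.
function deltaFromPublished(published: number, reproduced: number): Delta {
  const absolute = reproduced - published;
  const percent = published === 0 ? NaN : (absolute / published) * 100;
  return { absolute, percent };
}

// Made-up numbers: published median 12, reproduced median 15.
console.log(deltaFromPublished(12, 15));
```

For the made-up pair above, the absolute delta is 3 and the percentage delta is 25. Reporting both keeps a correction readable whether the metric is a small count or a large memory figure.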

Methodology details live on the methodology page. Latest published numbers are on /benchmarks/2026-q2.