By Parsa Khazaeepoul, co-founder of Pane. Tested every agent manager in this comparison set in production.
The whole kit is public. If a number on the Q2 run page looks wrong, you should be able to re-run it on your hardware in under an afternoon and either reproduce the number or open a correction issue with your delta. This page walks through how.
claude-sonnet-4-6 at temperature 0. Other models will produce different workflow numbers.
corepack enable on a fresh machine.
gh repo clone dcouple/agent-manager-benchmark
cd agent-manager-benchmark
pnpm install
The repository's packages/ directory holds the five target packages: small, realistic TypeScript files with console.log calls scattered through them. The fixed task spec in TASK.md is what each manager will be given.
Open your target manager, spawn four parallel panes or workspaces, and wait for each to be ready. Then, from the kit's root:
# macOS / Linux
./scripts/measure-resource.sh pane
# Windows (PowerShell)
.\scripts\measure-resource.ps1 pane
The script prompts for the launcher PID, capture-process-set.sh walks descendants via pgrep -P, and the totals are written to runs/<run-id>/<manager>.json. Repeat for each manager installed on your platform.
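If you want to sanity-check a total by hand, the descendant walk can be sketched roughly as below. This is not the kit's capture-process-set.sh. It is a minimal illustration that assumes pgrep -P <pid> prints direct child PIDs one per line and that ps -o rss= -p <pid> prints resident memory in KiB. Resident memory as the measured resource is itself an assumption here, and the function names children_of, descendants, rss_of, and total_rss_kib are hypothetical.

```shell
# Sketch only: these function names are hypothetical, not taken from the kit.

# Print the direct children of a PID, one per line.
# pgrep exits non-zero when nothing matches, so "|| true" keeps that quiet.
children_of() {
  pgrep -P "$1" || true
}

# Print every descendant of a PID (children, grandchildren, ...) depth-first.
descendants() {
  local child
  for child in $(children_of "$1"); do
    echo "$child"
    descendants "$child"
  done
}

# Print resident set size in KiB for one PID (0 if the process has exited).
# Assumption: RSS is the resource of interest; the kit may record others.
rss_of() {
  local rss
  rss=$(ps -o rss= -p "$1" 2>/dev/null | tr -d ' ')
  echo "${rss:-0}"
}

# Sum RSS over the launcher PID and all of its descendants.
total_rss_kib() {
  local pid total=0
  for pid in "$1" $(descendants "$1"); do
    total=$(( total + $(rss_of "$pid") ))
  done
  echo "$total"
}
```

Calling total_rss_kib with the launcher PID prints a single KiB figure. Processes that exit between the walk and the ps call count as 0, so a hand check can come out slightly below the script's total.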
Each manager has its own keystroke script in workflow/<manager>.md. Open the file, start your screen recorder, and follow the steps exactly. Every line ends with [click] or [key] so you can count.
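Because every step line ends in a marker, the manual tally can be cross-checked mechanically. The sketch below assumes the markers appear literally as [click] and [key] at the end of a line, optionally followed by trailing whitespace. count_steps is a hypothetical helper, not part of the kit.

```shell
# Sketch only: count_steps is a hypothetical helper, not part of the kit.
# Prints "<clicks> <keys>" for one keystroke script.
count_steps() {
  local file="$1" clicks keys
  # grep -c prints the number of matching lines. It exits 1 on zero matches
  # (while still printing 0), so "|| true" keeps this usable under "set -e".
  clicks=$(grep -c '\[click\][[:space:]]*$' "$file" || true)
  keys=$(grep -c '\[key\][[:space:]]*$' "$file" || true)
  echo "$clicks $keys"
}
```

For example, count_steps workflow/pane.md prints the click and key counts for the Pane script. Markers that appear mid-line are deliberately not counted, so a mismatch with your hand count usually points to a skipped step or to a line that does not end in a marker.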
Open a PR against the kit repo with your runs/<run-id>/ directory. Disagreements with a number on a published run page go through the correction template; disputes with a methodology rule use the methodology-question template.
Methodology details live on the methodology page. Latest published numbers are on /benchmarks/2026-q2.