Is this benchmark independent?

No. We make Pane. The credibility backstop is the public kit repo at github.com/dcouple/agent-manager-benchmark — anyone can clone it, re-run on their hardware, and open a correction PR. Raw logs ship as JSON next to the run page so every number is auditable. If we lose a row, we lose it on the page.

Why median over mean?

Agent latency has long tails. One stalled trial can move a mean by hundreds of milliseconds while the median stays flat. We publish median plus min and max — that's three numbers per cell instead of one, and it's harder to hide an outlier inside.

Without a pinned agent and temperature, every trial measures both the manager AND the agent's randomness. Pinning claude-sonnet-4-6 at temperature 0 against a fixed system prompt removes that confound. The benchmark is measuring the manager; the agent is held constant.

What happens if a vendor disputes a number?

Open a correction issue on the kit repo. Cite the row, the manager, the platform, your expected value, and your evidence (logs, screen recording, methodology delta). We re-run, update the page if your evidence holds up, and link your issue from the vendor-responses section.

How often does this run?

Quarterly. Each run gets its own page at /benchmarks/{year}-{quarter}. Old runs stay published; we don't rewrite history when numbers change.

Can I add a manager not in this list?

Yes. Open a PR against the kit repo adding the manager's workflow script, a measure-resource entry, and the runs/ / .json file. We merge if the methodology is honored and add the manager to the next run.

← back to benchmarks

Benchmark methodology

By Parsa Khazaeepoul, co-founder of Pane. Methodology pre-registered before measurements. Last reviewed May 2026.

These rules are published before the first set of measurements. The page you're reading was timestamped before any number on /benchmarks/2026-q2 was captured — pre-registration is the credibility play. If a metric moves against Pane on a quarterly run, we publish it. The public kit is the receipt.

why we publish methodology before measurements

Vendor benchmarks earn skepticism because the rules tend to follow the result. Pre-registering the metrics, the process-set rule, the trial count, and the statistical handling before any data exists forces the story to follow the data. It also gives competitors and the community a target to argue with on day one rather than after the fact.

what we measure

Three categories. Two appear in the first run; capability is already addressed on the compare pages and is not duplicated here.

→ resource cost — memory at N parallel agents, disk overhead per worktree, cold-start time.
→ workflow friction — steps from task to PR opened, time from agent-paused to user-aware.
→ capability — covered on compare pages with feature tables; not re-measured here.

metric definitions

memory at N=4 parallel agents (MB)

Total resident set size (RSS) of the launcher process and every descendant process measured 60 seconds after four panes have been spawned and each has a Claude agent actively producing tokens. RSS for the process set is the sum, not the max. Captured via ps -o rss= on macOS/Linux and Get-Process | WorkingSet64 on Windows.

disk overhead per worktree (x)

Disk usage of the manager's worktree directory divided by the disk usage of the source repository, both measured with du -sh. Reported as a multiplier. Git-worktree-based managers should sit near 1x; copy-checkout approaches near 3x.

cold-start time (ms)

Wall-clock milliseconds from the application launch command to the first moment the app accepts input (cursor in the first text field, or the first interactive command palette response). Operator-visual with screen recording where the app exposes no programmatic ready event.

steps from task to PR opened (actions)

Total count of keystrokes and clicks the operator performs from a fresh app state to a PR open in the browser. The task is fixed: replace every console.log in five packages of the kit'slogger-{a..e} packages with a logger.info helper. Counted from the workflow script for each manager; screen recording is the receipt.

time-to-status-awareness (ms)

Milliseconds from the agent emitting its last token to a user-visible signal: notification, color change, sound. Where no event hook exists, this is operator-visual with a stopwatch and screen recording. The methodology is honest about which managers require manual timing.

process-set rule

Many managers spawn child processes — Conductor has a daemon plus helper procs; Electron-based apps have a main process plus renderer plus per-pane shells. Memory is the total RSS of the launcher PID and all descendants. The kit ships scripts/capture-process-set.sh (and its PowerShell counterpart) which walks pgrep -P recursively and sums the resident set. Sampling only the foreground GUI process is the single most common way to make a benchmark lie.

trial count + statistics

N=5 trials per cell. We publish median, min, and max for every measurement. Median over mean because agent latency has long tails and a single stalled trial can lie. Min/max bracket because at N=5 the standard deviation is harder to read than the raw range. The per-trial JSON is in runs/<run-id>/<manager>.json in the kit repo.

agent + model pinning

Every workflow measurement runs against claude-sonnet-4-6 at temperature 0 with the same fixed system prompt (in the kit's TASK.md). Pinning the agent removes a major source of cross-trial variance and lets the benchmark measure the manager rather than the model's sampling noise. Resource measurements still depend on the agent's output rate, which is why we report them as a process-set sample rather than peak.

platform scoping

Three platforms: macOS 14.x, Windows 11, Ubuntu 24.04. Each platform gets its own table on the run page. Managers that don't support a platform appear with N/A in that platform's column — we don't cross-platform-mix results, and we don't pretend a tool we couldn't install is comparable to one we did. cmux is macOS-only and shows N/A for Windows and Linux. Crystal and Claude Squad are tmux layers — they run on macOS and Linux but not Windows. Conductor is Mac-only as of the current release.

what we don't measure

Capability — "does this manager support feature X" — is a boolean, not a number, and the compare pages already host those tables. Visual polish is subjective; we cite it on compare pages as a qualitative note but don't score it. Vendor roadmap commitments are deliberately excluded — features that ship next quarter are next quarter's benchmark.

how we handle pane losing rows

When Pane loses a row, the row is published exactly as measured. The verdict text names the winner. We don't move metric definitions between runs to make a category go our way. If a metric is genuinely broken — measures something other than what it claims — we deprecate it across all managers in the next run with a footnote that explains why. The first-run page's placeholder verdicts make this concrete: every row will read "winner: X" when the data lands, and X will not always be Pane.

how to submit corrections

Open a correction issue on the kit repo. Disputes about the methodology itself — "this rule isn't fair to my tool" — use the methodology-question template instead. Numbers updates route to a re-run; methodology updates route to the next quarter's pre-registration.

frequently asked questions

Want to re-run this on your hardware? See the reproduce page.