By Parsa Khazaeepoul, co-founder of Pane. Methodology pre-registered before measurements. .
These rules are published before the first set of measurements. The page you're reading was timestamped before any number on /benchmarks/2026-q2 was captured — pre-registration is the credibility play. If a metric moves against Pane on a quarterly run, we publish it. The public kit is the receipt.
Vendor benchmarks earn skepticism because the rules tend to follow the result. Pre-registering the metrics, the process-set rule, the trial count, and the statistical handling before any data exists forces the story to follow the data. It also gives competitors and the community a target to argue with on day one rather than after the fact.
Three categories. Two appear in the first run; capability is already addressed on the compare pages and is not duplicated here.
memory at N=4 parallel agents (MB)
Total resident set size (RSS) of the launcher process and every descendant process measured 60 seconds after four panes have been spawned and each has a Claude agent actively producing tokens. RSS for the process set is the sum, not the max. Captured via ps -o rss= on macOS/Linux and Get-Process | WorkingSet64 on Windows.
disk overhead per worktree (x)
Disk usage of the manager's worktree directory divided by the disk usage of the source repository, both measured with du -sh. Reported as a multiplier. Git-worktree-based managers should sit near 1x; copy-checkout approaches near 3x.
cold-start time (ms)
Wall-clock milliseconds from the application launch command to the first moment the app accepts input (cursor in the first text field, or the first interactive command palette response). Operator-visual with screen recording where the app exposes no programmatic ready event.
steps from task to PR opened (actions)
Total count of keystrokes and clicks the operator performs from a fresh app state to a PR open in the browser. The task is fixed: replace every console.log in five packages of the kit'slogger-{a..e} packages with a logger.info helper. Counted from the workflow script for each manager; screen recording is the receipt.
time-to-status-awareness (ms)
Milliseconds from the agent emitting its last token to a user-visible signal: notification, color change, sound. Where no event hook exists, this is operator-visual with a stopwatch and screen recording. The methodology is honest about which managers require manual timing.
Many managers spawn child processes — Conductor has a daemon plus helper procs; Electron-based apps have a main process plus renderer plus per-pane shells. Memory is the total RSS of the launcher PID and all descendants. The kit ships scripts/capture-process-set.sh (and its PowerShell counterpart) which walks pgrep -P recursively and sums the resident set. Sampling only the foreground GUI process is the single most common way to make a benchmark lie.
N=5 trials per cell. We publish median, min, and max for every measurement. Median over mean because agent latency has long tails and a single stalled trial can lie. Min/max bracket because at N=5 the standard deviation is harder to read than the raw range. The per-trial JSON is in runs/<run-id>/<manager>.json in the kit repo.
Every workflow measurement runs against claude-sonnet-4-6 at temperature 0 with the same fixed system prompt (in the kit's TASK.md). Pinning the agent removes a major source of cross-trial variance and lets the benchmark measure the manager rather than the model's sampling noise. Resource measurements still depend on the agent's output rate, which is why we report them as a process-set sample rather than peak.
Three platforms: macOS 14.x, Windows 11, Ubuntu 24.04. Each platform gets its own table on the run page. Managers that don't support a platform appear with N/A in that platform's column — we don't cross-platform-mix results, and we don't pretend a tool we couldn't install is comparable to one we did. cmux is macOS-only and shows N/A for Windows and Linux. Crystal and Claude Squad are tmux layers — they run on macOS and Linux but not Windows. Conductor is Mac-only as of the current release.
Capability — "does this manager support feature X" — is a boolean, not a number, and the compare pages already host those tables. Visual polish is subjective; we cite it on compare pages as a qualitative note but don't score it. Vendor roadmap commitments are deliberately excluded — features that ship next quarter are next quarter's benchmark.
When Pane loses a row, the row is published exactly as measured. The verdict text names the winner. We don't move metric definitions between runs to make a category go our way. If a metric is genuinely broken — measures something other than what it claims — we deprecate it across all managers in the next run with a footnote that explains why. The first-run page's placeholder verdicts make this concrete: every row will read "winner: X" when the data lands, and X will not always be Pane.
Open a correction issue on the kit repo. Disputes about the methodology itself — "this rule isn't fair to my tool" — use the methodology-question template instead. Numbers updates route to a re-run; methodology updates route to the next quarter's pre-registration.
Want to re-run this on your hardware? See the reproduce page.