They Gated the Fix. We Open-Sourced the Whole Harness.

Parsa · 2026-03-31

The Claude Code leak keeps giving.

Earlier today I wrote about the cobbler's children - the code quality findings. 460 eslint-disables. 4,683-line files. TODO comments that don't understand their own errors. That post was about what Anthropic didn't do. This one is about what they did do - and kept for themselves.

Someone reverse-engineered the leaked source against their own agent logs and found something worse than messy code. They found that Anthropic knows exactly where Claude Code breaks - hallucination, context decay, lazy fixes, silent truncation - and built fixes for all of it. Then they gated those fixes behind process.env.USER_TYPE === 'ant'. Anthropic employees get the working version. You get the version that reports "Done!" after introducing 40 type errors.

Here's the part that hit me: every single problem they identified, and every fix they gated, maps directly to something we figured out independently over the last six months. Not because we had access to their source code. Because we ran thousands of commits through the pipeline, hit every failure mode, and encoded the fix into the harness. The difference is we open-sourced all of it.

- - - - - - - - - - - - - - - -

the verification gate we already have

The biggest finding from the reverse-engineering: Claude Code's success metric for a file write is "did bytes hit disk?" Not "does the code compile?" Not "did I introduce type errors?" Just: did the write operation complete? The source contains explicit instructions telling the agent to verify its work - run tests, check output, confirm correctness. Those instructions only activate for Anthropic employees.

Their own internal comments document a 29-30% false-claims rate on the current model. They know almost a third of "Done!" messages are lies. They built the fix. They kept it.

We built the same fix six months ago.

Our implementer agent runs typecheck → lint → format after every major section of work. The implementation reviewer runs it again after all work is done. The prepare-pr skill runs it before creating the pull request. Three checkpoints. Same commands every time. pnpm run typecheck && pnpm run lint --max-warnings=0. If the agent introduces a type error in step 3, the loop catches it before step 4. The error never compounds.

This isn't a CLAUDE.md override. It's not a prompt injection hoping the agent listens. It's a mechanical invariant baked into the skill definitions that run automatically every session. The agent can't report success until the gates pass. Not because we asked it nicely. Because the harness won't let it proceed.

The tweet's suggested fix: "In your CLAUDE.md, make it non-negotiable: after every file modification, run npx tsc --noEmit and npx eslint . --quiet." That's the right idea. It's also the weakest possible implementation. A CLAUDE.md directive is a suggestion. The agent might follow it. It might not. It might follow it for 10 messages and then forget after compaction wipes the context. A skill definition that runs the commands as a gate - where the workflow literally stops if they fail - is a guarantee. The difference between "please verify" and "you cannot proceed until verification passes" is the difference between a suggestion and infrastructure.
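The enforcement part fits in a few lines. Here's a minimal shell sketch of such a gate - the check commands below are stand-ins so the sketch is self-contained; a real harness would run the pnpm commands above:

```shell
#!/usr/bin/env sh
# Minimal sketch of a mechanical verification gate: every check must pass
# before the workflow may proceed. The stand-in checks are illustrative.
run_gate() {
  for check in "$@"; do
    if ! sh -c "$check"; then
      echo "gate FAILED: $check" >&2
      return 1
    fi
  done
  echo "gate: all checks passed"
}

# In a real harness the checks would be the post's commands, e.g.:
#   run_gate "pnpm run typecheck" "pnpm run lint --max-warnings=0"
run_gate "true" "true"   # prints "gate: all checks passed"
```

The point isn't the script - it's that a failing check returns nonzero and the workflow stops there, instead of the agent deciding whether to comply.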

- - - - - - - - - - - - - - - -

context decay is a codebase problem

The second finding: auto-compaction fires at ~167,000 tokens. When it does, it keeps 5 files capped at 5K tokens each, compresses everything else into a 50,000-token summary, and throws away every file read, every reasoning chain, every intermediate decision. Gone.

The tweet frames this as a Claude Code limitation you route around by keeping tasks small. That's true. But it misses the deeper point.

A dirty codebase accelerates compaction. Every dead import, every unused export, every orphaned prop, every 4,683-line file is eating tokens that contribute nothing to the task but everything to triggering the compaction threshold. The tweet says "Step 0 of any refactor must be deletion." Correct. But Step 0 should have happened months ago, on every file, as part of the continuous maintenance that keeps the codebase agent-readable.

This is what I wrote about in the last post. @typescript-eslint/no-explicit-any: 'error'. Path aliases over relative imports. Shared types in a shared package. One way to do everything. Thin pages with orchestration hooks. These aren't aesthetic preferences. They're token efficiency measures. A 200-line file with one responsibility costs a fraction of the context window compared to a 4,683-line file where the agent needs 20 lines. The other 4,663 lines are pure waste - tokens that push you closer to compaction without contributing to the task.
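The waste is easy to put numbers on. A back-of-envelope sketch using the post's figures - the ~13 tokens-per-line ratio is an assumed average for illustration, not a measured constant:

```shell
# Rough context-budget math. The tokens-per-line ratio is an assumption;
# the file sizes and compaction threshold come from the post.
tokens_per_line=13
budget=167000                          # approximate auto-compaction threshold
big=$((4683 * tokens_per_line))        # the 4,683-line file
small=$((200 * tokens_per_line))       # a focused 200-line file
echo "4,683-line file: ~$big tokens, $((big * 100 / budget))% of the budget"
echo "200-line file: ~$small tokens, $((small * 100 / budget))% of the budget"
```

Under those assumptions, one bloated file burns over a third of the budget in a single read; the focused file costs about 1%.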

Our linter commands spawn parallel agents to fix type errors, clean unused imports, optimize React hooks, all in one pass. We run them regularly. Not when a refactor starts. Continuously. The codebase stays clean between tasks so the token budget is available when the tasks begin.

The tweet says "keep each phase under 5 files so compaction never fires mid-task." That's a band-aid. The fix is a codebase where 5 files is all the agent ever needs to load because each file is focused, small, and self-contained.

- - - - - - - - - - - - - - - -

the brevity mandate we already override

The third finding: Claude Code's system prompts contain explicit instructions that fight your intent. "Try the simplest approach first." "Don't refactor code beyond what was asked." "Three similar lines of code is better than a premature abstraction." The tweet calls these "a brevity mandate" that produces band-aid fixes instead of architectural solutions.

This one made me smile. The tweet's suggested override: "Ask: 'What would a senior, experienced, perfectionist dev reject in code review? Fix all of it. Don't be lazy.'"

That's a prompt. It works sometimes. Here's what works every time.

Our review system runs 11 principle-specific review agents in parallel. Not one prompt asking the agent to be thorough. Eleven separate agents, each checking a specific dimension: single responsibility, code reuse, readability, scope, anti-patterns, architecture patterns, frontend standards, error handling, state management, performance, and security. The one labeled "single way to do things" is flagged as the most critical.

On top of that, we run a Codex adversarial review loop. Different model. Different training data. Completely different opinions about code quality. Claude is too generous with its own code - this is the brevity mandate in action, Claude reviewing Claude's minimal output and finding it acceptable. Codex will straight up tell you "this is bad code" in a way Claude won't.

The system prompt says "try the simplest approach." Our plan template says "no aspirations, only instructions. Replace completely, don't patch." The /plan skill has a confidence scale of 1-10. Below 8, keep iterating. The plan reviewer checks for open questions, scope issues, and vague instructions. Only when the plan is airtight does /implement run.

You don't override the brevity mandate with a prompt. You override it with a pipeline that makes the agent think thoroughly before it writes a single line, and then reviews the output from eleven angles plus a competing model before it ships.
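Mechanically, the fan-out is simple. A shell sketch of the pattern - independent checks run in parallel, and the pass fails if any one of them fails. The echo commands are stand-ins, not the actual agent definitions:

```shell
# Launch independent review checks in parallel; fail the pass if any fails.
pids=""
for check in "echo single-responsibility: ok" \
             "echo error-handling: ok" \
             "echo security: ok"; do
  sh -c "$check" &
  pids="$pids $!"
done

status=0
for pid in $pids; do
  wait "$pid" || status=1   # one failing reviewer fails the whole pass
done
if [ "$status" -eq 0 ]; then
  echo "review: all agents passed"
fi
```

Each reviewer runs in its own process with no shared state, which is the same isolation property the real agents get from separate contexts.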

- - - - - - - - - - - - - - - -

the agent swarm we already built the cockpit for

The fourth finding: Claude Code has a multi-agent orchestration system built in. Parallel sub-agents, each with their own isolated context and token budget. No hardcoded worker limit. Anthropic built it and never surfaced it. The tweet recommends forcing sub-agent deployment by batching files into groups of 5-8 and launching in parallel.

This is Pane.

Not conceptually. Literally. This is what Pane does.

Each pane is an agent session with its own git worktree, its own context, its own terminal. Three panes, three worktrees, three agents, all running simultaneously on the same repo. Ctrl+Up/Down to cycle between them. Session persistence means you close your laptop, reopen, everything's still running. The agents don't share context, so compaction in pane 1 doesn't affect pane 2. Each agent has the full 167K token budget.

The tweet suggests using Claude Code's hidden orchestrator mode. We built the orchestrator as the product. Any agent. Any number of parallel sessions. Each isolated. Each persistent. Each with the full harness - AGENTS.md, CLAUDE.md, lint configs, skill definitions - inherited automatically from the repo root via worktrees.

If it runs in a terminal, it runs in Pane. You're not limited to Claude Code's internal orchestrator. You can run Claude in one pane, Codex in another, Aider in a third. Different agents for different tasks. Different models with different strengths. The swarm isn't hidden behind a build flag. It's the product.
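The isolation underneath is plain git. A sketch of the layout Pane manages - one worktree per agent session, all backed by the same repository. Paths and branch names here are hypothetical:

```shell
# One worktree per agent session: isolated checkouts, shared object store.
base=$(mktemp -d)
cd "$base"
git init -q repo
cd repo
git -c user.email=agent@example.com -c user.name=agent \
    commit -q --allow-empty -m "root"

for task in fix-types review-auth refactor-hooks; do
  # each pane would launch its agent inside ../$task
  git worktree add -b "agent/$task" "../$task"
done
git worktree list   # the main checkout plus three agent worktrees
```

Because worktrees inherit everything committed at the repo root, each session gets the full harness - configs, skills, agent definitions - for free.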

- - - - - - - - - - - - - - - -

the 2,000-line blind spot that shouldn't exist

The fifth finding: each file read is hard-capped at 2,000 lines / 25,000 tokens. Everything past that is silently truncated. The agent hallucinates the rest.

The tweet's fix: "Any file over 500 LOC gets read in chunks using offset and limit parameters."
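For what it's worth, that workaround is mechanical too. A shell sketch of bounded reads - the 500-line window mirrors the tweet's suggestion, and the generated file is a stand-in for a source file too large to read at once:

```shell
# Read a large file in fixed-size windows so no single read exceeds a cap.
file=$(mktemp)
seq 1 1200 > "$file"   # stand-in for an oversized source file

chunk=500
total=$(wc -l < "$file")
offset=1
while [ "$offset" -le "$total" ]; do
  end=$((offset + chunk - 1))
  [ "$end" -gt "$total" ] && end=$total
  sed -n "${offset},${end}p" "$file" > /dev/null   # one bounded read
  echo "read lines $offset-$end"
  offset=$((end + 1))
done
```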

Our fix: files over 500 LOC shouldn't exist.

I wrote in the last post about thin pages and orchestration hooks. Page files are JSX composition only. All logic lives in hooks. Our architecture review agent checks this. A 200-line file with one responsibility is a unit an agent can hold in its head. There's nothing to truncate because there's nothing past the cap.

Anthropic's own main.tsx is 4,683 lines. Their agent can only read 2,000 of them at a time. They built the tool that has a 2,000-line read cap, then built a 4,683-line file inside the tool. More than half of their own main entry point - 2,683 lines of it - is invisible to their own agent in a single read.

This is the cobbler's children all over again. The read cap is a known constraint. The architectural response is files small enough that the cap never matters. Instead of building small files, they built a workaround recommendation. Instead of fixing the codebase, they're telling you to fix yours.

- - - - - - - - - - - - - - - -

what we actually learned

Every finding in this reverse-engineering thread maps to something we figured out independently. Not because we're smarter than Anthropic's engineers. Because we ran the pipeline every day for six months and hit every wall.

The tweet's CLAUDE.md override is 10 directives. Good directives. They'll help. But they're prompts. The agent might follow them. It might forget them after compaction. They have no enforcement mechanism.

Our harness is 53 files in the .claude directory. 5 agent definitions. 8 skills with their own SKILL.md files. 34 slash commands. A plan template with built-in validation gates. 11 parallel review agents. Cross-model adversarial review. Mechanical invariants that are part of the build system, not part of the prompt.

Anthropic gated the fix behind an employee flag. We open-sourced the whole harness.

- - - - - - - - - - - - - - - -

the real lesson

I keep coming back to the same insight: the harness matters more than the agent.

Anthropic has the best model. They have internal access to features nobody else can use. They have employee-only verification loops. And their own codebase still has 460 eslint-disables and TODO comments that don't understand their own errors.

We don't have internal access to anything. We use the same Claude Code everyone else does, on the same $200/month Max plan. But we built the harness around it - the skills, the agents, the review system, the mechanical invariants, the monorepo structure - and the output quality is fundamentally different.

The tweet's author did valuable work. They identified real problems and proposed real fixes. But the fixes are CLAUDE.md overrides - instructions you paste into a file and hope the agent follows. That's a starting point, not a solution.

The solution is the pipeline. /discussion until you understand the problem. /plan until the spec is airtight. /implement with built-in verification gates and multi-model review. The harness runs the same way every session. It doesn't get compacted. It doesn't forget directives. It doesn't depend on the agent reading a markdown file and choosing to comply.

Anthropic proved that knowing the problems isn't enough. They documented a 29-30% false-claims rate and gated the fix. We found the same 29-30% failure rate empirically and built the fix into the pipeline so it runs automatically, for everyone, on every commit.

The commands, the skills, the agent definitions, the review system - all of it is in the repo. Open source. Copy it, adapt it, use it. You don't need employee access. You need the harness.

- - - - - - - - - - - - - - - -

All of our Claude Code commands, skills, agent definitions, AGENTS.md, and CLAUDE.md are open source: github.com/Dcouple-Inc/Pane/.claude

Previous posts in this series: