Dataset: 88 Claude Code threads, 3.85B tokens, 2026-04-19 measurement on warehouse-full.sqlite.
The obvious way to measure “does practice X help?” is to split threads into those that used X and those that didn’t, then compare outcomes. On this dataset, this split shows a 20× token gap and 22× duration gap between the two groups — but it is almost entirely a task-size confound. The practice fires on threads big enough to need it, not on threads where it would help.
We wanted to test H-015: do AI-collaboration practices (code review, discovery/Explore, QA-pass) correlate with better outcomes (fewer tokens, shorter duration, fewer retries)?
Method: split 88 Claude threads by whether they had ≥1 Skill:code-review or Agent:pr-review-toolkit:code-reviewer event, compare outcome distributions.
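The split comparison itself is simple; here is a minimal sketch, assuming a hypothetical `threads` table with a precomputed `has_code_review` flag (column names are illustrative, not the actual warehouse schema):

```python
# Split threads by practice presence and compare median total_tokens.
# The in-memory table is a synthetic stand-in for warehouse-full.sqlite.
import sqlite3
import statistics

def median_split(rows):
    """rows: (has_practice, total_tokens) tuples. Returns per-group medians."""
    with_p = [tok for has, tok in rows if has]
    without = [tok for has, tok in rows if not has]
    return statistics.median(with_p), statistics.median(without)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE threads (id INTEGER, has_code_review INTEGER, total_tokens INTEGER)"
)
conn.executemany(
    "INSERT INTO threads VALUES (?, ?, ?)",
    [(i, 1, 40_000_000 + i) for i in range(18)]       # 18 practice-present threads
    + [(100 + i, 0, 2_000_000 + i) for i in range(70)],  # 70 practice-absent threads
)
rows = conn.execute("SELECT has_code_review, total_tokens FROM threads").fetchall()
m_with, m_without = median_split(rows)
print(f"median with: {m_with:,.0f}  without: {m_without:,.0f}  ratio: {m_with / m_without:.1f}x")
```

The synthetic numbers reproduce the ~20× headline ratio; the point of the rest of this note is why that ratio is not an effect size.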
| Split | With practice (n) | Without practice (n) | Median ratio (with / without) |
|---|---|---|---|
| code_review × total_tokens | 50M (18) | 2.5M (70) | 20× |
| code_review × output_tokens | 164k (18) | 10k (70) | 17× |
| code_review × duration | 27,500 s (18) | 1,300 s (70) | 22× |
| code_review × main_sessions | 1 (18) | 1 (70) | 1× |
| code_review × subagent_sessions | 5 (18) | 0 (70) | n/a (both in single digits) |
| discovery (Explore) × total_tokens | 13M (15) | 2.5M (73) | 5× |
| discovery × duration | 23,400 s (15) | 1,300 s (73) | 17× |
Reading this table as “code-review reduces rework!” would be wrong. Reading it as “code-review causes 20× more tokens!” would be equally wrong. The right reading is:
Practice-presence and task size are correlated because the practice is invoked when it’s worth invoking. Code-review fires on threads that produced code substantial enough to warrant review. Explore fires on tasks complex enough that the agent decides it needs to look around first. Trivial threads — “what does this error mean?”, “format this date” — never trigger either practice, because there’s nothing to review or explore.
So the 20× gap is measuring task complexity stratification, not practice outcome.
The split is not comparing “users who did X” vs “users who didn’t” — it’s comparing “tasks that needed X” vs “tasks that didn’t”.
None of these are free; they are listed in order of “cheapest that might work”:
- If Explore fires in the first 20% of a thread, is the remaining 80% more efficient (lower tokens-per-Edit, shorter time-to-commit) than in threads where Explore never fired early? Task size is held constant per thread.

All four need more methodology than a split comparison. All four also need enough practice-present n to get confidence intervals — on this dataset, with 18 code-review threads and 15 discovery threads, effect-size confidence bounds are wide enough that only large effects would be detectable.
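The early-Explore comparison can be sketched as a within-thread computation; this is a hedged illustration, not the actual pipeline — the event-tuple shape and the 20% cutoff are assumptions:

```python
# For one thread, check whether Explore fired in the first 20% of events,
# then measure tokens-per-Edit over the remainder of the thread.

def early_explore_efficiency(events, cutoff=0.2):
    """events: ordered (tool_name, output_tokens) tuples for one thread.
    Returns (explored_early, tokens_per_edit_after_cutoff); the second
    element is None when the tail contains no Edit events."""
    split = int(len(events) * cutoff)
    explored_early = any(tool == "Explore" for tool, _ in events[:split])
    tail = events[split:]
    edits = sum(1 for tool, _ in tail if tool == "Edit")
    if edits == 0:
        return explored_early, None
    return explored_early, sum(tok for _, tok in tail) / edits

# Toy thread: Explore up front, then mostly Edits.
thread = [("Explore", 500), ("Read", 200)] + [("Edit", 100)] * 8
print(early_explore_efficiency(thread))
```

Comparing this tail-efficiency metric between early-Explore and no-early-Explore threads of the same size is what the item above proposes; the split table cannot see it.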
After shipping the derived_practice_events table (Agent + Skill tool_use extractor), we re-ran the split on a larger warehouse — 160 threads instead of 88 — with size buckets by message count.
| Size bucket (messages) | n_with | n_without | median total tokens with | median total tokens without | ratio |
|---|---|---|---|---|---|
| XS (≤20) | 9 | 35 | 1.1M | 0.4M | 2.9× |
| S (21–50) | 10 | 31 | 6.6M | 2.6M | 2.5× |
| M (51–100) | 7 | 17 | 20.0M | 7.9M | 2.5× |
| L (101–200) | 10 | 21 | 43.3M | 18.8M | 2.3× |
| XL (>200) | 10 | 10 | 115.3M | 102.2M | 1.1× |
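The size-matched comparison in the table can be sketched as follows; bucket edges mirror the table, and the thread tuples are synthetic placeholders for the 160-thread warehouse:

```python
# Bucket threads by message count, then compute the with/without
# median-token ratio inside each bucket (only buckets with both groups).
import statistics

BUCKETS = [("XS", 0, 20), ("S", 21, 50), ("M", 51, 100), ("L", 101, 200), ("XL", 201, 10**9)]

def bucket_of(n_messages):
    for name, lo, hi in BUCKETS:
        if lo <= n_messages <= hi:
            return name

def per_bucket_ratio(threads):
    """threads: (n_messages, has_practice, total_tokens) tuples."""
    groups = {}
    for n, has, tok in threads:
        groups.setdefault((bucket_of(n), bool(has)), []).append(tok)
    out = {}
    for name, _, _ in BUCKETS:
        w, wo = groups.get((name, True)), groups.get((name, False))
        if w and wo:
            out[name] = statistics.median(w) / statistics.median(wo)
    return out

threads = [(10, 1, 1_100_000), (10, 0, 400_000), (150, 1, 43_300_000), (150, 0, 18_800_000)]
print(per_bucket_ratio(threads))
```

The within-bucket ratio is the number that survives the confound: it compares practice-present and practice-absent threads of similar size rather than pooling trivial and huge tasks together.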
Two things happen when you size-match: the headline 20× gap collapses to roughly 2.5× in every bucket below XL, and a large share of the remaining gap turns out to be the subagent sessions the practice itself spawns — that is the work the Agent does, so this component is definitional, not an “inefficiency.”

So the honest statement is: same-size threads that invoke practices spend ~2.5× more tokens, about half of which is the subagent overhead the practice itself creates, and about half is heavier main-session context-per-turn. The XL bucket’s 1.1× ratio hints that this overhead saturates at large thread sizes — but n=10/10 is too small to claim that.
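The “about half is subagent overhead” decomposition can be sketched by splitting each thread’s tokens into main-session and subagent-session totals before comparing groups; the tuple fields here are assumptions about what the warehouse exposes:

```python
# Decompose the size-matched gap: compare main-session medians across
# groups, and measure what fraction of practice-present tokens came
# from subagent sessions (which practice-absent threads rarely have).
import statistics

def decompose(threads):
    """threads: (has_practice, main_tokens, subagent_tokens) tuples.
    Returns (main_session_ratio, median_subagent_share_when_present)."""
    main_w = [m for h, m, s in threads if h]
    main_wo = [m for h, m, s in threads if not h]
    shares = [s / (m + s) for h, m, s in threads if h and (m + s)]
    return statistics.median(main_w) / statistics.median(main_wo), statistics.median(shares)

threads = [
    (1, 5_000_000, 5_000_000),  # practice-present: half of tokens in subagents
    (1, 6_000_000, 4_000_000),
    (0, 4_000_000, 0),          # practice-absent: no subagent sessions
    (0, 3_000_000, 0),
]
main_ratio, sub_share = decompose(threads)
print(f"main-session ratio: {main_ratio:.2f}x  subagent share: {sub_share:.0%}")
```

If the main-session ratio alone is well below 2.5× and the subagent share is near half, the residual gap is context weight per turn rather than rework — which is the shape of the claim above.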
This is still not an effectiveness measurement. We have controlled for thread size, not for task difficulty or outcome quality. “Did the practice produce a better result?” remains unanswered here; all we have shown is that size-matching does not explain the gap away.