ai-agents-metrics

F-004 — Cross-thread file-rework signal exists, but N=66 is too small for effectiveness claims

Dataset: 66 Claude Code threads with ≥1 Edit / Write / MultiEdit event, 2026-04-19 measurement on warehouse-full.sqlite.

TL;DR

Following F-003, which ruled out a naive practice-effectiveness split, we tested a more robust outcome variable: does a thread’s edit to file X come back as a follow-up edit to X in a later thread within 30 days? The signal is real (61% of implementation threads, 40 of 66, have a rework follow-up) and can be measured without relying on retry variance. But at N=66 threads, the differences between practice splits are within noise: we cannot distinguish “code review reduces rework by 13%” from “no effect.”

Setup

An implementation thread is a Claude thread with ≥1 Edit, Write, or MultiEdit tool_use event (66 of 88 Claude threads, 75%). For each such thread we extract the set of file paths it touched. A rework chain exists from thread A to thread B if the two threads touched ≥1 common file and B started between 1 hour and 30 days after A.
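The chain definition above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the `Thread` record, its field names, and the `rework_chains` helper are hypothetical, the gap is assumed to be measured start-to-start, and both window boundaries are assumed to be inclusive.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import combinations

MIN_GAP = timedelta(hours=1)   # shorter gaps are treated as the same work session
MAX_GAP = timedelta(days=30)   # longer gaps are no longer counted as rework


@dataclass(frozen=True)
class Thread:
    """Hypothetical record; field names are assumptions, not the warehouse schema."""
    thread_id: str
    started_at: datetime
    files: frozenset  # paths touched by Edit / Write / MultiEdit events


def rework_chains(threads):
    """Return (earlier_id, later_id, shared_files) for every rework chain."""
    chains = []
    ordered = sorted(threads, key=lambda t: t.started_at)
    # combinations() on the sorted list yields each pair once, earlier thread first.
    for a, b in combinations(ordered, 2):
        shared = a.files & b.files
        gap = b.started_at - a.started_at  # assumption: start-to-start gap
        if shared and MIN_GAP <= gap <= MAX_GAP:
            chains.append((a.thread_id, b.thread_id, shared))
    return chains


# Toy data, invented for illustration only.
t0 = datetime(2026, 4, 1, 9, 0)
threads = [
    Thread("A", t0, frozenset({"src/app.py", "README.md"})),
    Thread("B", t0 + timedelta(minutes=30), frozenset({"src/app.py"})),
    Thread("C", t0 + timedelta(days=3), frozenset({"src/app.py"})),
]
chains = rework_chains(threads)
```

In the toy data, A and B share a file but start only 30 minutes apart, so that pair is excluded by the 1-hour floor; A to C and B to C both qualify.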

Result — signal exists

| Dimension | Count |
|---|---|
| Implementation threads | 66 |
| Distinct files touched | 408 |
| Files touched by ≥2 threads | 76 (18.6%) |
| Thread-pairs sharing ≥1 file | 308 |
| Rework chains (gap 1h–30d) | 292 |
| Implementation threads with ≥1 rework follow-up | 40 (61%) |
| Median gap (days) between original and rework | 3.0 |
| Median shared files per pair | 1 |

61% of implementation threads have a downstream thread that re-edited at least one of the same files within a month. That is enough variance to serve as an outcome variable, and it does not depend on the retry-based structural signal (which is zero on this dataset per F-001).

Result — practice-effect is within noise

Applying the same practice-splits as F-003, but to rework_rate (reworked_files / files_touched):

| Split | Any-rework | Mean rework_rate | Interpretation |
|---|---|---|---|
| code_review WITH (n=18) | 56% | 0.281 | any-rework 6pp lower; mean rework_rate 13% lower (relative) |
| code_review WITHOUT (n=48) | 62% | 0.323 | baseline |
| discovery WITH (n=15) | 67% | 0.352 | higher than without on both measures |
| discovery WITHOUT (n=51) | 59% | 0.300 | baseline |
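For reference, the per-thread rework_rate (reworked_files / files_touched) could be computed along these lines. This is a sketch under assumptions: the function name, its arguments, and the choice to count a file as reworked when any later thread in the 1h–30d window touches it are illustrative, not taken from the project's code.

```python
from datetime import datetime, timedelta


def rework_rate(files_touched, started_at, other_threads,
                min_gap=timedelta(hours=1), max_gap=timedelta(days=30)):
    """Fraction of a thread's files re-edited by a later thread inside the window.

    other_threads: iterable of (started_at, files) pairs (hypothetical shape).
    """
    if not files_touched:
        return 0.0
    reworked = set()
    for other_start, other_files in other_threads:
        # Only threads that start 1h to 30d after this one count as rework.
        if min_gap <= other_start - started_at <= max_gap:
            reworked |= files_touched & other_files
    return len(reworked) / len(files_touched)


# Toy example, invented for illustration: 4 files touched, 1 reworked in window.
start = datetime(2026, 4, 1)
rate = rework_rate(
    {"a.py", "b.py", "c.py", "d.py"},
    start,
    [
        (start + timedelta(days=2), {"a.py"}),    # inside the window
        (start + timedelta(days=40), {"b.py"}),   # outside the window, ignored
    ],
)
```

The mean rework_rate column is then the average of this value over the threads in each split, and the any-rework column is the share of threads with a rate above zero.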

Honest interpretation: the practice effect is not distinguishable from noise at this sample size. Both a reduction and an increase in rework are compatible with the data.
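A rough way to check the noise claim is a confidence interval for the difference in any-rework proportions between the WITH and WITHOUT groups. The sketch below uses a simple Wald interval, which is crude at these group sizes and is only meant to show the order of magnitude. The per-group counts (10/18, 30/48, 10/15, 30/51) are an assumption, back-derived from the reported percentages and group sizes; they are not taken from the warehouse.

```python
import math


def diff_ci(k1, n1, k2, n2, z=1.96):
    """Wald interval for p1 - p2, where p_i = k_i / n_i (z=1.96 gives ~95%)."""
    p1, p2 = k1 / n1, k2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Assumed counts, back-derived from the table's percentages:
# code_review WITH 10/18 (56%) vs WITHOUT 30/48 (62%)
cr_lo, cr_hi = diff_ci(10, 18, 30, 48)
# discovery WITH 10/15 (67%) vs WITHOUT 30/51 (59%)
di_lo, di_hi = diff_ci(10, 15, 30, 51)

print(f"code_review any-rework diff, 95% CI: [{cr_lo:+.2f}, {cr_hi:+.2f}]")
print(f"discovery   any-rework diff, 95% CI: [{di_lo:+.2f}, {di_hi:+.2f}]")
```

Under these assumed counts both intervals span zero with room to spare, which matches the reading that neither split shows a detectable effect at N=66.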

What this means for ambitions

Caveats and known confounds