ai-agents-metrics

Findings

Measurements from running ai-agents-metrics against a single developer’s 6-month Claude Code + Codex history (3.85B tokens, 160 threads, 334 sessions). Each finding is surprising or non-obvious enough that future analyses of AI-agent history should know about it.

All numbers are verifiable: the pipeline is deterministic and the warehouse schema is documented in warehouse-layering.md and data-schema.md.

Index

# Finding Headline number
F-001 100% of Claude “retries” are subagent spawns, not user retries 160 / 160 threads have main_attempt_count = 1
F-002 Claude’s role='user' is mostly not human-typed 86.7% of role='user' events are tool_result or template
F-003 Naive practice-effectiveness split is size-confounded 20× naive gap collapses to ~2.5× after size-matching; half is subagent overhead
F-004 Cross-thread file-rework signal is detectable but N=66 too small for claims 61% of implementation threads have a rework follow-up; practice effect is within noise
F-005 Discovery and commit-automation dominate real AI-coding practice usage 29% of threads use any practice; discovery=39% of events, code_review=14%
F-006 AI-agent retros overwhelmingly describe meta-tooling failures, not code failures 58% of 59 retros land in 4 meta-tooling themes (packaging, lifecycle, policy, data)
F-007 Within-thread, messages near a practice event use half the tokens of messages far from any practice median 2.05× gap, 95% CI [1.23×, 2.47×], p=0.000456; Agent events (2.01× agg) ≫ Skill events (1.20× agg) as predicted by compression mechanism
F-008 Per-skill decomposition: subagent-style practices compress 2.5-3.25×, commit skills don’t compress (point estimate <1.0×) Explore 2.63× CI [1.71, 3.51] with 13/13 sign test; code-reviewer 3.25×; code-review:code-review 2.55×; vs commit 0.72×, commit-push-pr 0.76×

Status

These are N=1 findings. They describe what we measured on one developer’s history, not what is universally true across all AI-agent users. They are published because the mechanisms (subagent-aliased retry counts, role='user' pollution, size-confounded splits) are properties of the underlying tools (Claude Code, Codex) and will affect anyone running similar analyses — not just this dataset.

For discussion or replication on your own history: see history-pipeline.md for how to build the warehouse, warehouse-layering.md for the data model.