ai-agents-metrics

F-006 — Retrospectives in an AI-agent workflow overwhelmingly describe meta-tooling failures, not product-code failures

Dataset: 59 retro files in docs/private/retros/, committed between 2026-03-29 and 2026-04-16 (19 days, single developer on the codex-metrics repo). Classified 2026-04-19.

TL;DR

Across 59 retrospective files written during a single-developer AI-agent project, four meta-tooling themes — packaging/install, lifecycle/workflow, policy (AGENTS.md), and data/warehouse — account for 58% of retros. The remaining 42% spread across product/PM, testing, git/CI, refactors, and dev-workflow fixes. Almost none are “the agent wrote wrong product logic.” Writing AI-agent software breaks at the seams between tools, not inside the code the tools help produce.

Retro cadence is front-loaded: 49 of 59 retros land in the first 7 days (including a 32-retro bulk-reflection day on 2026-03-29), and only ~5 per week thereafter. The infrastructure-pain rate visibly decays as the workflow stabilizes.

Why measure this at all

When someone asks “does writing retros help?” the clean causal answer would be pre/post behavior metrics around each retro. That experiment is not possible on this dataset — see Limitations. But a descriptive question is still answerable: what do AI-agent retros actually describe? The answer shapes where teams should invest guardrail effort.

Method

For each file in docs/private/retros/:

Classification keywords are listed in the analysis script in-line; the mapping is intentionally simple so anyone can reproduce by rerunning against filenames alone.

Results — retros by theme

Theme Retros Share
packaging_install 9 15.3%
lifecycle_workflow 9 15.3%
policy_agents_md 8 13.6%
data_warehouse 8 13.6%
product_pm 7 11.9%
testing_quality 5 8.5%
git_ci 4 6.8%
refactor_arch 3 5.1%
unclassified 3 5.1%
dev_workflow 2 3.4%
cli_ux 1 1.7%

The top four themes are all meta-tooling. Packaging/install covers PEP 639 classifier conflicts, standalone-binary drift, bootstrap-marker rename, venv-install staleness. Lifecycle/workflow covers task-start/finish guards, late-commit recovery, handoff QA. Policy covers AGENTS.md boundary rules, external-policy overreach, invariant normalization. Data/warehouse covers history-pipeline audits, cost coverage, usage-recovery format mismatch, model tracking.

Only 12% of retros are product-PM. Even within those, most describe framing/positioning or hypothesis-method issues — not “we built the wrong feature.”

Zero retros describe the agent writing semantically wrong code. The incidents are shape-of-the-system problems: permissions, paths, renames, lifecycle ordering, policy sync. When an AI agent is responsible for implementation, the failure surface shifts upward — humans retrospect on the scaffolding, not the code.

Results — retro cadence decay

Window Retros Daily rate
Day 0 (2026-03-29) — bulk reflection 32 n/a
Week 0, excluding bulk day (2026-03-30 → 2026-04-04) 17 2.8 / day
Week 1 (2026-04-05 → 2026-04-11) 5 0.7 / day
Week 2 (2026-04-12 → 2026-04-16, partial) 5 1.0 / day

After the initial backfill burst, per-day retro density drops ~4× from week-0-residual to week-1 and stays flat around 1/day. This is consistent with the infrastructure-maturing interpretation: once the scaffolding is settled, fewer scaffolding failures surface. It is not strong evidence of a learning effect — cadence could drop for many reasons (less new work, fatigue, batching) — but the rate change is not subtle.

Limitations

1. Retro-treatment effect is unmeasurable on this dataset. The warehouse (derived_goals.first_seen_at from warehouse-full.sqlite) starts on 2026-03-29 — the same day as the first retros. There is no pre-retro behavior window to compare against. For the three post-treatment metrics we would care about (retry ratio, tokens-per-message, practice-event density), there is no valid counterfactual.

Even within the recorded window, the 2026-03-29 bulk-reflection day produces ~54% of all retros as effectively one treatment, with only 4-5 threads of post-treatment data before the warehouse cutoff. Any causal claim would be fitting noise.

2. Classification is keyword-based, not semantic. 19 of 59 retros match multiple themes; we assign the first match. A proper thematic analysis would require reading each file, which is not needed for the top-theme-share claim but would sharpen mid-tier categories.

3. N=1 developer. All retros are from a single developer’s habit of writing them. The 58% meta-tooling concentration may be a personal pattern, a tooling-immaturity pattern, or an AI-agent-workflow pattern. Cross-developer replication is required to separate these.

Implications

For anyone running AI-agent projects:

Reproduction