# ai-agents-metrics: Decision Log

**What this document is:** Key architectural and design decisions — why things are the way they are.

**When to read this:**

**Related docs:**


## Summary

New entries should follow the format below. Add entries as decisions are made or recalled — not just for new work, but also when the reasoning behind existing choices becomes clear.

## Decision title

**Context:** Why this decision was needed.

**Decision:** What was decided.

**Trade-offs:** Known costs or limitations.

## Append-only NDJSON event log, not a mutable JSON file

**Context:** The tool tracks metrics for AI agent tasks across parallel git worktrees. The data must be stored persistently without causing merge conflicts.

**Decision:** metrics/events.ndjson is the source of truth: an append-only NDJSON log where each CLI command appends one line. State is reconstructed at read time by replaying all events in file order, last-write-wins per goal_id / entry_id. The summary is always computed in memory; it is never stored.

**Trade-offs:**

**Supersedes:** the earlier decision to use metrics/ai_agents_metrics.json as a mutable JSON file (removed from git tracking; added to .gitignore).

**Why this works:** An append-only log eliminates git merge conflicts — parallel worktrees each append new lines, and git merge can automatically concatenate them without conflict. The file is human-readable and git-diffable, and replay is deterministic and can be inspected independently.
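The replay step can be sketched as follows; the event shape and function name are illustrative, not the tool's actual code:

```python
import json

def replay(path):
    """Rebuild goal state by replaying the append-only NDJSON event log.

    Later lines override earlier ones (last-write-wins per goal_id);
    the summary is always derived from this dict, never stored.
    """
    goals = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            goals[event["goal_id"]] = event  # last write wins
    return goals
```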


## fcntl for cross-process locking, not threading.Lock

**Context:** Multiple CLI invocations may run concurrently against the same events.ndjson.

**Decision:** storage.metrics_mutation_lock uses fcntl.flock.

**Trade-offs:** fcntl is POSIX-only — no Windows support. Acceptable because the tool targets macOS/Linux developer environments.

**Why this works:** The tool runs as short-lived CLI processes, not a long-lived server, so concurrent access comes from multiple processes rather than from threads within one process. fcntl.flock locks across processes; threading.Lock does not.
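A minimal sketch of such a lock. The name mirrors storage.metrics_mutation_lock, but the body and signature here are assumptions, not the real helper:

```python
import fcntl
from contextlib import contextmanager

@contextmanager
def metrics_mutation_lock(lock_path):
    """Exclusive cross-process lock via fcntl.flock (POSIX-only sketch)."""
    with open(lock_path, "a+") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until no other process holds it
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```

Because the lock lives on a file descriptor, two independent CLI processes opening the same lock file serialise correctly; a threading.Lock in either process would be invisible to the other.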


## History pipeline as a separate SQLite warehouse

**Context:** The Codex agent stores session history in ~/.codex/state_5.sqlite and ~/.codex/logs_1.sqlite. The tool needs to derive goal history from this raw data.

**Decision:** A three-stage pipeline (ingest → normalize → derive) with an intermediate SQLite warehouse at .ai-agents-metrics/warehouse.db, separate from the primary JSON store.

**Trade-offs:** Inter-stage contracts exist only as SQLite column names, not Python types (tracked in ARCH-006).

**Why this works:**


## Timestamps stored as ISO strings in the event log

**Context:** GoalRecord and AttemptEntryRecord have started_at and finished_at fields that must round-trip through JSON.

**Decision:** Timestamps are stored as ISO 8601 strings in events.ndjson. The in-memory Python representation is datetime | None, parsed and serialised exclusively in domain/serde.py.

**Trade-offs:** Two parse functions exist (parse_iso_datetime and parse_iso_datetime_flexible) because the input format is not normalised at the boundary. This is a known weakness tracked in ARCH-003.

**Why this works:** ISO strings serialise directly to JSON without a custom encoder and remain human-readable in the event log.


## HTML report uses warehouse as primary source for token and retry data

**Context:** render-html initially read all four charts from the ndjson ledger. The ledger starts at the first manually-tracked goal (2026-04-07); the warehouse covers all sessions from the first ingest (2026-03-31). Three of the four charts therefore showed only ~4 days of history even though the full project history was available in the warehouse.

**Decision:** Charts 2 (Retry Pressure) and 3 (Token Cost) are warehouse-first: they query derived_goals JOIN derived_session_usage for per-thread token counts and retry counts. The ledger remains the sole source for Charts 1 and 4, which require goal_type and cost_usd — fields present only in manually-tracked goals.

**Trade-offs:**

**Why this works:** The warehouse has a full token breakdown per thread back to the first ingest, with no migration needed. The mismatch between ledger and warehouse date ranges is surfaced as an explicit UX feature via section headers (“Goals Ledger” vs “Session History”) rather than hidden.
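The warehouse-first query might look like the sketch below. Only the table names derived_goals and derived_session_usage come from this decision; the column names (thread_id, model, total_tokens, retry_count) and the sample rows are invented for illustration:

```python
import sqlite3

# Stand-in schema and data -- the real warehouse columns may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE derived_goals (goal_id TEXT, thread_id TEXT, retry_count INTEGER);
CREATE TABLE derived_session_usage (thread_id TEXT, model TEXT, total_tokens INTEGER);
INSERT INTO derived_goals VALUES ('g1', 't1', 2);
INSERT INTO derived_session_usage VALUES ('t1', 'opus', 1200);
""")
rows = conn.execute("""
    SELECT g.goal_id, u.model,
           SUM(u.total_tokens) AS tokens,
           MAX(g.retry_count)  AS retries
    FROM derived_goals g
    JOIN derived_session_usage u ON u.thread_id = g.thread_id
    GROUP BY g.goal_id, u.model
""").fetchall()
print(rows)  # [('g1', 'opus', 1200, 2)]
```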


## Chart 3 stacks by model instead of by token category

**Context:** Chart 3 originally stacked input / cached-input / output tokens (or cost). Product QA (ARCH-017) showed this answered “what share is cached?” — a token-composition question — rather than “where is my money going?” — the primary cost-tracking question for a user running multiple models at different prices (e.g. Opus vs Sonnet).

**Decision:** Chart 3 now stacks one series per model. Colors are assigned deterministically from a fixed 8-color palette sorted by model name, so the same model always gets the same color across runs. The reserved “unknown” bucket is pinned last in slate.

**Trade-offs:**

**Why this works:** Model breakdown is the actionable cost dimension for agent-assisted workflows. ARCH-016 populated model on every warehouse table, making this chart trustworthy from the first render.
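A sketch of the deterministic color assignment described above; the hex values and function name are illustrative, not the tool's actual palette:

```python
PALETTE = [
    "#4e79a7", "#f28e2b", "#e15759", "#76b7b2",
    "#59a14f", "#edc948", "#b07aa1", "#9c755f",
]  # fixed 8-color palette (illustrative values)
SLATE = "#64748b"  # reserved for the "unknown" bucket, always pinned last

def assign_model_colors(models):
    """Same model -> same color on every run: sort names, walk the palette."""
    named = sorted(m for m in set(models) if m != "unknown")
    colors = {m: PALETTE[i % len(PALETTE)] for i, m in enumerate(named)}
    if "unknown" in models:
        colors["unknown"] = SLATE
    return colors
```

Because assignment depends only on the sorted set of names, the mapping is stable regardless of the order models appear in the data.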


## html_report.py split into four focused modules

**Context:** html_report.py grew to 1084 lines as the HTML template, aggregation logic, date helpers, and public API accumulated in one file. Diffs and code review were impractical; the ~730-line template string dominated the file.

**Decision:** The file is split into:

**Trade-offs:** Three new private modules with underscore-prefixed names. Tests import from the sub-modules directly.

**Why this works:** Each module has exactly one reason to change. The public import surface is preserved; commands.py and any downstream code importing from html_report require no changes.


## cli.py as a re-export facade

**Context:** Early in the project, external scripts and tests imported symbols directly from cli.py before the module structure was stable.

**Decision:** cli.py re-exports ~50 symbols from domain, reporting, and storage to maintain backward compatibility.

**Trade-offs:** Any code importing from cli pulls in the entire CLI layer as a dependency, and adding a new domain function requires updating the re-export list. This is a known weakness tracked in ARCH-001. ARCH-032 (2026-04-22) removed 9 re-exports that were kept only for a reflective test pattern; test_metrics_domain.py now imports directly from usage_resolution / pricing_runtime / runtime_facade.


## Oversized-file splits into packages (ARCH-027 / ARCH-028 / ARCH-034)

**Context:** By mid-April 2026 four modules had drifted past 900 lines: commands.py (1340), runtime_facade.py (927), history/ingest.py (1152), and cli.py (1091). Files that large strain human review and exceed the single-tool-call budget for AI-agent contributors.

**Decision:** Split each into a package that preserves the import surface.

| Before | After | Direction |
| --- | --- | --- |
| commands.py | commands/install.py, history.py, tasks.py, report.py, misc.py, _runtime.py, __init__.py | cluster-per-command |
| runtime_facade.py | runtime_facade/orchestration.py, costs.py, mutations.py, __init__.py | mutations → costs → orchestration |
| history/ingest.py | history/ingest/warehouse.py, codex.py, claude.py, __init__.py | adapters → warehouse |
| cli.py | cli.py (dispatch + facade) + cli_parsers.py (argparse) + cli_constants.py (paths) | extract, not package |

**Why packages, not just more files:** Each __init__.py re-exports the full public surface, so existing importers (from ai_agents_metrics import commands, from ai_agents_metrics.history.ingest import IngestSummary) resolve unchanged. scripts/metrics_cli.py’s reflective globals().update(vars(cli)) shim keeps working without edits. Tests that imported private helpers (_encode_claude_cwd, _ensure_schema, etc.) continue to work via __init__.py re-exports.

**Trade-offs:**

**Why this works:** The direction of dependency is one-way inside each package (validated by lint-imports). No file now exceeds the pylint max-module-lines = 1000 threshold. The only remaining too-many-lines suppressions were stale and have been dropped (ARCH-031).
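The facade mechanism these splits rely on can be demonstrated in miniature. The package and symbol names here (metricspkg, run_task) are made up for illustration; only the pattern matches the real splits:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a tiny package on disk: after a "split", __init__.py re-exports
# the moved symbol so the pre-split import path still resolves.
root = Path(tempfile.mkdtemp())
(root / "metricspkg").mkdir()
(root / "metricspkg" / "tasks.py").write_text("def run_task():\n    return 'ok'\n")
(root / "metricspkg" / "__init__.py").write_text("from .tasks import run_task\n")
sys.path.insert(0, str(root))
pkg = importlib.import_module("metricspkg")
print(pkg.run_task())  # callers never see that run_task moved to tasks.py
```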


## mypy --strict globally (ARCH-030)

**Context:** Strict type-checking was partial: [tool.mypy] enabled a handful of individual flags (check_untyped_defs, no_implicit_optional, disallow_incomplete_defs), and ARCH-029 introduced a per-module override for domain/* and history/* using the explicit strict flag set.

**Decision:** Promote strict = true to the top-level [tool.mypy] section. All 65 source files (src/ + scripts/) now pass mypy --strict.

**Trade-offs:**

**Why this works:** The codebase was already mostly strict-clean thanks to years of incremental typing. The cost of turning the screw the rest of the way was measured (3 fixes total) and locked in via the global config.
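The promoted setting is a one-line change in pyproject.toml (fragment only; the individual flags listed above become redundant because strict mode implies them):

```toml
[tool.mypy]
strict = true
```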