What this document is: Key architectural and design decisions — why things are the way they are.
When to read this:
Related docs:
New entries should follow the format below. Add entries as decisions are made or recalled — not just for new work, but also when the reasoning behind existing choices becomes clear.
## Decision title
**Context:** Why this decision was needed.
**Decision:** What was decided.
**Trade-offs:** Known costs or limitations.
## Append-only event log as source of truth

**Context:** The tool tracks metrics for AI agent tasks across parallel git worktrees. The data must be stored persistently without causing merge conflicts.

**Decision:** `metrics/events.ndjson` is the source of truth — an append-only NDJSON log where each CLI command appends one line. State is reconstructed at read time by replaying all events in file order, last-write-wins per `goal_id` / `entry_id`. The summary is always computed in-memory; it is never stored.

**Trade-offs:**
- Every read replays the full log in a `load_metrics` call (acceptable for hundreds of goals).
- The `tasks` / `goals` legacy alias is normalised in-memory during replay, not persisted.

**Supersedes:** the earlier decision to use `metrics/ai_agents_metrics.json` as a mutable JSON file (removed from git tracking; added to `.gitignore`).
**Why this works:** An append-only log eliminates git merge conflicts — parallel worktrees each append new lines, and git merge can concatenate them automatically without conflict. The log is human-readable and git-diffable, and replay is deterministic and can be inspected independently.
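A minimal sketch of the append and replay operations, assuming a flat event schema keyed by `goal_id` (the real records also carry `entry_id` and more fields):

```python
import json
from pathlib import Path


def append_event(path: Path, event: dict) -> None:
    """Record one event as one NDJSON line. A single append, never a
    read-modify-write cycle, so parallel worktrees cannot conflict."""
    with path.open("a") as fh:
        fh.write(json.dumps(event) + "\n")


def replay_events(path: Path) -> dict[str, dict]:
    """Rebuild state by replaying the log in file order.

    Last-write-wins: a later event for the same goal_id overwrites the
    fields of earlier events for that goal.
    """
    goals: dict[str, dict] = {}
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            goals.setdefault(event["goal_id"], {}).update(event)
    return goals
```

The summary stays derived: callers aggregate over the replayed dict rather than persisting any computed state.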
## Cross-process locking with fcntl.flock

**Context:** Multiple CLI invocations may run concurrently against the same `events.ndjson`.

**Decision:** `storage.metrics_mutation_lock` serialises mutations using `fcntl.flock`.

**Trade-offs:** `fcntl` is POSIX-only — no Windows support. Acceptable because the tool targets macOS/Linux developer environments.
**Why this works:** The tool runs as short-lived CLI processes, not a long-lived server. Concurrent access comes from multiple processes, not threads within one process. `fcntl.flock` works across processes; `threading.Lock` does not.
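A sketch of what such a lock can look like — the helper name `mutation_lock` and the lock-file argument are illustrative stand-ins for `storage.metrics_mutation_lock`:

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def mutation_lock(lock_path: Path) -> Iterator[None]:
    """Exclusive cross-process lock around a mutation.

    flock() is advisory and tied to the open file description, so the
    kernel releases the lock automatically if the process crashes.
    """
    with lock_path.open("a") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until no other holder remains
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```

Because each CLI invocation is short-lived, holding the lock for the full mutation is cheap and keeps the append + any bookkeeping atomic with respect to other invocations.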
## Three-stage history pipeline with a SQLite warehouse

**Context:** The Codex agent stores session history in `~/.codex/state_5.sqlite` and `~/.codex/logs_1.sqlite`. The tool needs to derive goal history from this raw data.

**Decision:** A three-stage pipeline (ingest → normalize → derive) with an intermediate SQLite warehouse at `.ai-agents-metrics/warehouse.db`, separate from the primary JSON store.

**Trade-offs:** Inter-stage contracts exist only as SQLite column names, not Python types (tracked in ARCH-006).
**Why this works:**
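The staging idea can be illustrated with an in-memory sketch — the table names and columns here are hypothetical, not the real warehouse schema:

```python
import sqlite3


def run_pipeline(raw_rows: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Minimal ingest → normalize → derive sketch against an in-memory
    warehouse. Schema and names are illustrative only."""
    wh = sqlite3.connect(":memory:")
    # Stage 1 — ingest: land raw rows untouched, so later stages never
    # need to re-read the agent's live databases.
    wh.execute("CREATE TABLE raw_sessions (session_id TEXT, payload TEXT)")
    wh.executemany("INSERT INTO raw_sessions VALUES (?, ?)", raw_rows)
    # Stage 2 — normalize: one cleaned row per session.
    wh.execute(
        "CREATE TABLE norm_sessions AS "
        "SELECT session_id, TRIM(payload) AS payload FROM raw_sessions"
    )
    # Stage 3 — derive: aggregate into the shape downstream reports consume.
    return wh.execute(
        "SELECT payload, COUNT(*) FROM norm_sessions "
        "GROUP BY payload ORDER BY payload"
    ).fetchall()
```

Each stage reads only the previous stage's table, which is what makes the intermediate warehouse inspectable and each stage re-runnable in isolation.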
## ISO 8601 timestamps at the JSON boundary

**Context:** `GoalRecord` and `AttemptEntryRecord` have `started_at` and `finished_at` fields that must round-trip through JSON.

**Decision:** Timestamps are stored as ISO 8601 strings in `events.ndjson`. The in-memory Python representation is `datetime | None`, parsed and serialised exclusively in `domain/serde.py`.

**Trade-offs:** Two parse functions exist (`parse_iso_datetime` and `parse_iso_datetime_flexible`) because the input format is not normalised at the boundary. This is a known weakness tracked in ARCH-003.

**Why this works:** ISO strings serialise directly to JSON without a custom encoder and remain human-readable in the event log.
## Warehouse-first data sources for Charts 2 and 3

**Context:** `render-html` initially read all four charts from the NDJSON ledger. The ledger starts at the first manually-tracked goal (2026-04-07), while the warehouse covers all sessions from the first ingest (2026-03-31). Three of the four charts therefore showed only ~4 days of history even though full project history was available in the warehouse.

**Decision:** Charts 2 (Retry Pressure) and 3 (Token Cost) are warehouse-first: they query `derived_goals` JOIN `derived_session_usage` for per-thread token counts and retry counts. The ledger remains the sole source for Charts 1 and 4, which require `goal_type` and `cost_usd` — fields present only in manually-tracked goals.
**Trade-offs:**
**Why this works:** The warehouse has a full token breakdown per thread back to the first ingest, with no migration needed. The inconsistency between ledger and warehouse date ranges is surfaced as an explicit UX feature via section headers (“Goals Ledger” vs “Session History”) rather than hidden.
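The warehouse-first query shape can be sketched as below — the table names come from the decision above, but the column names and join key are assumptions:

```python
import sqlite3

# Hypothetical query behind Charts 2 and 3. derived_goals and
# derived_session_usage are the real table names; the columns and the
# goal_id join key are illustrative.
CHART_QUERY = """
SELECT g.goal_id,
       SUM(u.input_tokens)  AS input_tokens,
       SUM(u.output_tokens) AS output_tokens
FROM derived_goals AS g
JOIN derived_session_usage AS u ON u.goal_id = g.goal_id
GROUP BY g.goal_id
ORDER BY g.goal_id
"""


def chart_rows(conn: sqlite3.Connection) -> list[tuple[str, int, int]]:
    """Return per-goal token totals in a chart-ready shape."""
    return conn.execute(CHART_QUERY).fetchall()
```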
## Per-model stacking for the token-cost chart

**Context:** Chart 3 originally stacked input / cached-input / output tokens (or cost). Product QA (ARCH-017) showed this answered “what share is cached?” — a token-composition question — rather than “where is my money going?” — the primary cost-tracking question for a user running multiple models at different prices (e.g. Opus vs Sonnet).

**Decision:** Chart 3 now stacks one series per model. Colors are assigned deterministically from a fixed 8-color palette sorted by model name, so the same model always gets the same color across runs. The reserved “unknown” bucket is pinned last in slate.
**Trade-offs:**
**Why this works:** Model breakdown is the actionable cost dimension for agent-assisted workflows. ARCH-016 populated `model` on every warehouse table, making this chart trustworthy from the first render.
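The deterministic assignment can be sketched as follows — the palette hex values and helper name are illustrative; only the sort-then-assign rule and the pinned “unknown” bucket come from the decision:

```python
UNKNOWN_MODEL = "unknown"
SLATE = "#64748b"
# Hypothetical fixed 8-color palette; the real values live in the template.
PALETTE = [
    "#2563eb", "#dc2626", "#16a34a", "#9333ea",
    "#ea580c", "#0891b2", "#ca8a04", "#db2777",
]


def assign_colors(models: set[str]) -> dict[str, str]:
    """One color per model: sort the names, walk the fixed palette.

    Sorting makes the mapping independent of set iteration order, so a
    given set of models always yields the same colors across runs. The
    reserved "unknown" bucket is pinned last in slate.
    """
    named = sorted(m for m in models if m != UNKNOWN_MODEL)
    colors = {m: PALETTE[i % len(PALETTE)] for i, m in enumerate(named)}
    if UNKNOWN_MODEL in models:
        colors[UNKNOWN_MODEL] = SLATE
    return colors
```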
## Split of html_report.py into focused modules

**Context:** `html_report.py` grew to 1084 lines as the HTML template, aggregation logic, date helpers, and public API accumulated in one file. Diffs and code review were impractical; the ~730-line template string dominated the file.

**Decision:** The file is split into:
- `_report_buckets.py` — pure date/bucket helpers (no I/O, no side effects)
- `_report_aggregation.py` — all aggregation logic; `_apply_token_pricing` extracted to eliminate duplication between the warehouse and ledger token paths
- `_report_template.py` — the HTML/CSS/JS template string (inert data, no Python logic)
- `html_report.py` — a thin 37-line facade; the public API (`aggregate_report_data`, `render_html_report`) is unchanged

**Trade-offs:** Three new private modules with underscore-prefixed names. Tests import from the sub-modules directly.
**Why this works:** Each module has exactly one reason to change. The public import surface is preserved; `commands.py` and any downstream code importing from `html_report` requires no changes.
## cli.py re-export shim for backward compatibility

**Context:** Early in the project, external scripts and tests imported symbols directly from `cli.py` before the module structure was stable.

**Decision:** `cli.py` re-exports ~50 symbols from `domain`, `reporting`, and `storage` to maintain backward compatibility.

**Trade-offs:** Any code importing from `cli` pulls in the entire CLI layer as a dependency, and adding a new domain function requires updating the re-export list. This is a known weakness tracked in ARCH-001. ARCH-032 (2026-04-22) removed 9 re-exports that were kept only for a reflective test pattern; `test_metrics_domain.py` now imports directly from `usage_resolution` / `pricing_runtime` / `runtime_facade`.
## Splitting oversized modules into packages

**Context:** By mid-April 2026, four modules had drifted past 900 lines: `commands.py` (1340), `runtime_facade.py` (927), `history/ingest.py` (1152), and `cli.py` (1091). Files that large strain human review and exceed the single-tool-call budget for AI-agent contributors.

**Decision:** Split each into a package that preserves the import surface.
| Before | After | Direction |
|---|---|---|
| `commands.py` | `commands/` — `install.py`, `history.py`, `tasks.py`, `report.py`, `misc.py`, `_runtime.py`, `__init__.py` | cluster-per-command |
| `runtime_facade.py` | `runtime_facade/` — `orchestration.py`, `costs.py`, `mutations.py`, `__init__.py` | mutations → costs → orchestration |
| `history/ingest.py` | `history/ingest/` — `warehouse.py`, `codex.py`, `claude.py`, `__init__.py` | adapters → warehouse |
| `cli.py` | `cli.py` (dispatch + facade) + `cli_parsers.py` (argparse) + `cli_constants.py` (paths) | extract, not package |
**Why packages, not just more files:** Each `__init__.py` re-exports the full public surface, so existing importers (`from ai_agents_metrics import commands`, `from ai_agents_metrics.history.ingest import IngestSummary`) resolve unchanged. The reflective `globals().update(vars(cli))` shim in `scripts/metrics_cli.py` keeps working without edits. Tests that imported private helpers (`_encode_claude_cwd`, `_ensure_schema`, etc.) continue to work via the `__init__.py` re-exports.
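The re-export mechanism can be demonstrated in miniature — the package and helper names below are invented, and the package is built in a temp dir so the sketch is self-contained:

```python
import sys
import tempfile
from pathlib import Path

# Build a throwaway package to stand in for a post-split module.
pkg = Path(tempfile.mkdtemp()) / "commands_demo"
pkg.mkdir()
# The implementation moves into a sub-module...
(pkg / "history.py").write_text("def ingest_history():\n    return 'ok'\n")
# ...and __init__.py re-exports it, preserving the old flat import surface.
(pkg / "__init__.py").write_text("from .history import ingest_history\n")

sys.path.insert(0, str(pkg.parent))
from commands_demo import ingest_history  # resolves as if it were one file
```

Callers never see the internal file layout, which is what lets the split land as pure refactoring with no downstream edits.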
**Trade-offs:**
- File-level `git blame` loses continuity (mitigated by atomic move commits).
- The private-helper re-exports in `history/ingest/__init__.py` document a minor boundary leak; tracked for future migration.

**Why this works:** Direction-of-dependency is one-way inside each package (validated by `lint-imports`). No file now exceeds the pylint `max-module-lines = 1000` threshold. The only remaining `too-many-lines` suppressions were stale and have been dropped (ARCH-031).
## `--strict` globally (ARCH-030)

**Context:** Strict type-checking was partial: `[tool.mypy]` enabled a handful of individual flags (`check_untyped_defs`, `no_implicit_optional`, `disallow_incomplete_defs`), and ARCH-029 introduced a per-module override for `domain/*` and `history/*` using the explicit strict flag set.

**Decision:** Promote `strict = true` to the top-level `[tool.mypy]` section. All 65 source files (`src/` + `scripts/`) now pass `mypy --strict`.
**Trade-offs:**
- Three fixes were needed: `usage_backends.py:135` (`sqlite3.Cursor.fetchone()` is typeshed-typed `Any | None`), explicit `-> ModuleType` annotations on two bootstrap shim files, and `dict` → `dict[str, Any]` in one permission-audit helper.
- `strict = true` in a per-module override leaks `warn_return_any` into unrelated modules; the top-level `strict = true` is not affected.

**Why this works:** The codebase was already mostly strict-clean thanks to years of incremental typing. The cost of turning the screw the rest of the way was measured (3 fixes total) and locked in via the global config.