What this document is: How tests are structured, what helpers are available, and how to write new tests correctly.
When to read this:
Related docs:
Tests are split into two styles: unit tests (direct import) and CLI integration tests (in-process by default, subprocess for coverage mode). The canonical entry points are make verify-fast and make verify (both are final-gate commands before committing — not intermediate diagnostic tools). Every mutating command should have three test buckets: happy path, invalid-state rejection, and summary consistency.
make verify-fast # lint + typecheck + tests (~55s) — final gate before commit
make verify # full suite incl. bandit, pylint, complexity (~1.5min) — before committing
make test # pytest only
make lint # ruff only
make typecheck # mypy only
python -m pytest tests/test_workflow_fsm.py -v # single file
Configuration in pyproject.toml:
pythonpath = ["src"] — the ai_agents_metrics package path is pre-configured; no need to set PYTHONPATH=src manually.
Before a commit:
make verify
Debugging a single test:
python -m pytest tests/test_metrics_cli.py::test_name -v -s
Running with subprocess coverage enabled:
CODEX_SUBPROCESS_COVERAGE=1 make test
| Test file | Covers |
|---|---|
| `test_metrics_cli.py` | CLI integration (in-process, subprocess for a few tests) |
| `test_metrics_domain.py` | Domain logic (unit) |
| `test_workflow_fsm.py` | State machine transitions |
| `test_history_{ingest,normalize,derive,compare,audit}.py` | Pipeline stages |
| `test_storage_roundtrip.py` | Event log I/O and replay |
| `test_{cost_audit,reporting,retro_timeline}.py` | Analysis and reporting |
| `test_{git_hooks,commit_message,public_boundary}.py` | Integrations |
| `test_observability.py` | Event store |
| `test_public_overlay.py` | Public/private sync |
| `test_claude_md.py` | Documentation generation |
conftest.py provides run_cli_inprocess() — an in-process CLI runner that calls main() directly with captured stdout/stderr and temporary os.chdir(). This eliminates Python startup overhead (~0.5s per subprocess call) and makes tests ~18x faster than the previous subprocess-based approach.
Tests use run_cli_inprocess() by default and fall back to real subprocess when CODEX_SUBPROCESS_COVERAGE=1 is set.
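For orientation, an in-process runner like this can be sketched as follows. This is a simplified illustration, not the real `run_cli_inprocess()` from `conftest.py`: `CliResult` and `run_cli_inprocess_sketch` are hypothetical names, and it assumes `main()` accepts an argv list and returns an exit code (or raises `SystemExit`).

```python
import contextlib
import io
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CliResult:
    returncode: int
    stdout: str
    stderr: str


def run_cli_inprocess_sketch(main, tmp_path: Path, *args: str) -> CliResult:
    """Call main() directly, capturing output and chdir-ing into tmp_path."""
    out, err = io.StringIO(), io.StringIO()
    prev_cwd = os.getcwd()
    code = 0
    try:
        os.chdir(tmp_path)
        with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
            try:
                result = main(list(args))
                code = result if isinstance(result, int) else 0
            except SystemExit as exc:  # argparse-style exits
                code = exc.code if isinstance(exc.code, int) else (0 if exc.code is None else 1)
    finally:
        os.chdir(prev_cwd)
    return CliResult(code, out.getvalue(), err.getvalue())
```

Because no interpreter is spawned, each invocation costs the body of `main()` and nothing else, which is where the ~18x speedup comes from.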
For domain logic, FSM, reporting, and other pure modules.
from ai_agents_metrics.workflow_fsm import classify_workflow_state, WorkflowState
def test_active_goal_detected() -> None:
    state = classify_workflow_state(
        active_goal_count=1,
        started_work_detected=True,
        git_available=True,
    )
    assert state == WorkflowState.ACTIVE_GOAL_EXISTS
@pytest.mark.parametrize is the standard pattern for FSM and validation tests:
@pytest.mark.parametrize(("input", "expected"), [
    ("success", True),
    ("fail", False),
])
def test_something(input: str, expected: bool) -> None:
    ...
For CLI commands. Use tmp_path as an isolated repo root. Tests call the CLI in-process by default via run_cli_inprocess() from conftest.py:
# Default: in-process call (fast, ~0.01s per invocation)
def run_cmd(tmp_path: Path, *args: str, extra_env=None) -> subprocess.CompletedProcess[str]:
    ...

# Subprocess: only for tests that need real process isolation
# (install-self, bootstrap wrapper, script shim, parallel lock)
def _run_cmd_subprocess(tmp_path: Path, *args: str, extra_env=None) -> subprocess.CompletedProcess[str]:
    ...

# Module entrypoint: tests python -m ai_agents_metrics (always subprocess)
def run_module_cmd(tmp_path: Path, *args: str, extra_env=None) -> subprocess.CompletedProcess[str]:
    ...
A per-test timeout of 5 seconds is enforced via pytest-timeout (pyproject.toml). Any new test exceeding this limit should use in-process execution or be investigated for unnecessary overhead.
End-to-end test pattern:
from pathlib import Path

from ai_agents_metrics.domain import load_metrics

def read_metrics(repo: Path) -> dict:
    return load_metrics(repo / "metrics" / "events.ndjson")

def test_start_and_finish(tmp_path: Path) -> None:
    result = run_module_cmd(tmp_path, "init")
    assert result.returncode == 0
    result = run_module_cmd(tmp_path, "start-task", "--title", "My task", "--task-type", "product")
    assert result.returncode == 0
    data = read_metrics(tmp_path)
    goals = data["goals"]
    assert len(goals) == 1
    assert goals[0]["status"] == "in_progress"
Do not read metrics/events.ndjson with json.loads directly. The file is NDJSON (one JSON object per line), not a single JSON document. Use load_metrics() or replay_events() to read it correctly.
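To make the distinction concrete, here is a minimal sketch of line-by-line NDJSON parsing; `read_ndjson` is a hypothetical helper for illustration, and real tests should prefer `load_metrics()` / `replay_events()`:

```python
import json
from pathlib import Path


def read_ndjson(path: Path) -> list[dict]:
    """Parse one JSON object per non-empty line; json.loads on the whole file would fail."""
    return [
        json.loads(line)
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
```

A two-event file parses into two dicts this way, while `json.loads(path.read_text())` raises `JSONDecodeError: Extra data` on the second line.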
test_metrics_domain.py defines factory functions with defaults and **overrides. They live locally in that file, not in conftest.py. Copy the pattern when needed:
def make_goal_dict(**overrides: object) -> dict[str, object]:
    values = {
        "goal_id": "goal-1",
        "title": "Goal",
        "goal_type": "product",
        "supersedes_goal_id": None,
        "status": "in_progress",
        "attempts": 0,
        "started_at": None,
        "finished_at": None,
        "cost_usd": None,
        "input_tokens": None,
        "cached_input_tokens": None,
        "output_tokens": None,
        "tokens_total": None,
        "failure_reason": None,
        "notes": None,
        "agent_name": None,
        "result_fit": None,
    }
    values.update(overrides)
    return values
# Usage — only override what matters for the test:
goal = make_goal_dict(status="fail", failure_reason="unclear_task", attempts=1)
The same pattern exists for make_goal_record, make_effective_goal_record, and make_attempt_entry_record — dataclass versions of the same objects.
Timestamps in dataclass factories:
`GoalRecord.started_at` / `finished_at` (and the equivalent fields in `AttemptEntryRecord` / `EffectiveGoalRecord`) are typed as `datetime | None`, not `str`. The dataclass factories automatically parse string values via an internal `_ts()` helper, so passing `started_at="2026-04-06T10:00:00+00:00"` to a factory is fine. However, constructing a dataclass directly (e.g. `GoalRecord(started_at="...")`) will produce a type error — use `datetime.fromisoformat(...)` or `parse_iso_datetime_flexible(...)` instead.
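The coercion pattern can be sketched like this. `DemoRecord` and `make_demo_record` are hypothetical stand-ins for the real record types and factories; only the `_ts()` idea is taken from the text above.

```python
from dataclasses import dataclass
from datetime import datetime


def _ts(value: "str | datetime | None") -> "datetime | None":
    """Coerce ISO-8601 strings to datetime; pass datetime and None through."""
    if isinstance(value, str):
        return datetime.fromisoformat(value)
    return value


@dataclass
class DemoRecord:  # hypothetical stand-in for GoalRecord
    started_at: "datetime | None" = None
    finished_at: "datetime | None" = None


def make_demo_record(**overrides: object) -> DemoRecord:
    # The factory accepts strings; the dataclass itself only accepts datetime.
    return DemoRecord(
        started_at=_ts(overrides.get("started_at")),
        finished_at=_ts(overrides.get("finished_at")),
    )
```

This keeps test bodies readable (plain ISO strings) without weakening the dataclass types.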
Tests for history_ingest / history_normalize / history_derive require creating SQLite databases with the correct schema.
create_codex_usage_sources(repo, ...) in test_metrics_cli.py creates:
- `codex_state.sqlite` with a `threads` table
- `codex_logs.sqlite` with a `logs` table

This is a test double for the real ~/.codex/state_5.sqlite and ~/.codex/logs_1.sqlite. Usage pattern:
def test_ingest(tmp_path: Path) -> None:
    state_path, logs_path = create_codex_usage_sources(
        tmp_path,
        thread_id="thread-abc",
        model="gpt-5",
        input_tokens=1000,
    )
    summary = ingest_codex_history(
        source_root=tmp_path,
        warehouse_path=tmp_path / "warehouse.sqlite",
    )
    assert summary.threads_ingested == 1
From AGENTS.md — for update, merge-tasks, and sync flows, cover the three buckets (happy path, invalid-state rejection, and the summary in the JSON staying consistent with goals/entries):

def test_update_happy_path(tmp_path): ...
def test_update_rejects_closed_goal(tmp_path): ...
def test_update_summary_stays_consistent(tmp_path): ...
Tests that invoke the CLI via subprocess are not covered by default.
To enable: CODEX_SUBPROCESS_COVERAGE=1 make test.
build_cmd and run_module_cmd automatically switch to coverage run --parallel-mode.
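The switch can be sketched as an env-var check when assembling the subprocess argv. `build_cmd_sketch` is an illustrative name, not the actual conftest helper:

```python
import os
import sys


def build_cmd_sketch(*cli_args: str) -> list[str]:
    """Build the subprocess argv, prefixing coverage when the env flag is set."""
    if os.environ.get("CODEX_SUBPROCESS_COVERAGE") == "1":
        # --parallel-mode writes .coverage.* data files that a later
        # `coverage combine` step merges into one report
        prefix = [sys.executable, "-m", "coverage", "run", "--parallel-mode", "-m"]
    else:
        prefix = [sys.executable, "-m"]
    return [*prefix, "ai_agents_metrics", *cli_args]
```

With the flag unset, the command is a plain `python -m ai_agents_metrics ...`; with it set, every subprocess invocation contributes coverage data.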
Some tests verify that show rejects invalid state by writing bad data directly into events.ndjson. These tests must write valid NDJSON events (one JSON object per line) — not raw JSON dicts or old-format payloads.
def test_invalid_goal_type_fails(repo: Path) -> None:
    events_path = repo / "metrics" / "events.ndjson"
    # Write a goal_started event with an invalid field value
    invalid_goal = {"goal_id": "goal-1", "goal_type": "invalid_type", ...}
    event = {"event_type": "goal_started", "ts": "2026-01-01T00:00:00+00:00",
             "goal": invalid_goal, "entries": []}
    events_path.write_text(json.dumps(event) + "\n", encoding="utf-8")
    result = run_cmd(repo, "show")
    assert result.returncode != 0
    assert "invalid_type" in result.stderr
Common mistake: writing a raw JSON dict without the expected event shape to events.ndjson. During replay, a line with no event_type field is silently skipped, so the file loads as empty state — the test passes when it should fail.
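A toy replay loop makes the failure mode visible. `replay_sketch` is hypothetical and only models the skip behavior described above, not the real `replay_events()`:

```python
import json


def replay_sketch(lines: "list[str]") -> "list[dict]":
    """Keep only lines that carry an event_type, mirroring the silent skip."""
    events = []
    for line in lines:
        obj = json.loads(line)
        if "event_type" in obj:
            events.append(obj)
    return events
```

A raw goal dict written straight to the file replays as zero events, so a test asserting "show rejects this state" sees a clean empty log and passes for the wrong reason.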
Reading the event log as JSON:
events.ndjson is NDJSON, not a single JSON document. json.loads(path.read_text()) will fail. Use load_metrics(path) or replay_events(path) instead.
PYTHONPATH in a worktree:
.venv is a symlink to the main repo. In a worktree, always use PYTHONPATH=src or run via make.
Tests mutating the real event log:
CLI commands in end-to-end tests must be run with cwd=tmp_path. This makes them resolve metrics/events.ndjson relative to tmp_path, not the actual repository.
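One way to guarantee this for in-process calls is a small context manager (a sketch of what the conftest runner does via temporary os.chdir(); `chdir_to` is an illustrative name):

```python
import contextlib
import os
from pathlib import Path


@contextlib.contextmanager
def chdir_to(path: Path):
    """Temporarily change the working directory, restoring it even on error."""
    prev = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(prev)
```

On Python 3.11+, `contextlib.chdir()` provides the same behavior out of the box. Either way, any relative `metrics/events.ndjson` lookup inside the block resolves under `tmp_path`, never under the real repository.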