ai-agents-metrics

Testing Guide

What this document is: How tests are structured, what helpers are available, and how to write new tests correctly.

When to read this:

Writing a new test or adding coverage to an existing module
Debugging a test failure and not sure why it is behaving unexpectedly
Setting up a new worktree or environment and need to run tests

Related docs:

architecture.md — what each module does and where it lives
data-schema.md — shape of the data used in test fixtures

Summary

Tests are split into two styles: unit tests (direct import) and CLI integration tests (in-process by default, subprocess for coverage mode). The canonical entry points are make verify-fast and make verify (both are final-gate commands before committing — not intermediate diagnostic tools). Every mutating command should have three test buckets: happy path, invalid-state rejection, and summary consistency.

Quick start

make verify-fast     # lint + typecheck + tests (~55s) — final gate before commit
make verify          # full suite incl. bandit, pylint, complexity (~1.5min) — before committing
make test            # pytest only
make lint            # ruff only
make typecheck       # mypy only
python -m pytest tests/workflow/test_workflow_fsm.py -v   # single file

Configuration in pyproject.toml:

pythonpath = ["src"] — package path is pre-configured; no need to set PYTHONPATH=src manually
Coverage: branch mode, parallel, source = ai_agents_metrics

Common workflows

Before a commit:

make verify

Debugging a single test:

python -m pytest tests/cli/test_metrics_cli.py::test_name -v -s

Running with subprocess coverage enabled:

CODEX_SUBPROCESS_COVERAGE=1 make test

Structure: one file per module, grouped by subject area

Tests live under tests/<area>/test_*.py. Pick the area that matches the module under test; if none fits cleanly, add a new subdir rather than dropping the file at the root.

Subdir	Test file	Covers
`cli/`	`test_metrics_cli.py`	CLI integration (in-process; subprocess for a few script-shim tests)
`domain/`	`test_metrics_domain{,_properties}.py`	Domain logic + hypothesis invariants
`history/`	`test_history_{ingest,normalize,normalize_properties,derive,classify,compare,audit,pipeline_json}.py`	Pipeline stages
`reporting/`	`test_{html_report,reporting,retro_timeline,show_json}.py`	Analysis and report rendering
`workflow/`	`test_workflow_fsm.py`, `test_git_state.py`, `test_commit_message.py`	State machine, git + hook integrations
`infra/`	`test_{public_boundary,public_overlay,security,storage_roundtrip,observability,cost_audit}.py`	Boundary rules, sync, event log I/O, observability
`strategies/`	`domain.py`, `history.py`	Hypothesis strategies shared across property tests
`tests/private/` (private root)	`test_git_hooks.py`, `test_claude_md.py`	Git hook behavior, doc generation

conftest.py

conftest.py exposes three shared surfaces every test area imports:

run_cli_inprocess() — in-process CLI runner that calls main() directly with captured stdout/stderr and a temporary os.chdir(). Eliminates Python startup overhead (~0.5s per subprocess call) and makes tests ~18x faster than a subprocess-based approach. Tests use it by default and fall back to real subprocess when CODEX_SUBPROCESS_COVERAGE=1 is set.
find_repo_paths() — returns (repo_root, scripts_dir, src_dir) by walking up to the first pyproject.toml with a [tool.codex_tests] section. Prefer it over Path(__file__).parents[N] so test paths stay valid when files move between subdirs. Cached with @lru_cache so it runs once per process.
_repo_template (session-scoped) + repo (function-scoped) — a git baseline built once and hardlinked per test with cp -rl. Replaces the per-file repo fixtures that used to spawn five git subprocesses per test. See the CLI integration section and decisions.md for the full rationale.
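The lookup behind find_repo_paths() can be pictured with a small sketch. This is an illustration, not the conftest.py implementation: the name find_repo_paths_sketch, the explicit start argument, the plain-text check for the section header, and the assumption that scripts/ and src/ sit directly under the repo root are all hypothetical.

```python
from __future__ import annotations

from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def find_repo_paths_sketch(start: Path) -> tuple[Path, Path, Path]:
    """Walk up from `start` to the first pyproject.toml with [tool.codex_tests]."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        pyproject = candidate / "pyproject.toml"
        if not pyproject.is_file():
            continue
        # Simplification: a plain substring check instead of real TOML parsing.
        if "[tool.codex_tests]" in pyproject.read_text(encoding="utf-8"):
            # Assumption: scripts/ and src/ live directly under the repo root.
            return candidate, candidate / "scripts", candidate / "src"
    raise FileNotFoundError(f"no pyproject.toml with [tool.codex_tests] above {start}")
```

Because the walk starts from the calling file's location rather than counting parents, a test file can move from tests/ to tests/cli/ without any path arithmetic changing.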

conftest.py also inserts every immediate subdir of tests/ into sys.path, so cross-test imports like from test_history_ingest import run_cmd keep working from any area.

Two test styles

1. Unit tests via direct import

For domain logic, FSM, reporting, and other pure modules.

from ai_agents_metrics.workflow_fsm import classify_workflow_state, WorkflowState

def test_active_goal_detected() -> None:
    state = classify_workflow_state(
        active_goal_count=1,
        started_work_detected=True,
        git_available=True,
    )
    assert state == WorkflowState.ACTIVE_GOAL_EXISTS

@pytest.mark.parametrize is the standard pattern for FSM and validation tests:

import pytest

@pytest.mark.parametrize(("outcome", "expected"), [
    ("success", True),
    ("fail", False),
])
def test_something(outcome: str, expected: bool) -> None:
    ...

2. CLI integration tests (in-process)

For CLI commands. Use tmp_path as an isolated repo root. Tests call the CLI in-process by default via run_cli_inprocess() from conftest.py:

# Default: in-process call (fast, ~0.01s per invocation)
def run_cmd(tmp_path: Path, *args: str, extra_env=None) -> subprocess.CompletedProcess[str]:
    ...

# Subprocess: only for tests that need real process isolation
# (install-self, bootstrap wrapper, script shim, parallel lock)
def _run_cmd_subprocess(tmp_path: Path, *args: str, extra_env=None) -> subprocess.CompletedProcess[str]:
    ...

# Module entrypoint: tests python -m ai_agents_metrics (always subprocess)
def run_module_cmd(tmp_path: Path, *args: str, extra_env=None) -> subprocess.CompletedProcess[str]:
    ...
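A minimal sketch of what the in-process path might look like, assuming the CLI exposes a main(argv) callable that returns an exit code. The name run_cli_inprocess_sketch and the main signature are assumptions for illustration; the real helper lives in conftest.py and may differ.

```python
from __future__ import annotations

import contextlib
import io
import os
import subprocess
from collections.abc import Callable, Sequence
from pathlib import Path


def run_cli_inprocess_sketch(
    main: Callable[[Sequence[str]], int | None],
    cwd: Path,
    *args: str,
) -> subprocess.CompletedProcess[str]:
    """Call `main` in this process and package the result like subprocess.run."""
    stdout, stderr = io.StringIO(), io.StringIO()
    previous_cwd = Path.cwd()
    os.chdir(cwd)  # the CLI resolves relative paths against the test repo
    try:
        with contextlib.redirect_stdout(stdout), contextlib.redirect_stderr(stderr):
            try:
                returncode = main(list(args)) or 0
            except SystemExit as exc:
                # argparse-style CLIs exit via SystemExit; map it to a return code.
                code = exc.code
                returncode = code if isinstance(code, int) else (0 if code is None else 1)
    finally:
        os.chdir(previous_cwd)  # always restore, even when main() raises
    return subprocess.CompletedProcess(
        list(args), returncode, stdout.getvalue(), stderr.getvalue()
    )
```

Returning a CompletedProcess keeps the in-process and subprocess paths interchangeable, so assertions on returncode, stdout, and stderr do not care which one ran.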

A per-test timeout of 5 seconds is enforced via pytest-timeout (pyproject.toml). Any new test exceeding this limit should use in-process execution or be investigated for unnecessary overhead. Tests that legitimately spawn multiple real Python subprocesses (not the in-process fast path) are rare — when one is unavoidable, override explicitly at the call site with @pytest.mark.timeout(15) rather than bumping the repo-wide default. Masking the 5s gate globally would also hide real regressions elsewhere.

The repo fixture (shared across all test subdirs): tests/conftest.py ships a session-scoped _repo_template (git-initialized repo with src/, scripts/, pricing/, and a baseline commit) that the function-scoped repo fixture hardlinks into each test's tmp_path via cp -rl. The same fixture serves cli/, history/, and any future subdir. Do not redefine repo locally: the five local copies that used to spawn git init + two git config + git add + git commit per test were the dominant xdist flake source on 1-CPU runners and were removed in PR #49 (see decisions.md).

Template files are chmod a-w after build, so any write_text() on a template-originated path (src/**, scripts/metrics_cli.py, pricing/model_pricing.json, .git/**) will raise PermissionError — this is intentional: overwriting a hardlinked file would mutate the shared inode and poison the template for every subsequent test. Create new files under src/ (e.g. worktree_change.py) or new top-level paths instead. If a test needs a variant repo state, extend via a sibling fixture that takes repo as input rather than duplicating the template build.
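The shared-inode concern can be seen in a small standalone sketch. It uses shutil.copytree with os.link as a Python stand-in for cp -rl; the helper name hardlink_clone and the file names are illustrative, and the read-only chmod step is left out.

```python
import os
import shutil
import tempfile
from pathlib import Path


def hardlink_clone(template: Path, destination: Path) -> None:
    """Recreate the directory tree, hardlinking every file instead of copying it."""
    shutil.copytree(template, destination, copy_function=os.link)


with tempfile.TemporaryDirectory() as tmp:
    template = Path(tmp) / "template"
    (template / "src").mkdir(parents=True)
    (template / "src" / "module.py").write_text("VALUE = 1\n", encoding="utf-8")

    clone = Path(tmp) / "clone"
    hardlink_clone(template, clone)

    # Both paths point at the same inode, so an in-place write through the
    # clone would also change the template's copy.
    original = template / "src" / "module.py"
    linked = clone / "src" / "module.py"
    shares_inode = original.stat().st_ino == linked.stat().st_ino

    # A new file only adds a directory entry in the clone, so it is safe.
    (clone / "src" / "worktree_change.py").write_text("NEW = True\n", encoding="utf-8")
    template_untouched = not (template / "src" / "worktree_change.py").exists()
```

This is why creating new files is the recommended way to introduce worktree changes: directories are real copies, while file contents are shared.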

End-to-end test pattern:

from pathlib import Path

from ai_agents_metrics.domain import load_metrics

def read_metrics(repo: Path) -> dict:
    return load_metrics(repo / "metrics" / "events.ndjson")

def test_start_and_finish(tmp_path: Path) -> None:
    result = run_module_cmd(tmp_path, "init")
    assert result.returncode == 0

    result = run_module_cmd(tmp_path, "start-task", "--title", "My task", "--task-type", "product")
    assert result.returncode == 0

    data = read_metrics(tmp_path)
    goals = data["goals"]
    assert len(goals) == 1
    assert goals[0]["status"] == "in_progress"

Do not read metrics/events.ndjson with json.loads directly. The file is NDJSON (one JSON object per line), not a single JSON document. Use load_metrics() or replay_events() to read it correctly.
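To see why a whole-file json.loads does not work, here is a standalone sketch with two made-up events. The goal_updated event type and the field values are illustrative only. In real tests, load_metrics() or replay_events() remain the right tools, since they also rebuild state from the events.

```python
import json

# Two events, one JSON object per line (illustrative payloads).
ndjson_text = (
    '{"event_type": "goal_started", "goal": {"goal_id": "goal-1"}}\n'
    '{"event_type": "goal_updated", "goal": {"goal_id": "goal-1"}}\n'
)

# Parsing the whole file as one document stops at the second object.
try:
    json.loads(ndjson_text)
    whole_file_parse_failed = False
except json.JSONDecodeError:
    whole_file_parse_failed = True

# Parsing line by line yields one dict per event.
events = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]
```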

Object factories

domain/test_metrics_domain.py defines factory functions with defaults and **overrides. They live locally in that file, not in conftest.py. Copy the pattern when needed:

def make_goal_dict(**overrides: object) -> dict[str, object]:
    values = {
        "goal_id": "goal-1",
        "title": "Goal",
        "goal_type": "product",
        "supersedes_goal_id": None,
        "status": "in_progress",
        "attempts": 0,
        "started_at": None,
        "finished_at": None,
        "cost_usd": None,
        "input_tokens": None,
        "cached_input_tokens": None,
        "output_tokens": None,
        "tokens_total": None,
        "failure_reason": None,
        "notes": None,
        "agent_name": None,
        "result_fit": None,
    }
    values.update(overrides)
    return values

# Usage — only override what matters for the test:
goal = make_goal_dict(status="fail", failure_reason="unclear_task", attempts=1)

The same pattern exists for make_goal_record, make_effective_goal_record, and make_attempt_entry_record — dataclass versions of the same objects.

Timestamps in dataclass factories: GoalRecord.started_at / finished_at (and the equivalent fields in AttemptEntryRecord / EffectiveGoalRecord) are typed as datetime | None, not str. The dataclass factories automatically parse string values via an internal _ts() helper, so passing started_at="2026-04-06T10:00:00+00:00" to a factory is fine. However, constructing a dataclass directly (e.g. GoalRecord(started_at="...")) will produce a type error — use datetime.fromisoformat(...) or parse_iso_datetime_flexible(...) instead.
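The split between factory and direct construction can be sketched with a cut-down stand-in. GoalRecordSketch and make_goal_record_sketch are hypothetical and carry only three fields. The _ts helper here mirrors the described behavior but is not the project's code.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class GoalRecordSketch:
    goal_id: str
    started_at: datetime | None = None
    finished_at: datetime | None = None


def _ts(value: str | datetime | None) -> datetime | None:
    """Parse ISO 8601 strings; pass datetimes and None through unchanged."""
    if isinstance(value, str):
        return datetime.fromisoformat(value)
    return value


def make_goal_record_sketch(**overrides: Any) -> GoalRecordSketch:
    values: dict[str, Any] = {"goal_id": "goal-1", "started_at": None, "finished_at": None}
    values.update(overrides)
    # The factory normalizes timestamps, so tests may pass plain strings.
    for field_name in ("started_at", "finished_at"):
        values[field_name] = _ts(values[field_name])
    return GoalRecordSketch(**values)


# Factory: a string is fine because it goes through _ts().
via_factory = make_goal_record_sketch(started_at="2026-04-06T10:00:00+00:00")

# Direct construction: parse first so the field really holds a datetime.
direct = GoalRecordSketch(
    goal_id="goal-2",
    started_at=datetime.fromisoformat("2026-04-06T10:00:00+00:00"),
)
```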

Testing with SQLite (history pipeline)

Tests for history_ingest / history_normalize / history_derive require creating SQLite databases with the correct schema.

create_codex_usage_sources(repo, ...) in cli/test_metrics_cli.py creates:

codex_state.sqlite with a threads table
codex_logs.sqlite with a logs table

This is a test double for the real ~/.codex/state_5.sqlite and ~/.codex/logs_1.sqlite. Usage pattern:

def test_ingest(tmp_path: Path) -> None:
    state_path, logs_path = create_codex_usage_sources(
        tmp_path,
        thread_id="thread-abc",
        model="gpt-5",
        input_tokens=1000,
    )
    summary = ingest_codex_history(
        source_root=tmp_path,
        warehouse_path=tmp_path / "warehouse.sqlite",
    )
    assert summary.threads_ingested == 1
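For a new pipeline test that needs its own database double, the general shape is a small builder that creates the table and inserts one row per keyword argument. The sketch below is hypothetical: the helper name and the three columns are placeholders, and the real threads schema is whatever create_codex_usage_sources() builds.

```python
import sqlite3
from pathlib import Path


def create_threads_db_sketch(
    path: Path, *, thread_id: str, model: str, input_tokens: int
) -> Path:
    """Create a one-row SQLite file with a threads table (placeholder columns)."""
    connection = sqlite3.connect(path)
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "CREATE TABLE threads ("
                "thread_id TEXT PRIMARY KEY, model TEXT, input_tokens INTEGER)"
            )
            connection.execute(
                "INSERT INTO threads (thread_id, model, input_tokens) VALUES (?, ?, ?)",
                (thread_id, model, input_tokens),
            )
    finally:
        connection.close()  # release the file handle before the test reads it
    return path
```

Keyword-only arguments keep call sites readable and let each test state only the values it asserts on.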

Three required buckets for mutating commands

From AGENTS.md — for update, merge-tasks, and sync flows:

Happy path — successful execution; verify file state after the call
Invalid-state rejection — command must exit with a non-zero return code
Summary/report consistency — after mutation, summary in the JSON is consistent with goals/entries

def test_update_happy_path(tmp_path): ...
def test_update_rejects_closed_goal(tmp_path): ...
def test_update_summary_stays_consistent(tmp_path): ...
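The third bucket usually comes down to recomputing the summary from the goals and comparing it with what the command reported. The sketch below assumes a hypothetical data shape (a summary with total_goals and by_status keys); the real field names are described in data-schema.md.

```python
from collections import Counter
from typing import Any


def summarize_goals(goals: list[dict[str, Any]]) -> dict[str, Any]:
    """Recompute a summary from scratch (hypothetical summary shape)."""
    counts = Counter(goal["status"] for goal in goals)
    return {"total_goals": len(goals), "by_status": dict(counts)}


def summary_is_consistent(data: dict[str, Any]) -> bool:
    """True when the stored summary matches one derived from the goals."""
    return data["summary"] == summarize_goals(data["goals"])


goals = [
    {"goal_id": "goal-1", "status": "success"},
    {"goal_id": "goal-2", "status": "in_progress"},
]
consistent = {"goals": goals, "summary": summarize_goals(goals)}
# A stale summary, as if a mutation updated goals but not the totals.
stale = {"goals": goals, "summary": {"total_goals": 1, "by_status": {"success": 1}}}
```

Deriving the expected summary independently, rather than hard-coding numbers, keeps the check valid when the fixture data changes.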

Coverage with subprocess

Code executed by tests that invoke the CLI via subprocess is not measured by coverage by default. To enable it, run CODEX_SUBPROCESS_COVERAGE=1 make test; build_cmd and run_module_cmd then switch automatically to coverage run --parallel-mode.
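The switch can be pictured as a command builder that checks the environment variable. This is a sketch under assumptions: build_module_cmd_sketch and its env parameter are hypothetical, and the project's build_cmd may assemble the command differently.

```python
from __future__ import annotations

import os
import sys
from collections.abc import Mapping


def build_module_cmd_sketch(
    *args: str, env: Mapping[str, str] | None = None
) -> list[str]:
    """Build the argv for `python -m ai_agents_metrics`, optionally under coverage."""
    env = os.environ if env is None else env
    module_args = ["-m", "ai_agents_metrics", *args]
    if env.get("CODEX_SUBPROCESS_COVERAGE") == "1":
        # --parallel-mode writes one data file per process, so concurrent
        # subprocesses do not overwrite each other's results.
        return [sys.executable, "-m", "coverage", "run", "--parallel-mode", *module_args]
    return [sys.executable, *module_args]
```

Passing env explicitly makes the builder itself easy to unit test without mutating os.environ.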

Inject-corrupt-data tests

Some tests verify that show rejects invalid state by writing bad data directly into events.ndjson. These tests must write valid NDJSON events (one JSON object per line) — not raw JSON dicts or old-format payloads.

def test_invalid_goal_type_fails(repo: Path) -> None:
    events_path = repo / "metrics" / "events.ndjson"
    # Write a goal_started event with an invalid field value
    # (remaining goal fields elided here)
    invalid_goal = {"goal_id": "goal-1", "goal_type": "invalid_type"}
    event = {"event_type": "goal_started", "ts": "2026-01-01T00:00:00+00:00",
             "goal": invalid_goal, "entries": []}
    events_path.write_text(json.dumps(event) + "\n", encoding="utf-8")

    result = run_cmd(repo, "show")
    assert result.returncode != 0
    assert "invalid_type" in result.stderr

Common mistake: writing a raw JSON dict without the expected event shape to events.ndjson. During replay, a line with no event_type field is silently skipped, so the file loads as empty state — the test passes when it should fail.
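The silent-skip behavior is easy to reproduce with a toy replay loop. replay_sketch is a hypothetical stand-in for the project's replay logic and handles only goal_started. It exists to show why a raw dict produces empty state.

```python
import json
from typing import Any


def replay_sketch(ndjson_text: str) -> dict[str, list[dict[str, Any]]]:
    """Toy replay: collect goals from goal_started events, skip anything else."""
    state: dict[str, list[dict[str, Any]]] = {"goals": []}
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if "event_type" not in event:
            continue  # no event_type: the line is silently ignored
        if event["event_type"] == "goal_started":
            state["goals"].append(event["goal"])
    return state


bad_goal = {"goal_id": "goal-1", "goal_type": "invalid_type"}

# Mistake: the raw goal dict is written without the event wrapper.
raw_line = json.dumps(bad_goal) + "\n"

# Correct: the goal is wrapped in an event with event_type set.
event_line = json.dumps({"event_type": "goal_started", "goal": bad_goal}) + "\n"

state_from_raw = replay_sketch(raw_line)
state_from_event = replay_sketch(event_line)
```

Only the wrapped form puts the invalid goal in front of validation; the raw form never reaches it.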

Common pitfalls

Reading the event log as JSON: events.ndjson is NDJSON, not a single JSON document. json.loads(path.read_text()) raises json.JSONDecodeError once the file holds more than one event. Use load_metrics(path) or replay_events(path) instead.

PYTHONPATH in a worktree: .venv is a symlink to the main repo. In a worktree, always use PYTHONPATH=src or run via make.

Test mutating the real event log: CLI commands in end-to-end tests must be run with cwd=tmp_path. This makes them resolve metrics/events.ndjson relative to tmp_path, not the actual repository.
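The effect of cwd on relative paths can be checked in isolation. The sketch below runs a tiny inline script, standing in for a CLI command, and confirms that metrics/events.ndjson lands under the temporary directory. The script is illustrative and not part of the project.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for a CLI command that writes to a path relative to its cwd.
SCRIPT = (
    "from pathlib import Path\n"
    "Path('metrics').mkdir(exist_ok=True)\n"
    "Path('metrics/events.ndjson').write_text('', encoding='utf-8')\n"
)

with tempfile.TemporaryDirectory() as tmp:
    result = subprocess.run(
        [sys.executable, "-c", SCRIPT],
        cwd=tmp,  # relative paths in the child resolve against this directory
        capture_output=True,
        text=True,
        check=False,
    )
    wrote_inside_tmp = (Path(tmp) / "metrics" / "events.ndjson").exists()
```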
