What this document is: The rules that govern how data flows through the warehouse and what each layer is allowed to contain. Named layers (raw_*, normalized_*, classified derived_*, aggregate derived_*) exist to keep interpretation out of raw storage, to let each layer be rebuilt from the one below without loss, and to make classifier rules versionable without touching primary data.
When to read this:
Related docs:
| Layer | Purpose | Built from | Can be rebuilt losslessly |
|---|---|---|---|
| 1. `raw_*` | Byte-perfect capture of source files | The filesystem | Yes — from source files |
| 2. `normalized_*` | Deterministic parsing of raw content | `raw_*` only | Yes — from `raw_*` |
| 3. Classified `derived_*` | Per-item interpretation (one row per normalized item) | `normalized_*` only | Yes — from `normalized_*` |
| 4. Aggregate `derived_*` | Rollups, cross-item relationships, goal/attempt shape | Classified + normalized | Yes — from lower layers |
Every layer can be dropped and rebuilt from the layers below it. This is a non-negotiable invariant: it is the only reason the pipeline is debuggable.
**Layer 1: `raw_*`**

Purpose: faithfully mirror what is on disk. Downstream layers may change behavior; raw must not.
Allowed in a raw row:
- The payload, byte for byte (one line per `raw_events` row)
- `source_path`, `file_size`, `mtime`, `sha256`, `provider` (inferred from `~/.codex` vs `~/.claude`), and directory-level identifiers Claude and Codex put directly into the path (`thread_uuid`, `session_uuid`)
- `ingested_at`, `ingest_run_id`

Forbidden in a raw row:
- Parsed content (`role`, `text`, timestamp fields — that is normalization)

Test of the invariant: deleting the raw layer and re-ingesting from the source files produces byte-identical rows (modulo `ingested_at` / `ingest_run_id`). If the new rows differ in any other column, something is doing interpretation and belongs in a later layer.
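The re-ingest test above can be sketched as a small fidelity check. This is illustrative, not the warehouse's actual API: the function name and the `(source_path, sha256)` row shape are assumptions; only the column names come from this doc.

```python
import hashlib
from pathlib import Path

def verify_raw_fidelity(rows, root: Path) -> list[str]:
    """Return source paths whose stored sha256 no longer matches the bytes
    on disk. `rows` is an iterable of (source_path, sha256) pairs as stored
    in the raw layer; any mismatch means re-ingesting would not reproduce
    the raw rows byte-for-byte, i.e. something upstream did interpretation."""
    mismatches = []
    for source_path, stored_sha in rows:
        actual = hashlib.sha256((root / source_path).read_bytes()).hexdigest()
        if actual != stored_sha:
            mismatches.append(source_path)
    return mismatches
```

An empty return list is the pass condition; anything else names the files whose raw rows have drifted from disk.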
**Layer 2: `normalized_*`**

Purpose: turn raw content into stable, typed, deduplicated rows that every later stage can rely on. Still no semantics — a human tagger, a regex classifier, and an LLM classifier must all see the same normalized rows.
Allowed in a normalized row:
- Typed fields parsed from the payload (`role`, `text`, `timestamp`, `model`, `message_id`, `session_id`, `thread_id`)
- Dedup decisions (the same `message_id` seen twice → keep one)
- `normalized_errors` (proposed)
- `parent_session_id` in `normalized_session_events`
- `normalized_at`

Forbidden in a normalized row:
- Rules over `text` that decide what kind of message it is
- Classification flags (`is_skill_template`, `is_retro_trigger`, `session_kind`, etc.)

Test of the invariant: rebuilding the normalized layer from the raw layer produces identical rows. If changing a regex config changes normalized output, the regex belongs in Layer 3.
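As an illustration of the dedup rule above (the same `message_id` seen twice → keep one), here is a minimal sketch. The tie-break by `source_path` is an assumption made only to keep the choice of survivor deterministic, so rebuilds from raw always produce identical normalized rows:

```python
def dedup_messages(rows):
    """Keep exactly one row per message_id. Sorting first makes the survivor
    deterministic (smallest (message_id, source_path) wins) regardless of the
    order rows arrive in, which is what the rebuild invariant requires."""
    seen = set()
    kept = []
    for row in sorted(rows, key=lambda r: (r["message_id"], r["source_path"])):
        if row["message_id"] in seen:
            continue  # duplicate message_id: drop deterministically
        seen.add(row["message_id"])
        kept.append(row)
    return kept
```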
**Layer 3: classified `derived_*`**

Purpose: per-item interpretation. One row per normalized item, annotated with what the item IS. This is where regex, heuristics, and LLM classification live.
This layer is new as of 2026-04-19. Previously, per-item interpretation was mixed into Layer 4 aggregates, which made it hard to re-classify without rebuilding aggregates and hard to compare two classifier versions against each other.
Allowed in a classified row:
- Classification labels (`session_kind`, `message_kind`, `practice_type`)
- Provenance (`parent_session_path`, `parent_source` = how we decided)
- Confidence levels (`explicit` | `overlap` | `template` | `fallback` | `similar`)
- `classifier_version` — freezes which rules produced this row
- `classified_at` — when the classification ran

Forbidden in a classified row:
Versioning rule: every row carries the exact classifier_version that produced it. Bumping the version does NOT retroactively rewrite older rows — it writes a new set. Queries that want “current” classification filter by the latest version; comparison queries keep access to history.
Test of the invariant: running the same classifier version over the same normalized input produces identical rows (idempotent). Two different versions produce two co-existing sets of rows for the same items.
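A minimal sqlite3 sketch of the versioning rule: two classifier versions co-exist as separate row sets for the same item, and "current" is a filter, not an overwrite. The table and column names follow this doc; selecting the latest version with `MAX()` assumes version strings sort lexicographically, which real code may need to replace.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE derived_session_kinds (
    session_path TEXT, kind TEXT, classifier_version TEXT,
    UNIQUE (session_path, classifier_version))""")

# Two classifier versions write two co-existing row sets for the same item;
# bumping the version never rewrites the older rows.
con.executemany(
    "INSERT INTO derived_session_kinds VALUES (?, ?, ?)",
    [("s1.jsonl", "unknown", "v1"), ("s1.jsonl", "main", "v2")])

# "Current" classification = filter by the latest version (lexicographic
# MAX is an assumption here), never an UPDATE over old rows.
current = con.execute(
    """SELECT kind FROM derived_session_kinds
       WHERE classifier_version = (SELECT MAX(classifier_version)
                                   FROM derived_session_kinds)""").fetchall()
```

Comparison queries simply drop the version filter and still see both row sets.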
Session-level and message-level classification live in separate tables. Verified on the Claude Code warehouse (2026-04-19, see docs/private/product-hypotheses/H-040.md):
- Session-level (one `.jsonl` file): `kind ∈ {main, subagent, unknown}`. Detectable deterministically from the file path — main sessions are UUID-named, subagent sessions are `agent-*.jsonl` or `subagents/agent-acompact-*.jsonl`. Parent-child links are extractable from Agent `tool_use` events in the parent with sub-second timestamp accuracy. No regex over content, no LLM, no heuristics needed.
- Message-level: `kind ∈ {human_authored, skill_template, context_inject, continuation_prompt, interruption_marker, unknown}`. Detected by matching Skill `tool_use` events in the parent assistant payload, plus a small set of stable regex patterns for context/continuation markers. Skill invocations DO NOT spawn a new file — they inject templates into the parent session's `role='user'` slot.

The two layers must not be collapsed into one table: they describe different kinds of items (files vs messages), they use different signals (filename vs content), and they have different confidence profiles (filename is 100% deterministic; message content needs regex and may need LLM fallback). The classifier-version axis is also independent — a new skill added to the tool does not change session classification.
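The session-level rules above are deterministic enough to sketch directly. `classify_session` is a hypothetical helper, not the pipeline's real code, and the exact UUID filename shape (8-4-4-4-12 lowercase hex) is an assumption; the regex here is just a compact way to test that shape, not content inspection.

```python
import re
from pathlib import PurePath

# Assumed shape of a UUID-named main-session file: 8-4-4-4-12 hex groups.
UUID_JSONL = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\.jsonl$")

def classify_session(path: str) -> str:
    """Session-level kind from the file path alone (no content inspection).
    agent-*.jsonl is matched both at top level and under subagents/."""
    name = PurePath(path).name
    if name.startswith("agent-") and name.endswith(".jsonl"):
        return "subagent"
    if UUID_JSONL.match(name):
        return "main"
    return "unknown"
```

Because the signal is purely the filename, this classifier is trivially idempotent: the same path always yields the same kind.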
**Layer 4: aggregate `derived_*`**

Purpose: rollups, cross-item relationships, and goal/attempt shape. This is the “metrics” layer and the main read path for reports.
Allowed in an aggregate row:
- `pipeline_version` — freezes which aggregate logic produced this row

Forbidden in an aggregate row:
- Coercing missing values into fake numbers (e.g. `None` tokens → 0). Prefer an explicit coverage field and propagate NULL when coverage is incomplete.

Test of the invariant: aggregates equal exactly what their source rows say. E.g. `SUM(derived_projects.total_tokens) == SUM(derived_session_usage.total_tokens)`. These equalities are existing tests; new aggregates must add similar equality tests.
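The equality test can be expressed as a self-contained sqlite3 sketch. The table names come from this doc, but their shapes are simplified and the project/token values are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE derived_session_usage "
            "(session_path TEXT, project TEXT, total_tokens INTEGER)")
con.execute("CREATE TABLE derived_projects (project TEXT, total_tokens INTEGER)")
con.executemany("INSERT INTO derived_session_usage VALUES (?, ?, ?)",
                [("s1", "p1", 100), ("s2", "p1", 50), ("s3", "p2", 25)])

# The aggregate is a pure rollup of its source rows, so the grand totals
# must match exactly; any drift means the aggregate did interpretation.
con.execute("""INSERT INTO derived_projects
               SELECT project, SUM(total_tokens)
               FROM derived_session_usage GROUP BY project""")

def check_token_conservation(con: sqlite3.Connection) -> bool:
    a = con.execute("SELECT SUM(total_tokens) FROM derived_projects").fetchone()[0]
    b = con.execute("SELECT SUM(total_tokens) FROM derived_session_usage").fetchone()[0]
    return a == b
```

Each new aggregate would ship with its own `check_*` equality against its source rows.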
These should be enforced by tests in tests/public/ and by the validation code that runs at the end of each pipeline stage. Every invariant below is expressed so it can be checked with a single SQL query or a deterministic rebuild.
**Raw ↔ filesystem**
- All `raw_events` rows from a given file carry the same `file_sha256`, and that value equals `sha256(file_contents_on_disk)`.
- Malformed content is still stored in `raw_events.payload`. Validity is enforced at the normalize boundary, not at ingest. Rejecting malformed content at raw would be interpretation.

**Normalized ↔ raw**
- Every `normalized_*` row carries enough identifiers (`source_path` + a stable per-item key, e.g. `message_id`, `session_id`) to locate its source line(s) in `raw_events`. Stated as a query: `SELECT COUNT(*) FROM normalized_messages nm LEFT JOIN raw_events re ON re.source_path = nm.source_path AND ... WHERE re.event_id IS NULL` must be zero.
- Rebuilds may differ only in `normalized_at` timestamps and any `ingest_run_id` back-references.

**Classified ↔ normalized**
- `derived_session_kinds.session_path ⊆ normalized_sessions.session_path` (no orphans).
- For each `(session_path, classifier_version)` pair there is exactly one row.
- `kind = 'main'` ⟺ `parent_session_path IS NULL` (enforced as a CHECK constraint on the table).

**Aggregate ↔ classified + normalized**
- `SUM(derived_projects.total_tokens) == SUM(derived_session_usage.total_tokens)` (verified on current warehouse; already an implicit test).
- `derived_goals.attempt_count == COUNT(derived_attempts WHERE thread_id = derived_goals.thread_id)` (verified on current warehouse; 0 mismatches across 160 goals).
- `derived_goals.main_attempt_count == COUNT(derived_session_kinds WHERE thread_id = derived_goals.thread_id AND kind = 'main' AND classifier_version = <current>)`.
- `derived_attempts.session_path ⊆ normalized_sessions.session_path`, `derived_goals.thread_id ⊆ normalized_threads.thread_id`, `derived_session_usage.session_path ⊆ normalized_sessions.session_path`. (All verified at 0 orphans on current warehouse.)

**Idempotency (across all layers)**
- The only columns allowed to differ between runs are `ingested_at`, `normalized_at`, `classified_at`, `derived_at`, and `ingest_run_id`. Any other column that changes between runs indicates non-determinism and is a bug.

Three version axes, one column per axis on the relevant layer:
| Column | Lives in | Changes when |
|---|---|---|
| `ingest_run_id` | raw | A new ingest pass starts (usually every run) |
| (implicit: schema version) | all layers | Table schema changes — bump via migration |
| `classifier_version` | classified | Classifier rules change (new template, new model) |
| `pipeline_version` | aggregate | Aggregate logic changes |
Rule: bumping a downstream version never requires upstream data to be rewritten. Layer 1 never cares about classifier versions; Layer 3 never cares about aggregate versions.
Rules that consume external config (regex files, pricing tables) embed the config hash in the version string — e.g. classifier_version = 'v1-skills-' + sha8(skill_templates.yaml). Identical inputs → identical version → identical outputs.
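A sketch of the config-hash rule, assuming `sha8` means the first 8 hex characters of the sha256 digest (an assumption; the doc only shows the pattern, and this helper is hypothetical):

```python
import hashlib

def classifier_version(base: str, config_bytes: bytes) -> str:
    """Embed the config hash in the version string so that identical config
    inputs always produce the identical version string, and therefore the
    same set of classified rows. sha8 = first 8 hex chars of sha256 (assumed)."""
    return f"{base}-{hashlib.sha256(config_bytes).hexdigest()[:8]}"
```

With this, editing `skill_templates.yaml` automatically produces a new version string, so the new classification run writes a fresh row set instead of silently overwriting the old one.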
The most common violations are “fields leaked upstream”:
- Aggregate fields in `normalized_*` — e.g. `normalized_threads.message_count`. These blur the line between Layer 2 and Layer 4; they are tolerable where cheap to compute and stable, but should be treated as a smell.
- Parsed fields in `raw_*` — these belong in Layer 2.

Known current violations of these rules (to be fixed as a separate refactor, not as part of new feature work) are tracked in `architecture/ARCH-018-layer-separation-cleanup.md`.
Checklist when a change spans layers:
- Bump `classifier_version` if new classification, `pipeline_version` if new aggregate.