
Canonical source

This page mirrors spec.md at the repository root. Edit that file, not this one.

Truth Atlas — Technical Specification

| Field | Value |
|---|---|
| Version | 5.1.0 |
| Last Updated | April 2026 |
| Status | Active |
| Related | Architecture Reference, Crawl4AI Reference, Brainstorms & Actionables, Architecture Reference Update Requirements, Summary of Adopted Changes, Revision Suggestions |
| Supersedes | Spec v5.0.0 |

This spec is designed to be readable standalone. The Architecture Reference and Update Requirements documents remain the long-form rationale and history; everything a contributor needs to implement the current system is in this file.

v5.1.0 highlights: (1) Schema trimmed to 8 structural tables plus stars: authors / authorship / anchors dropped (author kept as a raw string on documents), history / notes / pipeline_progress / meta deferred to V2 or retired, and decorative/display-only columns removed from documents (21 columns, down from ~35). stars is retained as the one V1 user-layer table. (2) A single enrichment_model_version replaces 14 per-field *_source provenance columns. (3) Meilisearch is replaced by SQLite FTS5 — one fewer process, same query shape. (4) Pydantic (with SQLModel + Outlines / instructor) is adopted at every boundary: LLM I/O, config, HTTP API, and CrawlResult → row normalization. (5) The local enrichment model shifts from Llama-3-8B-class to Qwen 3 4B Instruct (strongest open-weight 3–4B-class model for structured extraction as of early 2026).

v5.0.0 highlights: (1) Discrete clustering on diffusion coordinates is removed; the atlas is now a continuous manifold colored by eigenvectors, with TF-IDF regional labels replacing cluster labels. (2) Route planning is elaborated with greedy diversification, persisted adjacency, and a full UI flow (entry, overlay, comparison sidebar, Google-Maps-style breadcrumb).


1. What Truth Atlas Is

Truth Atlas is a personal tool for reading the web the way you'd read an atlas or Google Maps: start somewhere, see where you are, and decide where to go next. You give it a handful of seed URLs on a topic you care about. It crawls outward from those seeds, reads each page, and records what the page is about, what it argues, who it links to, and how good it looks. The result is a database of digested web pages. On top of that database sit two views: a map that shows the whole collection at a glance, and a depth-first browser that walks through it one document at a time, following the trail of ideas rather than the tree of links. There is no curated taxonomy, no hand-picked ontology, and no separate graph engine. Everything the user sees is computed from the database, and every structural claim the system makes — neighborhoods, regions, reading paths — is derived from the data itself.


2. Theoretical Foundation

Three ideas shape the design:

  • Diffusion maps for geometry. Distance between two documents is diffusion distance: how quickly a random walker on the weighted edge graph gets from A to B. Densely connected regions collapse; semantically similar but isolated pages stay far apart. The diffusion coordinates are the structure — no discrete partition is imposed on top. Density in diffusion space is visualized continuously, the way terrain is visualized on a topographic map.
  • Custom traversal engine for dynamics. At each step of a reading session, the engine collects candidate next documents, scores them by marginal information gain × epistemic quality, and returns the top few. A plain deterministic Python function — no learned model, no imposed ontology.
  • Information foraging for interaction. Users maximize information gain per unit click; the browser shows a small number of candidates with short inline reasons and dims (rather than hides) already-seen content.

Category theory, hypergraph formalism, and flow matching were considered and dropped — they added vocabulary without adding capability for a reading-path use case.


3. Scope and Non-Goals

Truth Atlas is platform-agnostic: the same codebase runs on a headless workstation or a GPU server. All hardware-dependent values (concurrency, batch sizes, memory cap, GPU vs. CPU eigensolver) are driven from config, never hard-coded. Target corpus scale is 10–25 million documents per topic-focused session, not a full-web index. The system is read-heavy once a corpus is built, so SQLite in WAL mode is the primary store. Full-text search runs inside SQLite via FTS5 — no separate Meilisearch process.

Non-goals: no imposed taxonomy tree, no imposed discrete clustering, no generic hyperedge store, no first-class offline-dump processing pipeline, no distributed crawling, no multi-user or real-time collaboration. Marginalia and OpenAlex remain optional seed sources, not structural dependencies. User-layer features are limited to starring in V1; history logs and inline notes are deferred to V2.


4. High-Level Workflow

Two steps, not interleaved:

  1. Fill the database. Given seed URLs and a crawl policy, crawl outward, store markdown, run a local LLM to fill structured fields, and write rows to SQLite.
  2. Render the UI from the current database. The traversal engine and diffusion mapping, along with Qdrant, operate against the database; the frontend then reads the database and presents the atlas map and the depth-first browser. User writes are limited to stars in V1.

Diffusion-map computation, semantic-similarity edge generation, regional labeling, passage and claim extraction, and FTS5 index population are all caches or derivations over the database. They are rebuilt on demand or on schedule and are fully idempotent — pause, resume, or re-run is always safe. The FTS5 virtual table lives in the same SQLite file as the source rows, which keeps the index trivially consistent and eliminates a cross-process sync step.
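The FTS5 arrangement can be sketched with the stdlib sqlite3 module alone (assuming the interpreter's bundled SQLite was compiled with FTS5, which is the common case). Table and column names here are illustrative, not the spec's exact FTS schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A plain FTS5 virtual table living in the same database file as the
# source rows; doc_id is stored but not tokenized (UNINDEXED).
conn.executescript("""
CREATE VIRTUAL TABLE doc_fts USING fts5(doc_id UNINDEXED, title, summary);
INSERT INTO doc_fts VALUES ('a1', 'Diffusion maps', 'Spectral embedding of graphs');
INSERT INTO doc_fts VALUES ('b2', 'Crawling at scale', 'Politeness and rate limits');
""")
# MATCH searches all indexed columns; rank orders best hits first.
rows = conn.execute(
    "SELECT doc_id FROM doc_fts WHERE doc_fts MATCH ? ORDER BY rank",
    ("spectral",),
).fetchall()
print(rows)  # [('a1',)]
```

Because the index and the source rows share one file, rebuilding the FTS table is a single transaction with no cross-process sync.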


5. Crawling

Crawling is delegated entirely to crawl4ai. The previous custom AsyncFetcher, ExtractPool, URL queue, Bloom filter, and relations.extract are removed.

5.1 Per-Session Parameters

| Parameter | Format / Purpose |
|---|---|
| Seed URLs | Plain text file (seeds.txt) with one URL per line, or JSON array of {"url": "...", "priority": float, "tag": "..."}. Lines beginning with # are comments. |
| Strategy schedule | Ordered list of phases, each {"mode": "bfs"\|"dfs", "depth": int, "max_pages": int}. The crawler runs phase 1 from the seed frontier, then phase 2 from the frontier left by phase 1, and so on. Default: [{"mode":"bfs","depth":1,"max_pages":500}, {"mode":"dfs","depth":5,"max_pages":5000}]. |
| Filter chain | ContentTypeFilter (text/html only), DomainFilter (local blacklist + optional per-session allowlist), URLPatternFilter (exclude */tag/*, */page/*, */login*, etc.), optional SEOFilter, optional ContentRelevanceFilter (BM25 against seed query). |
| Scorer | CompositeScorer combining keyword relevance to seed query (weight 0.3), domain authority from prior sessions' mean quality (0.5), path-depth normalization (0.2). |
| Concurrency | MemoryAdaptiveDispatcher with memory_threshold_percent, max_session_permit, and per-domain rate limits with exponential backoff on 429/503. All four values come from config, defaulted per machine profile. |
| Fetcher mode | Per-domain choice between headless=False + text_mode=True (lightweight HTML-only, ~5× faster) and full headless Chromium (needed for JS-rendered sites). See §5.3. |
| Checkpointing | crawl4ai's on_state_change callback flushes the visited set, pending frontier, depth map, and per-phase progress to checkpoints/crawl_<session_id>.json every N results. See §5.4. |

5.2 Writer

For each CrawlResult the writer persists: final URL (post-redirect, canonicalized), raw markdown, content-filtered markdown (when a content filter is configured), internal and external link targets with anchor text, and HTTP status. Raw HTML is discarded after extraction. Writes go through a single writer process; document rows are upserted by doc_id = SHA256(canonical_url)[:16] (see §7.4), so re-crawl of a URL updates the existing row rather than inserting a duplicate.
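The deterministic-ID upsert can be sketched with stdlib sqlite3 (assuming, per §7.4, that the 16-character prefix is taken over the hex digest; the three-column table is a toy stand-in for the real documents schema):

```python
import hashlib
import sqlite3

def doc_id(canonical_url: str) -> str:
    # First 16 hex chars of SHA-256 of the canonical URL (§7.4).
    return hashlib.sha256(canonical_url.encode()).hexdigest()[:16]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id TEXT PRIMARY KEY, url TEXT, full_text TEXT)")

def upsert(url: str, text: str) -> None:
    # Same URL -> same doc_id -> the re-crawl updates the existing row.
    conn.execute(
        "INSERT INTO documents (doc_id, url, full_text) VALUES (?, ?, ?) "
        "ON CONFLICT(doc_id) DO UPDATE SET full_text = excluded.full_text",
        (doc_id(url), url, text),
    )

upsert("https://example.com/a", "first crawl")
upsert("https://example.com/a", "re-crawl")  # updates, does not duplicate
rows = conn.execute("SELECT full_text FROM documents").fetchall()
print(rows)  # [('re-crawl',)]
```

ON CONFLICT ... DO UPDATE is SQLite's native upsert, so the single writer process needs no read-before-write check.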

CrawlResult is Crawl4AI's schema, not ours. A Pydantic boundary model DocumentRecord takes a CrawlResult plus context (domain_id, discovered_at, session_id) and produces the exact row shape for the documents table. Validation happens once, at the boundary, before any DB write. See §7.8 for the model layout.

5.3 Static vs. Headless Routing

To maximize throughput, every domain is classified on first contact:

  1. Issue a lightweight GET (no browser) and keep the response if it contains the expected main content.
  2. If the response looks empty or heavily JS-dependent (body text < 200 chars, or a known JS-framework signature), mark the domain render_mode='headless' in domains.

Thereafter the crawler picks the per-domain mode from the domains table. The classification can be overridden per session via config (crawl.render_mode_override: {"arxiv.org": "static"}).
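The classification heuristic might look like the following sketch; the specific framework signatures are illustrative assumptions, not a list the spec defines:

```python
# Hypothetical markers of JS-rendered pages; a real list would be tuned.
JS_SIGNATURES = ("__NEXT_DATA__", "data-reactroot", 'id="root"', "ng-app")

def classify_render_mode(body_text: str, raw_html: str) -> str:
    """§5.3 rule of thumb: a near-empty body (< 200 chars) or a known
    JS-framework signature means the domain needs headless rendering;
    everything else gets the fast static fetcher."""
    if len(body_text.strip()) < 200:
        return "headless"
    if any(sig in raw_html for sig in JS_SIGNATURES):
        return "headless"
    return "static"

print(classify_render_mode("x" * 500, "<html><body>plain article</body></html>"))  # static
print(classify_render_mode("", '<div id="root"></div>'))  # headless
```

The result is written once to domains.render_mode and read on every subsequent fetch for that domain.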

5.4 Pause and Resume

A session is a named run identified by session_id. Crawl state is two things: (a) the crawl4ai strategy state (visited, pending/stack, per-URL depth, phase index), and (b) whatever is already committed to SQLite. SQLite writes are transactional; crawl4ai state is written atomically (write-to-temp, rename). SIGINT triggers a clean flush-and-exit. Resuming a session reloads the JSON state, re-opens SQLite, and continues from the frontier. The phase scheduler respects the phase index so a multi-phase session picks up mid-phase.
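The write-to-temp, rename pattern is a few lines of stdlib code; the state dict's keys below are illustrative:

```python
import json
import os
import tempfile

def write_checkpoint(path: str, state: dict) -> None:
    # Write to a temp file in the same directory, fsync, then atomically
    # rename over the target: a crash mid-write never leaves a truncated
    # checkpoint behind.
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic on POSIX and Windows
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise

state = {"visited": ["https://a", "https://b"], "phase": 1, "frontier": ["https://c"]}
write_checkpoint("crawl_demo.json", state)
print(json.load(open("crawl_demo.json")) == state)  # True
```

The temp file must live in the same directory as the target, because os.replace is only atomic within a filesystem.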


6. LLM Enrichment

A local LLM runs as a separate consumer process that reads documents whose enrichment fields are NULL and writes them back. Running the LLM inline inside the crawl loop is explicitly rejected: it serializes the pipeline behind LLM throughput. The decoupled consumer lets the crawler saturate its own bottleneck independently.

The default enrichment model is Qwen 3 4B Instruct — as of early 2026, the strongest open-weight model in the 3–4B range for structured extraction. The previous 8B-class default is retained as a config option for environments where the quality gap matters more than throughput.

The consumer is idempotent and progressive:

  • Only NULL fields are touched; no partial writes are ever committed.
  • Work is batched in chunks of N documents; each chunk is one SQLite transaction, committed atomically.
  • Resume state is implicit in the data: SELECT doc_id FROM documents WHERE editorial_verdict IS NULL ORDER BY discovered_at LIMIT N picks up where the last batch stopped. No separate pipeline_progress table is needed.
  • Model upgrades are signalled by bumping the enrichment_model_version integer on the consumer. The consumer can re-run over all documents with WHERE enrichment_model_version < current to upgrade in place without corrupting earlier work. This single integer replaces the per-field *_source provenance columns of v5.0 — one row-level "which pass wrote this" signal covers every enriched column at once.
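The consumer loop can be sketched end to end with stdlib sqlite3; fake_enrich stands in for the structured LLM call, and the five-column table is a toy slice of the real documents schema:

```python
import sqlite3

CURRENT_MODEL_VERSION = 2  # bumped when the enrichment model is upgraded
BATCH = 3

def fake_enrich(doc_id: str) -> dict:
    # Stand-in for the §6.1 structured LLM call.
    return {"editorial_verdict": "accept", "quality_score": 70}

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE documents (
    doc_id TEXT PRIMARY KEY, editorial_verdict TEXT, quality_score INTEGER,
    enrichment_model_version INTEGER DEFAULT 0, discovered_at TEXT)""")
conn.executemany("INSERT INTO documents (doc_id, discovered_at) VALUES (?, ?)",
                 [(f"d{i}", f"2026-01-0{i + 1}") for i in range(5)])

while True:
    # Resume state is implicit: unenriched or stale-version rows are work-to-do.
    todo = conn.execute(
        "SELECT doc_id FROM documents WHERE editorial_verdict IS NULL "
        "OR enrichment_model_version < ? ORDER BY discovered_at LIMIT ?",
        (CURRENT_MODEL_VERSION, BATCH)).fetchall()
    if not todo:
        break
    with conn:  # one SQLite transaction per chunk, committed atomically
        for (doc_id,) in todo:
            e = fake_enrich(doc_id)
            conn.execute(
                "UPDATE documents SET editorial_verdict=?, quality_score=?, "
                "enrichment_model_version=? WHERE doc_id=?",
                (e["editorial_verdict"], e["quality_score"],
                 CURRENT_MODEL_VERSION, doc_id))

left = conn.execute(
    "SELECT COUNT(*) FROM documents WHERE editorial_verdict IS NULL").fetchone()[0]
print(left)  # 0
```

Killing the process between chunks loses at most one uncommitted batch; the next run's SELECT simply picks those rows up again.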

6.1 Document-Level Enrichment

One structured call per document fills the spreadsheet-style metadata in one pass:

  • summary — one paragraph (3–5 sentences) on what the document is about.
  • content_type — one of research_paper | blog_post | tutorial | opinion | documentation | news | reference | other.
  • difficulty — one of introductory | intermediate | advanced | expert.
  • evidence_level — one of anecdote | case_study | survey | systematic_review | meta_analysis | primary_data | opinion | explainer.
  • quality_score — integer 0–100; the LLM's judgment of epistemic quality (not writing quality).
  • publication_year — integer (year only; full-date extraction was higher-cost for negligible gain).
  • author — raw string, as extracted. Not resolved to an authors table in V1; disambiguation is deferred.
  • editorial_verdict — one of accept | reject | pending.

External URLs and link targets come from the CrawlResult, not the LLM. Fields dropped from v5.0 (stance, main_arguments, key_entities, is_primary_source, editorial_reason, description, reading_time_min, language) were display-only or redundant with other signals (e.g. is_primary_source is implied by evidence_level ∈ {primary_data, case_study, meta_analysis}); removing them cut eight columns and one LLM-schema branch.

6.2 Passage-Level Enrichment

Runs only on documents with editorial_verdict = 'accept'. A passage is a semantically coherent chunk of four paragraphs or ≥ 350 words, whichever is larger. Passages are cut on paragraph boundaries so offsets into full_text remain stable for anchoring notes and highlights. Per passage, the LLM extracts 1–10 atomic claims. Claim schema is in §7.3.
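A minimal chunker consistent with that rule, assuming paragraphs are separated by blank lines and reading "whichever is larger" as requiring both thresholds before a cut:

```python
def split_passages(full_text: str, min_paras: int = 4, min_words: int = 350):
    """Cut on paragraph boundaries, closing a passage once it holds at
    least min_paras paragraphs AND at least min_words words. Returns
    (start, end) character offsets into full_text, so anchors stay stable."""
    passages, start, paras, words, offset = [], None, 0, 0, 0
    for para in full_text.split("\n\n"):
        end = offset + len(para)
        if para.strip():
            if start is None:
                start = offset
            paras += 1
            words += len(para.split())
            if paras >= min_paras and words >= min_words:
                passages.append((start, end))
                start, paras, words = None, 0, 0
        offset = end + 2  # skip the "\n\n" separator
    if start is not None:  # trailing remainder becomes the final passage
        passages.append((start, offset - 2))
    return passages

text = "\n\n".join(f"para {i} " + "word " * 100 for i in range(8))
spans = split_passages(text)
print(len(spans))  # 2: each passage closes at 4 paragraphs (~408 words)
```

Offsets rather than copied text are the stored contract: passages.char_offset_start/end index documents.full_text directly.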

Document-level summary is the human-readable paragraph. Atomic claims at the passage level are for machine consumption (novelty scoring, claim overlap, diffusion edges). The UI shows the summary for humans and never shows raw claims unless the user opens a debug view.

6.3 Embedding

  • Document embedding: bge-small-en-v1.5 over title + " " + summary once enrichment is done (before that, over title alone). Stored in Qdrant atlas_documents, keyed by doc_id.
  • Passage embedding: same model over the passage text. Stored in Qdrant atlas_passages, keyed by passage_id.
  • Claim embedding: same model over claim_text. Stored in Qdrant atlas_claims, keyed by claim_id.

All three collections use scalar quantization and on-disk HNSW.

6.4 Structured Output with Pydantic

The LLM's output is the single biggest source of "parse this JSON blob and hope it's shaped right" bugs in a system like this. V1 closes that class of bug at the source.

Pydantic schemas define the contract. Every document-level enrichment call returns a DocumentEnrichment(BaseModel) with the eight fields from §6.1 as typed attributes (enums for the ordinal categoricals, conint(ge=0, le=100) for quality_score, etc.). Every passage-level call returns a list[PassageClaim]. These schemas are the only definition of the LLM contract — no parallel JSON-schema file, no hand-written prompt stating the expected shape. The schema is also the prompt; the schema is also the validator.

```python
class DocumentEnrichment(BaseModel):
    model_config = ConfigDict(extra="forbid")

    summary: str
    content_type: ContentType        # Enum
    difficulty: Difficulty           # Enum
    evidence_level: EvidenceLevel    # Enum
    quality_score: conint(ge=0, le=100)
    publication_year: int | None
    author: str | None
    editorial_verdict: Literal["accept", "reject", "pending"]
```

extra="forbid" is important: if the LLM hallucinates a field, validation fails loudly rather than the value silently disappearing.

Constrained decoding where possible. The enrichment consumer uses Outlines to bind the Pydantic schema directly into the decoder as a grammar, so the model cannot emit tokens that would produce invalid JSON. This eliminates the "LLM returned malformed JSON" failure mode entirely. Outlines integrates with llama.cpp and vLLM, which cover the local-inference paths we care about. For providers that don't support constrained decoding, instructor serves as the prompt-level fallback: same Pydantic schemas, retries with corrective prompts on validation failure, no architectural change.

Retries are bounded and logged. max_retries = 3 with a corrective prompt on each retry ("your previous response failed validation with: {error}. Return only JSON matching the schema."). Every call — successful or not — is appended as one JSON line to a rotating log at logs/llm_calls.jsonl with fields {call_id, doc_id, stage, prompt_hash, raw_response, parsed_ok, error, model_version, timestamp}. Post-hoc debugging is a jq filter or a one-off pandas.read_json(..., lines=True) load — no schema migration when the log shape evolves, and no write contention with the main SQLite file. The log rolls at 500 MB per file and keeps the last 5 shards by default; llm.debug_log = false in production disables it entirely.

Boundary, not interior. Pydantic models exist at the LLM I/O boundary; the validated output is then either persisted (via SQLModel on the documents table, see §7.8) or passed on as a plain dict. There's no attempt to thread Pydantic types through the whole enrichment pipeline — that's over-engineering. The rule is Pydantic at the boundaries, plain data inside.


7. Database

One SQLite file in WAL mode, opened through a shared connection factory with foreign keys enforced. SQLite is the only structural store (including FTS5 virtual tables for full-text search, §9); Qdrant holds vectors keyed by the same IDs but is treated as a derived cache that can be rebuilt from SQLite.

7.1 Node Tables

Eight structural tables in V1: documents, domains, passages, passage_claims, regions (node-side); citations, domain_links (edge-side); pipeline_runs (mapping-run metadata). A ninth table, stars, is the only user-layer table retained in V1 — trivially small, and the moment you want to revisit a document across sessions without it, the omission stings. history and notes are deferred to V2. The LLM debug trail lives outside SQLite as a rotating JSONL file (see §6.4). All tables are defined via SQLModel (§7.8); the column lists below are the authoritative shape.

documents (21 columns; every column is either queried by the UI, used in traversal scoring, or used to gate pipeline stages — nothing decorative)

| Column | Type | Notes |
|---|---|---|
| doc_id | TEXT PRIMARY KEY | SHA256(canonical_url)[:16] |
| url | TEXT UNIQUE NOT NULL | As discovered |
| canonical_url | TEXT NOT NULL | After redirects + normalization |
| domain_id | TEXT NOT NULL | REFERENCES domains |
| title | TEXT | |
| summary | TEXT | LLM paragraph-length summary |
| full_text | TEXT | Extracted article body as markdown |
| author | TEXT | Raw extracted string; not normalized in V1 |
| publication_year | INTEGER | |
| content_type | TEXT | See §6.1 |
| difficulty | TEXT | See §6.1 |
| evidence_level | TEXT | See §6.1 |
| quality_score | INTEGER | 0–100, LLM |
| editorial_verdict | TEXT | accept / reject / pending |
| word_count | INTEGER | |
| embedding_model | TEXT | e.g. bge-small-en-v1.5 |
| enrichment_model_version | INTEGER | For re-enrichment on upgrade; single provenance signal for every LLM-filled column |
| discovered_at | TEXT NOT NULL | ISO 8601 |
| fetched_at | TEXT | |
| enriched_at | TEXT | |
| updated_at | TEXT NOT NULL | |

Explicitly dropped from v5.0: description, stance, main_arguments, key_entities, editorial_reason, reading_time_min, language, head_fingerprint, render_mode, is_primary_source, publication_date, embedding_source, and all 14 *_source provenance columns. render_mode stays on domains only; is_primary_source is derived from evidence_level at query time.

cluster_id is not a column on documents. The atlas no longer imposes a discrete partition (see §2, §8.3).

domains

| Column | Type | Notes |
|---|---|---|
| domain_id | TEXT PRIMARY KEY | SHA256(domain)[:12] |
| domain | TEXT UNIQUE NOT NULL | |
| is_blacklisted | INTEGER | 0/1 |
| is_blog | INTEGER | 0/1 |
| render_mode | TEXT | static / headless (see §5.3) |
| doc_count | INTEGER | Cached |
| avg_quality | REAL | Cached mean of documents.quality_score |
| source | TEXT | marginalia / discovered |

passages (first-class once passage extraction runs)

| Column | Type | Notes |
|---|---|---|
| passage_id | TEXT PRIMARY KEY | {doc_id}:{passage_index} |
| doc_id | TEXT NOT NULL | REFERENCES documents |
| passage_index | INTEGER NOT NULL | 0-based order in document |
| text | TEXT NOT NULL | |
| char_offset_start | INTEGER NOT NULL | Into documents.full_text |
| char_offset_end | INTEGER NOT NULL | |
| word_count | INTEGER | |
| embedding_model | TEXT | |

passage_claims (first-class once passage extraction runs)

| Column | Type | Notes |
|---|---|---|
| claim_id | TEXT PRIMARY KEY | {passage_id}:{claim_index} |
| passage_id | TEXT NOT NULL | REFERENCES passages |
| doc_id | TEXT NOT NULL | REFERENCES documents; denormalized for query speed |
| claim_text | TEXT NOT NULL | One atomic factual assertion |
| claim_type | TEXT NOT NULL | factual_assertion / definition / methodology_step / result / opinion / recommendation |
| confidence | REAL NOT NULL | 0–1, LLM self-assessed |
| negation | INTEGER NOT NULL | 0 = document asserts; 1 = document refutes |
| entities | TEXT | JSON array |
| embedding_model | TEXT | |

regions (optional, UI annotation only — see §8.3)

| Column | Type | Notes |
|---|---|---|
| region_id | INTEGER PRIMARY KEY | |
| run_id | TEXT NOT NULL | Which mapping run produced it |
| label | TEXT NOT NULL | TF-IDF top keyword, optionally LLM-refined |
| centroid_x | REAL NOT NULL | Diffusion-coord 2D centroid for label placement |
| centroid_y | REAL NOT NULL | |
| bbox_polygon | TEXT | Optional GeoJSON-style bounding polygon in diffusion coordinates |
| doc_count | INTEGER | Documents whose 2D diffusion coord falls in this region's bin |

Regions are a lightweight annotation layer produced by the mapping job for floating labels and optional filter chips. They are not consumed by the traversal engine and have no membership table — a point's region is implicit in its 2D position. If diffusion_map.labels.enabled = false, this table stays empty.

7.2 Edge Tables

citations — directed page-level.

| Column | Type | Notes |
|---|---|---|
| source_doc_id | TEXT NOT NULL | REFERENCES documents |
| target_doc_id | TEXT NOT NULL | REFERENCES documents |
| context_snippet | TEXT | Surrounding text |
| citation_type | TEXT | supports / contradicts / extends / reviews / uses_methodology / background / data_source |
| citation_strength | REAL | 0–1, heuristic |
| discovered_at | TEXT NOT NULL | |

PRIMARY KEY (source_doc_id, target_doc_id)

domain_links — directed domain-to-domain. Named for the edge, not the endpoint: one row is one link relationship between two domain nodes (from Marginalia linkgraph or discovered during crawl).

| Column | Type | Notes |
|---|---|---|
| source_domain_id | TEXT NOT NULL | REFERENCES domains |
| target_domain_id | TEXT NOT NULL | REFERENCES domains |
| weight | INTEGER | Count of underlying page-level links |
| source | TEXT | marginalia / crawl |

PRIMARY KEY (source_domain_id, target_domain_id)

stars — the V1 user-layer table. Small, cheap, and the minimum needed for "I want to come back to this document later." Seed flag feeds back into the crawler (starred docs become seed URLs for future sessions).

| Column | Type | Notes |
|---|---|---|
| doc_id | TEXT NOT NULL | REFERENCES documents |
| collection | TEXT NOT NULL DEFAULT 'default' | Namespacing for ad-hoc grouping |
| starred_at | TEXT NOT NULL | ISO 8601 |
| is_seed | INTEGER NOT NULL DEFAULT 0 | 0/1; flags docs to be reused as seeds on future crawls |

PRIMARY KEY (doc_id, collection)

Tables dropped from v5.0: authors, authorship, anchors, history, notes. Author identity resolution, click-history, and inline annotations are deferred to V2. The author column on documents is a raw extracted string — sufficient for display and grouping, not sufficient for cross-document author analytics. Anchor-text signal is dropped as an edge-weight input in §8.1 to match.

7.3 Housekeeping Tables

pipeline_runs — the only housekeeping table. One row per diffusion-map / mapping rebuild (run_id, started_at, finished_at, doc_count, edge_count, eigensolver, params_json, is_active). Exactly one row has is_active = 1, and that's what the visualizer and route service (§10.6) read.

Tables dropped from v5.0: meta (schema version is tracked by Alembic, §7.8), pipeline_progress (resume state is implicit — any document with NULL editorial_verdict or stale enrichment_model_version is work-to-do). LLM debug output lives in logs/llm_calls.jsonl (§6.4), not in SQLite.

7.4 IDs and Determinism

  • doc_id = SHA256(canonical_url)[:16] — deterministic, so re-crawl or concurrent insert of the same URL hits the same row.
  • domain_id = SHA256(domain)[:12].
  • passage_id = "{doc_id}:{passage_index:04d}".
  • claim_id = "{passage_id}:{claim_index:02d}".

Canonicalization rules for URLs: lowercase scheme+host, drop default port, drop trailing slash on paths, strip fragment, remove known tracking params (utm_*, fbclid, etc.), keep query params in sorted order.
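These rules can be sketched with stdlib urllib; the tracking-parameter list here is illustrative (the spec's "etc." is left open), and the doc_id prefix is assumed to be over the hex digest:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING = {"fbclid", "gclid"}  # illustrative; utm_* handled by prefix below

def canonicalize(url: str) -> str:
    s = urlsplit(url)
    scheme, host = s.scheme.lower(), s.netloc.lower()  # lowercase scheme+host
    # Drop default ports.
    if (scheme, host.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        host = host.rsplit(":", 1)[0]
    path = s.path.rstrip("/") or "/"  # drop trailing slash
    # Strip tracking params, keep the rest sorted; fragment is dropped.
    q = sorted((k, v) for k, v in parse_qsl(s.query)
               if k not in TRACKING and not k.startswith("utm_"))
    return urlunsplit((scheme, host, path, urlencode(q), ""))

def doc_id(url: str) -> str:
    return hashlib.sha256(canonicalize(url).encode()).hexdigest()[:16]

a = canonicalize("HTTPS://Example.com:443/Post/?utm_source=x&b=2&a=1#top")
print(a)  # https://example.com/Post?a=1&b=2
```

Because every surface form of a URL maps to one canonical string, concurrent inserts and re-crawls converge on the same doc_id.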

7.5 Removed Tables

| Removed | Replacement |
|---|---|
| taxonomy_nodes, taxonomy_membership | Data-driven regions in diffusion coordinates |
| clusters, cluster_membership | Continuous diffusion-coordinate structure + optional regions annotation table (§7.1) |
| Generic relations.hypergraph edge store | Dedicated per-edge-type tables above |
| feeds | Dropped — RSS polling is out of scope; seed URLs cover the entry points |
| authors, authorship | V1: raw author string column on documents. V2: re-add once author disambiguation is needed |
| anchors | Dropped as a diffusion edge-weight input; anchor text is not used in V1 |
| stars, history, notes | stars is retained as the single V1 user-layer table (§7.2); history and notes deferred to V2 |
| meta | Alembic tracks schema version (§7.8) |
| pipeline_progress | Resume state is implicit in NULL fields and enrichment_model_version |

7.6 Derived, Not Stored

Co-citation edges, semantic-similarity edges (Qdrant k-NN above a cosine threshold), and claim-overlap edges (claim-embedding cosine above a threshold) are not persisted as tables. They are computed at diffusion-map build time from the claim/document embeddings in Qdrant and live only as rows in the weighted adjacency matrix for that run.

7.7 Indexes

```sql
CREATE INDEX idx_docs_domain             ON documents(domain_id);
CREATE INDEX idx_docs_editorial          ON documents(editorial_verdict);
CREATE INDEX idx_docs_content_type       ON documents(content_type);
CREATE INDEX idx_docs_fetched            ON documents(fetched_at);
CREATE INDEX idx_docs_enriched           ON documents(enriched_at);
CREATE INDEX idx_docs_enrichment_version ON documents(enrichment_model_version);
CREATE INDEX idx_citations_target        ON citations(target_doc_id);
CREATE INDEX idx_domain_links_target     ON domain_links(target_domain_id);
CREATE INDEX idx_passages_doc            ON passages(doc_id);
CREATE INDEX idx_claims_doc              ON passage_claims(doc_id);
CREATE INDEX idx_regions_run             ON regions(run_id);
CREATE INDEX idx_stars_collection        ON stars(collection);
```

Indexes removed along with their tables: idx_cluster_mem_cluster, idx_authorship_author, idx_anchors_dest, idx_history_doc, idx_notes_doc.

7.8 SQLModel, Alembic, and Inspection

SQLModel replaces raw SQLite access for application logic. Each table from §7.1–§7.3 is declared as a SQLModel class — columns become type-hinted attributes, which gives autocomplete in the IDE, mypy type checking, and refactor-by-rename. SQLModel is Pydantic-compatible: the same class validates row shape on the way in and persists on the way out, so the boundary model from §6.4 can often be the DB model.

```python
class Document(SQLModel, table=True):
    __tablename__ = "documents"
    model_config = ConfigDict(extra="forbid")

    doc_id: str = Field(primary_key=True)
    url: str = Field(unique=True, index=True)
    canonical_url: str
    domain_id: str = Field(foreign_key="domains.domain_id", index=True)
    title: str | None = None
    summary: str | None = None
    full_text: str | None = None
    author: str | None = None
    publication_year: int | None = None
    content_type: str | None = Field(default=None, index=True)
    difficulty: str | None = None
    evidence_level: str | None = None
    quality_score: int | None = None
    editorial_verdict: str | None = Field(default=None, index=True)
    word_count: int | None = None
    embedding_model: str | None = None
    enrichment_model_version: int = 0
    discovered_at: str
    fetched_at: str | None = None
    enriched_at: str | None = None
    updated_at: str
```

Write-path split. Application logic (route queries, candidate scoring, detail-panel fetches) uses the SQLModel ORM. Bulk ingestion (crawl-time row insert at 10K+ rows/minute) uses raw executemany or SQLModel's bulk-insert path — the per-row ORM overhead is not worth paying at ingestion scale. This is the same split Crawl4AI already assumes.

Alembic for migrations. Schema changes are authored as Alembic revisions auto-generated by diffing the SQLModel classes against the live DB. Migrations are forward-only and idempotent; the meta table is unnecessary because Alembic tracks its own alembic_version row.

Datasette for visibility. pip install datasette && datasette relations.db gives a browsable web UI with faceted search, a SQL console, and a JSON API over every table. It is kept running in a background tab during development — no custom admin UI is in scope. A companion Jupyter notebook with ~12 canned queries (row counts by content_type, NULL rates per enrichment column, most-recent documents, enrichment version distribution) covers the other 10% of inspection needs.

/scripts/db_inspection_report.py CLI. A single-command health check: writes a text report of row counts, NULL rates by column, distinct-value cardinalities for categoricals, and a SQLModel-vs-DB schema diff that catches drift. Runs in <2 s against a 10M-row DB and is expected to be the first command run at the start of every session.


8. Diffusion Map Layout

A single offline job reads from the database, writes a mapping Parquet and a persisted adjacency artifact (see §8.5), and optionally populates the regions table. The visualizer reads these artifacts and never triggers this job. Every run gets a row in pipeline_runs; exactly one run is flagged is_active = 1, which is what the visualizer and route service read. Previous runs are retained for rollback.

8.1 Adjacency Matrix

Weights are a linear combination of edge-type contributions, each normalized per edge type before summing. Coefficients are config values; defaults:

| Edge | Default weight | Source |
|---|---|---|
| Citations supports / extends | 1.0 | citations with citation_type |
| Citations background / other | 0.5 | citations |
| Co-citations | 0.4 | citations self-join at build time |
| Semantic similarity | 0.5 | Qdrant k-NN (k=20) on atlas_documents, cosine ≥ 0.70 |
| Claim overlap | 0.6 | Qdrant k-NN on atlas_claims, cosine ≥ 0.85; edge weight = count of matched pairs; claims are considered "overlapping" when embedding cosine is above threshold, so paraphrases count |
| Domain links | 0.1 | domain_links |

Which edge types are included is controlled by diffusion_map.edges: [...] in config — any subset can be disabled for a run (useful for ablation). Anchor-text and shared-author edges are no longer supported in V1; re-add once the corresponding tables come back.
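A minimal sketch of the combination step, assuming "normalized per edge type" means scaling each type to a max of 1 before applying its coefficient (the spec leaves the normalization scheme to config):

```python
from collections import defaultdict

# Illustrative subset of the default coefficients from the table above.
DEFAULT_COEFFS = {"citation_support": 1.0, "semantic": 0.5, "claim_overlap": 0.6}

def combine_edges(edge_sets: dict, coeffs: dict = DEFAULT_COEFFS) -> dict:
    """edge_sets maps edge type -> {(src, dst): raw_weight}. Each type is
    normalized to its own max, then summed per pair with its coefficient."""
    combined = defaultdict(float)
    for etype, edges in edge_sets.items():
        if etype not in coeffs or not edges:
            continue  # disabled types (config ablation) contribute nothing
        top = max(edges.values())
        for pair, w in edges.items():
            combined[pair] += coeffs[etype] * (w / top)
    return dict(combined)

W = combine_edges({
    "citation_support": {("a", "b"): 1},
    "semantic": {("a", "b"): 1.0, ("a", "c"): 0.5},
})
print(W)  # {('a', 'b'): 1.5, ('a', 'c'): 0.25}
```

Dropping a key from the coefficient dict is exactly the ablation switch the config exposes.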

8.2 Computation

  1. Load the enabled edge sets via the infolake.core.graph helper that wraps networkx.DiGraph. Add derived edges (co-citations, semantic, claim overlap) at this stage.
  2. Build the weighted adjacency matrix W via networkx.to_scipy_sparse_array().
  3. Compute the degree matrix D and the anisotropic-normalized kernel: K = D^{-α} W D^{-α} with α = 1 (default) to decouple sampling density from geometry.
  4. Row-normalize to the random-walk transition matrix P = D_K^{-1} K.
  5. Symmetrize via S = D_K^{1/2} P D_K^{-1/2} so the spectrum is real.
  6. Compute the top-k eigenvectors (k = 10 default) via one of:
     • GPU path (default when available): cupyx.scipy.sparse.linalg.eigsh for Lanczos on GPU.
     • CPU path (fallback): scipy.sparse.linalg.eigsh (ARPACK). Adequate up to ~5 M nodes.
     • Very-large path (>10 M nodes): LOBPCG (scipy.sparse.linalg.lobpcg or GPU equivalent) with a deflated preconditioner. Configurable via diffusion_map.eigensolver.
  7. Drop the trivial constant eigenvector.
  8. Sign fixing. Eigenvectors are only defined up to sign. To keep coloring and labels stable across rebuilds, orient each retained eigenvector so that its dot product with the corresponding eigenvector from the previous active run (projected onto shared docs) is positive. On the first run, fix the sign so that the document with the lexicographically smallest doc_id among the top-10% absolute values receives a positive component.
  9. Take eigenvectors 2 and 3 as the 2D map coordinates (x, y). Retain all k for the traversal engine's diffusion-distance computation (Euclidean distance in the full k-dim diffusion space, weighted by eigenvalues to the power of t — default t = 1).

Hyperparameters (α, t, k, k-NN k, cosine thresholds, and eigensolver choice) are config values under diffusion_map.*.
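The pipeline above can be condensed into a dense NumPy toy (the real job uses sparse eigsh/LOBPCG as listed; this just checks the algebra on a six-node graph of two weakly linked triangles):

```python
import numpy as np

def diffusion_coords(W: np.ndarray, alpha: float = 1.0, k: int = 3, t: float = 1.0):
    d = W.sum(axis=1)
    K = W / np.outer(d ** alpha, d ** alpha)      # anisotropic kernel D^-a W D^-a
    dK = K.sum(axis=1)
    P = K / dK[:, None]                            # random-walk transition matrix
    S = np.sqrt(dK)[:, None] * P / np.sqrt(dK)[None, :]  # symmetric conjugate
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(dK)[:, None]              # back to eigenvectors of P
    # Drop the trivial constant eigenvector (eigenvalue 1); scale by lambda^t.
    return (vals[1:k + 1] ** t) * psi[:, 1:k + 1]

W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
coords = diffusion_coords(W, k=2)
# The second eigenvector separates the two triangles with opposite signs.
print(np.sign(coords[0, 0]) != np.sign(coords[5, 0]))  # True
```

Symmetrizing before eigh and mapping back via D_K^{-1/2} is what makes the spectrum real, exactly as step 5 intends.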

8.3 Regional Annotation

Clustering as a first-class partition is removed. The diffusion coordinates already position documents meaningfully; the atlas communicates structure via continuous coloring and floating regional labels.

Continuous coloring. Color is a deterministic function of the eigenvectors, computed per document by the mapping job and stored in the mapping Parquet as pre-baked RGB values (so the visualizer just reads and paints):

  • Hue from eigenvector 2, mapped through a perceptually uniform colormap (e.g. HSLuv hue).
  • Saturation from eigenvector 3.
  • Lightness from quality_score (0–100 → 0.3–0.9), so higher-quality documents pop against the background.

The eigenvector sign-fixing step in §8.2 keeps the color scheme stable across runs on unchanged regions.
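The color-baking step can be sketched like this. It is an illustration only: the stdlib colorsys HLS conversion stands in for the configured HSLuv colormap, and min-max normalization of the eigenvector components is an assumption, not a spec requirement.

```python
import colorsys
import numpy as np

def bake_rgb(eig2, eig3, quality_score):
    """Pre-bake per-document RGB for the mapping Parquet.

    eig2, eig3:    eigenvector-2 / eigenvector-3 components per document.
    quality_score: 0-100, mapped to lightness 0.3-0.9.
    Returns a uint8 (n, 3) array of RGB values.
    """
    def norm(v):  # min-max to [0, 1]; assumed normalization for the sketch
        v = np.asarray(v, dtype=float)
        span = v.max() - v.min()
        return (v - v.min()) / span if span else np.zeros_like(v)

    hue = norm(eig2)                                   # eigenvector 2 -> hue
    sat = norm(eig3)                                   # eigenvector 3 -> saturation
    light = 0.3 + 0.6 * np.asarray(quality_score, dtype=float) / 100
    rgb = [colorsys.hls_to_rgb(h, l, s) for h, l, s in zip(hue, light, sat)]
    return (np.asarray(rgb) * 255).round().astype(np.uint8)
```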

Regional labels. Label generation is a sliding-window TF-IDF pass over document summary text, binned on a coarse 2D grid over the diffusion coordinates:

  1. Tile the 2D plane into ~200 bins (exact count configured; square grid with axis-aligned bin size chosen so the most populated bins hold ~50–500 docs).
  2. For each bin with ≥ diffusion_map.labels.min_bin_docs documents (default 20), compute TF-IDF over the bin's aggregated summary text against the full-corpus background.
  3. The top keyword for each populated bin becomes its candidate label.
  4. If diffusion_map.labels.refine_with_llm = true, send the top ~30 bins (by doc count) to the LLM in one structured call for readability polish — e.g. turn "topology, optimization" into "Topology Optimization". Cheap (one call per run, negligible alongside the eigensolve).
  5. Each accepted label becomes one row in regions with its bin's centroid.
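The grid-binned TF-IDF pass can be sketched with the stdlib alone. This is a toy version: real bin sizing is configured (not a fixed square grid over unit-normalized coordinates), tokenization is a bare lowercase split, and the optional LLM polish step is omitted.

```python
import math
from collections import Counter, defaultdict

def label_bins(coords, summaries, grid=14, min_bin_docs=2):
    """Tile the 2D diffusion plane, then label each populated bin with its
    top TF-IDF keyword against the full-corpus background.

    coords: list of (x, y) diffusion coordinates; summaries: summary strings.
    Returns {(bx, by): top_keyword} for bins with >= min_bin_docs docs.
    """
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]

    def key(c):  # axis-aligned square bins over the coordinate bounding box
        bx = min(int((c[0] - min(xs)) / (max(xs) - min(xs) + 1e-9) * grid), grid - 1)
        by = min(int((c[1] - min(ys)) / (max(ys) - min(ys) + 1e-9) * grid), grid - 1)
        return (bx, by)

    bins = defaultdict(list)
    for c, s in zip(coords, summaries):
        bins[key(c)].append(s)

    # Document frequency over the whole corpus = the background model.
    df = Counter()
    for s in summaries:
        df.update(set(s.lower().split()))

    labels, n = {}, len(summaries)
    for b, texts in bins.items():
        if len(texts) < min_bin_docs:
            continue
        tf = Counter(" ".join(texts).lower().split())
        labels[b] = max(tf, key=lambda w: tf[w] * math.log(n / df[w]))
    return labels
```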

Optional HDBSCAN-based regions. When diffusion_map.labels.source = "hdbscan" the job runs HDBSCAN on the 2D diffusion coordinates to find dense components and uses the component's own summaries for TF-IDF instead of grid bins. HDBSCAN here is a utility for finding well-populated regions to label, not a first-class partition; documents are not assigned a region membership, and the traversal engine does not consult HDBSCAN output.

8.4 Incremental Updates

When new documents arrive between rebuilds, project them into the existing diffusion space via the Nyström extension:

  1. For each new document, compute its edges to existing documents (citations already stored; semantic/claim-overlap via Qdrant k-NN against existing vectors).
  2. Form a new row of the kernel matrix k_new ∈ R^N (one entry per existing document; most are zero thanks to sparsity).
  3. The new document's diffusion coordinates are approximated as y_new[j] = (1 / λ_j) · k_new · v_j for each retained eigenvector v_j with eigenvalue λ_j, using the stored eigendecomposition.
  4. Compute and store the new document's baked RGB color from its eigenvector components (same formula as §8.3). That is the full Nyström step — no cluster-assignment, no provisional flag, no membership table.
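The projection in step 3 is a one-liner over the stored eigendecomposition; a minimal sketch (names are illustrative):

```python
import numpy as np

def nystrom_project(k_new, eigvals, eigvecs):
    """Nystrom extension: y_new[j] = (1 / lambda_j) * (k_new . v_j).

    k_new:   (N,) kernel row against existing documents (mostly zeros).
    eigvals: (k,) retained eigenvalues lambda_j.
    eigvecs: (N, k) retained eigenvectors v_j as columns.
    Returns the new document's (k,) diffusion coordinates.
    """
    return (np.asarray(k_new) @ np.asarray(eigvecs)) / np.asarray(eigvals)
```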

Drift detection as the primary rebuild trigger. After each incremental batch:

  1. Sample a subgraph of diffusion_map.drift_sample_size landmark documents (default 1000, biased toward documents touched by the latest batch or their neighbors).
  2. Recompute the top ~5 eigenvectors restricted to this subgraph using the current full adjacency.
  3. Compare the restriction of the stored eigenvectors to these landmarks with the freshly computed ones via Procrustes alignment; record the Procrustes distance.
  4. Trigger a full rebuild when that distance exceeds diffusion_map.drift_threshold (default 0.15).

A corpus-size rule (full rebuild when the corpus grows by ≥ 50% since the last full rebuild) remains as a ceiling in case drift detection undercounts. Config changes (edge weights, inclusion list, α, t, k) force an immediate full rebuild.
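The drift check in steps 3–4 can be sketched with scipy's Procrustes routine, which standardizes both matrices and reports a normalized disparity; here that disparity stands in for the spec's Procrustes distance (an assumption about the exact metric):

```python
import numpy as np
from scipy.spatial import procrustes

def drift_exceeded(stored_vecs, fresh_vecs, threshold=0.15):
    """Compare stored eigenvectors (restricted to the sampled landmarks)
    against freshly recomputed ones; True means a full rebuild is due.
    Rotation/reflection/scale are factored out, so only genuine shape
    change in the landmark geometry counts as drift."""
    _, _, disparity = procrustes(np.asarray(stored_vecs), np.asarray(fresh_vecs))
    return bool(disparity > threshold)
```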

8.5 Output

The mapping job writes three artifacts per run, keyed by run_id:

  • mapping_<run_id>.parquet. Columns: doc_id, dim_1 … dim_k, color_r, color_g, color_b, and joined metadata for tooltips (title, domain, quality_score, content_type, summary-snippet). No cluster_id or cluster_label. The visualizer reads dim_1, dim_2, the baked RGB, and the joined metadata for the scatterplot; the traversal engine and route service read the full k-dim coordinates plus doc_id index.
  • adjacency_<run_id>.npz. The row-symmetrized weighted adjacency as a scipy.sparse.csr_matrix. Consumed by the route service (§10.6).
  • node_index_<run_id>.parquet. Two columns (row_index, doc_id) giving the row/column order of the adjacency matrix. Consumed by the route service to map between doc_id and matrix indices.

Writes are staged to a temp path and atomically renamed. After all three artifacts are in place, pipeline_runs.is_active is flipped to the new run_id in a single transaction.
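The stage-then-rename pattern can be sketched as below (a generic helper, not the project's actual writer; os.replace is atomic on POSIX when source and destination share a filesystem, which is why the temp file is created in the destination directory):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write `data` to `path` so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)  # temp file on the same filesystem
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # data durable before the rename
        os.replace(tmp, path)          # atomic swap into place
    except BaseException:
        os.unlink(tmp)                 # clean up the staged file on failure
        raise
```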


9. Full-Text Search (SQLite FTS5)

Full-text search runs inside the same SQLite file as the source rows, via the built-in FTS5 virtual-table extension. Meilisearch (v5.0's external FTS process) is removed — one fewer process to supervise, no cross-store consistency to maintain, and search queries join naturally against the other tables.

Two FTS5 virtual tables, both contentless_delete='1' so content lives only in the source tables:

  • documents_fts — keyed by doc_id, indexed columns title, summary, author. Populated after each enrichment batch commits. Used by /api/search for the atlas's dim-non-matches UX and for free-text filter-chip queries.
  • passages_fts (optional, enabled via fts.passages_indexed = true) — keyed by passage_id, indexed column text. Used by the detail panel's "search inside this corpus" mode.

Filtering is done in SQL by joining FTS matches against documents and applying predicates on content_type, evidence_level, quality_score, publication_year, and editorial_verdict — the same filterable-field set Meilisearch exposed, now expressed as ordinary WHERE clauses. The atlas's spatial filters (polygon lasso, k-NN radius, §11.1) are still computed frontend-side on the mapping Parquet.

FTS5 is maintained by triggers on documents (insert / update / delete → corresponding FTS row), so no manual sync step is needed. A one-shot rebuild is available via doctor.py --rebuild-fts when index corruption is suspected.
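A minimal sqlite3 sketch of this wiring, under stated simplifications: table and column names are assumed from §7/§9, only the insert trigger is shown (delete/update are analogous), and the contentless_delete option is omitted so the snippet runs on older SQLite builds.

```python
import sqlite3

def setup_fts(conn: sqlite3.Connection) -> None:
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS documents(
      doc_id TEXT PRIMARY KEY, title TEXT, summary TEXT, author TEXT);

    CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts
      USING fts5(title, summary, author);

    -- Trigger keeps the index in sync on insert; delete/update analogous.
    CREATE TRIGGER IF NOT EXISTS documents_ai AFTER INSERT ON documents BEGIN
      INSERT INTO documents_fts(rowid, title, summary, author)
      VALUES (new.rowid, new.title, new.summary, new.author);
    END;
    """)

def search(conn: sqlite3.Connection, query: str, limit: int = 10):
    # ORDER BY rank uses FTS5's built-in BM25; the join recovers doc_id.
    return conn.execute("""
      SELECT d.doc_id, d.title
      FROM documents_fts JOIN documents d ON d.rowid = documents_fts.rowid
      WHERE documents_fts MATCH ? ORDER BY documents_fts.rank LIMIT ?""",
      (query, limit)).fetchall()
```

Adding WHERE predicates on documents columns (content_type, quality_score, …) to the SELECT gives the filtered-search shape described above.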

Ranking uses FTS5's default BM25, adequate for the scale and query shape here. Typo tolerance and prefix search — Meilisearch's signature features — are not needed for atlas navigation (the user is looking at points, not typing query strings as their primary affordance), and can be added later via trigram indexing if they become necessary.

cluster_id is not a searchable field — use polygon lasso or k-NN radius on the atlas for spatial filtering (§11.1).


10. Traversal Engine

A pure Python function over SQLite + Qdrant, independent of visualization. It drives the depth-first browser and also powers route-finding (§10.6).

10.1 Inputs

  • current_doc_id — a doc_id from the documents table. Because doc_id = SHA256(canonical_url)[:16] (§7.4), the same URL always maps to the same ID across sessions, across machines, and across re-crawls. The caller obtains current_doc_id from a click event (atlas hover tooltip → click, breadcrumb node, or explicit search result).
  • session_history — per-session state:
    • seen_doc_ids: set of doc_ids visited this session.
    • seen_claim_ids: union of claim_ids from seen_doc_ids (computed once, updated on each step).
    • session_centroid: running mean of the embeddings of seen_doc_ids (updated on each step).
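The ID derivation is one line; a sketch, assuming the [:16] truncation applies to the hex digest (the function name is illustrative):

```python
import hashlib

def make_doc_id(canonical_url: str) -> str:
    """doc_id = SHA256(canonical_url)[:16]: the same URL maps to the same
    ID across sessions, machines, and re-crawls (hex-digest truncation
    assumed)."""
    return hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()[:16]
```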

10.2 Candidate Collection (union, deduplicated)

Three sources, now genuinely distinct:

  • Citation neighbors — outgoing and incoming via citations, with the citation_type carried through. Structural signal.
  • Embedding-space k-NN — Qdrant k-NN on the document embedding (k = 50, cosine ≥ 0.60, excluding seen_doc_ids). Pure semantic signal, taxonomy-agnostic.
  • Diffusion-space k-NN — top-M documents by Euclidean distance in the full k-dim diffusion space (eigenvalue-weighted), excluding seen_doc_ids. Manifold-aware signal that reflects graph connectivity, not just content similarity. Default M = 50.

Diffusion-space k-NN replaces the previous "cluster neighbors" source; it provides the same orientation benefit (nearby-on-the-manifold documents) without requiring a discrete partition.
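The union-with-dedup can be sketched as below. A toy version: candidates are bare doc_id lists here, whereas the real collector carries per-source metadata such as citation_type; the first-source-wins rule is an assumption chosen so citation metadata survives the merge.

```python
def collect_candidates(citation, embedding_knn, diffusion_knn, seen):
    """Union of the three candidate sources, deduplicated by doc_id,
    excluding already-seen docs. Returns {doc_id: winning_source}."""
    out = {}
    for source, ids in (("citation", citation),
                        ("embedding", embedding_knn),
                        ("diffusion", diffusion_knn)):
        for doc_id in ids:
            if doc_id not in seen and doc_id not in out:
                out[doc_id] = source
    return out
```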

10.3 Scoring

Each candidate gets a composite score. All sub-signals are normalized to [0,1]; the final score is a config-weighted sum (default weights shown):

Component | Weight | Signal
Marginal information gain | 0.40 | Count of candidate claims not in seen_claim_ids divided by the candidate's total claim count. Falls back to Euclidean distance from session_centroid in embedding space when passage extraction hasn't run.
Epistemic quality | 0.25 | Composite of evidence_level (ordinal rank) and quality_score / 100 — both stored in documents per §7.1. is_primary_source is no longer a column; treat evidence_level ∈ {primary_data, case_study, meta_analysis} as the "primary source" signal when the session flag is set.
Diffusion distance | 0.20 | Distance from current_doc_id in the k-dim diffusion space. Bimodal preference: the engine returns one "close" and one "far" candidate when possible, not uniformly closest.
Citation relationship | 0.15 | Present if a direct citations edge exists: supports/extends rank high, contradicts ranks high for deliberate contrast sessions, background ranks low.
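The composite itself is a weighted sum; a minimal sketch (the signal key names are illustrative stand-ins for the four components, not config keys from the codebase):

```python
def composite_score(signals, weights=None):
    """Config-weighted sum over sub-signals pre-normalized to [0, 1].
    Missing signals contribute 0 (e.g. no direct citation edge)."""
    weights = weights or {
        "information_gain": 0.40,
        "epistemic_quality": 0.25,
        "diffusion_distance": 0.20,
        "citation_relationship": 0.15,
    }
    return sum(w * signals.get(k, 0.0) for k, w in weights.items())
```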

10.4 Hard Filters

editorial_verdict = 'accept', domain blacklist exclusion, and optionally evidence_level IN ('primary_data', 'case_study', 'meta_analysis') when the session requests primary-source mode. The ordinal categoricals stay in SQL rather than as Python enums inside the scoring loop — keeps the filter cheap and indexable.

10.5 Output

Top-K candidates (default K = 3), each with:

  • doc_id, title, domain, summary.
  • A one-line reason derived from the dominant scoring component, e.g. "adds 4 novel claims about X," "contradicts current doc on Y," "primary source for Z," "bridges to a distant region."
  • largely_overlaps flag: set when seen-claim overlap ≥ 70%; the UI dims these candidates but returns them (dim-don't-hide).

10.6 Route-Finding

In addition to single-step candidates, the engine supports start-to-end route planning. The user picks a start document and an end document (from the atlas via right-click, search, or a star). The engine returns N alternative routes (default N = 3) through the diffusion graph.

Algorithm.

  1. Load the persisted adjacency for the currently active pipeline_runs row: adjacency_<run_id>.npz + node_index_<run_id>.parquet (§8.5). Lazily construct a networkx.DiGraph on first request and hold it for the life of the backend process per active run_id.
  2. Edge-type gating. The request carries edge_types ∈ {"strict", "loose"} (config default routes.edge_types = "strict"):
     • Strict: use the citation subgraph only (subset the adjacency to rows/cols where the underlying edge was a citation). Most legible routes; may fail to find a path when citation structure is sparse.
     • Loose: use the full weighted adjacency. Backend attempts strict first; on failure to find any path (or fewer than N paths), falls back to loose and sets edge_types_used = "loose_fallback" in the response so the UI can indicate that.
  3. Edge cost on the chosen subgraph is 1 / (weight × epistemic_quality(target)) so high-quality, high-connectivity paths are cheapest.
  4. Compute top 10 candidate routes via networkx.shortest_simple_paths (Yen-style generator), capped at a max hop count (default routes.max_hops = 8).
  5. Greedy diversification. From those 10, greedy-select N (default 3) that maximize pairwise Jaccard distance on intermediate node sets: seed with the cheapest route; each subsequent pick maximizes the minimum Jaccard distance to already-picked routes.
  6. For each selected route, run a simulated forward pass of the traversal engine along the route: at each hop, compute the dominant scoring component relative to the simulated seen-set at that position, and attach a one-line reason ("adds 4 novel claims," "contradicts previous hop," etc.).
  7. Compute per-route metadata:
     • total_cost (sum of edge costs).
     • hop_count.
     • mean_quality (mean quality_score of intermediate documents).
     • diffusion_distance_traveled (sum of Euclidean distances between consecutive hops in k-dim diffusion space).
     • via: 2–3 noun phrases summarizing the dominant topics of intermediate documents (TF-IDF on their summary text against the corpus background).
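The greedy-diversification step can be sketched as follows (routes represented as plain node-id lists, assumed sorted cheapest-first by the upstream Yen-style generator):

```python
def diversify(routes, n=3):
    """Greedy top-N selection maximizing pairwise Jaccard distance on
    interior node sets. Seed with the cheapest route; each subsequent
    pick maximizes the minimum Jaccard distance to routes already picked."""
    def jdist(a, b):
        sa, sb = set(a[1:-1]), set(b[1:-1])  # interior nodes only
        if not sa and not sb:
            return 0.0
        return 1.0 - len(sa & sb) / len(sa | sb)

    picked = [routes[0]]                      # cheapest route seeds the set
    while len(picked) < min(n, len(routes)):
        rest = [r for r in routes if r not in picked]
        if not rest:
            break
        picked.append(max(rest, key=lambda r: min(jdist(r, p) for p in picked)))
    return picked
```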

Caching. LRU cache on the route service, keyed by (start_doc_id, end_doc_id, run_id, edge_types); default capacity 256 entries.

Upstream contract. The mapping job (§8.5) is required to persist both the adjacency and the node index. Route-finding never triggers diffusion-map recomputation.

The route-finding API lives alongside the candidate API in the backend (§11.3).


11. Visualization and Frontend

The frontend is a small web app served by a local Python backend. It imitates a simplified Google Maps: one primary canvas (the atlas map) with a detail drawer that opens into the depth-first browser. A search bar sits above the canvas.

11.1 Atlas View

deck.gl scatterplot of diffusion coordinates (dim_1, dim_2), colored by the pre-baked RGB values from the mapping Parquet (§8.3), sized by quality_score. Points are pickable; hover shows a compact tooltip (title, domain, one-line summary, quality_score).

Semantic zoom.

  • Zoomed out: 2D-KDE contours or density hexbins drawn beneath the point cloud, with points fading to low alpha. Regional labels (from the regions table) float over populated areas.
  • Mid zoom: Hexbins fade out, points fade in, labels still visible for coarser regions only.
  • Zoomed in: Full scatter. Only the most specific regional labels visible; coarse labels fade.

On-screen element count stays roughly constant across zoom levels.

Interaction.

  • Single click on a point → opens the depth-first browser anchored on that document.
  • Double click on a point → opens the external URL in a new tab.
  • Right click on a point → context menu with "Set as start" / "Set as end" for route-finding (§11.2). Once start is set, a persistent badge marks the start point and a floating panel shows: "Route from [Title]. Click a point or use search to set end. Esc to cancel."
  • Polygon lasso (shift-drag) → filter the atlas to points inside the drawn polygon. Matching points stay at full alpha; non-matches dim to 0.08. Replaces the old "filter to cluster X" UX.
  • k-NN radius from point (alt-click) → filter to the clicked point's N nearest neighbors in diffusion space.

Search queries route through FTS5; matching points stay bright while non-matching points are dimmed via the GPU alpha channel (alpha = 0.85 for matches, 0.08 for non-matches).

Edges are not drawn on the atlas (density and color already communicate structure), except in route mode: see §11.2.

11.2 Depth-First Browser and Route Mode

Free navigation (default). User clicks a point on the atlas → browser opens on that doc → traversal engine returns K candidates → user clicks one → it becomes current; the previous collapses into a breadcrumb above.

Route mode. Triggered by setting a start and an end in the atlas (§11.1). The flow:

  1. The frontend calls /api/routes with start_doc_id, end_doc_id, n_routes, max_hops, edge_types. The response drives two UI surfaces simultaneously:
     • Atlas overlay. Draw the N selected route polylines as a selective overlay (the edges-on-atlas rule bends only here; ~24 edges total for the default N=3, max_hops=8). Each route gets a distinct color; non-route points dim to ~0.3 alpha. Hovering a route card in the sidebar thickens that route's polyline.
     • Sidebar route-comparison panel. Vertical stack of N cards, one per route. Each card shows: hop count + total cost badge row, the via phrase, first and last interior document titles (orientation anchors), mean quality and diffusion-distance-traveled as small bar indicators.
  2. Commit to a route. Clicking a card makes it the active route. The depth-first browser switches to Google-Maps-step-by-step style:
     • The breadcrumb at top becomes the full route: past hops solid, current highlighted, upcoming dimmed with a thin connector.
     • Traversal-engine candidate slot 1 is pinned to the next hop on the route, with a distinct background and a "Continue route →" affordance.
     • Slots 2 and 3 are normal free-navigation candidates (computed on-the-fly as usual) so the user can see adjacent non-route options without leaving route mode.
  3. Deviation handling (V1, strict). Clicking a non-route candidate abandons the route. A confirmation toast appears ("Route abandoned. You can resume from the sidebar."), and the UI reverts to free navigation with full session history preserved. Adaptive rerouting is deferred to V2.
  4. Exit. Persistent "Abandon route" button in the breadcrumb; Esc also abandons. Reaching the end document triggers an "Arrived" state with a brief completion badge.

The detail panel and stars behave identically in free and route mode. (History and notes are V2.)

11.3 Backend API

The backend is a FastAPI app. Every request body and response is a Pydantic model — which means FastAPI auto-generates the OpenAPI schema at /docs, the frontend's TypeScript types can be generated from that schema rather than hand-maintained, and model_config = ConfigDict(extra="forbid") on request models catches typo'd query params at the edge instead of in application code. The table below is the contract; the Pydantic classes in infolake.backend/api/schemas.py are the implementation.

All endpoints return JSON; all are GET unless noted.

Endpoint | Purpose | Key params
/api/map-data | Initial map load; returns mapping Parquet as Arrow/JSON including baked RGB | run_id (optional, defaults to active)
/api/regions | List regional labels for label layer | run_id (optional)
/api/doc/<doc_id> | Full document metadata |
/api/search | SQLite FTS5 query against documents_fts (and passages_fts when enabled) | q, filters, limit
/api/citations/<doc_id> | 1-hop or 2-hop citation neighborhood for detail panel | hops
/api/candidates | Traversal engine's top-K next candidates | current_doc_id, seen_doc_ids[]
/api/routes | Route-finding: N diversified routes between two docs | start_doc_id, end_doc_id, n_routes, max_hops, edge_types (strict / loose)
/api/route-progress | Per-hop lookahead while walking a route | route_id, position
/api/passages/<doc_id> | Passages + claims for a document |
/api/stars | List stars, optionally filtered | collection
/api/star (POST) | Add/remove a star | body: {doc_id, collection, is_seed}
/api/stats | Global counts for the sidebar |

Endpoints dropped from v5.0: /api/history, /api/notes/<doc_id> — the underlying tables are not in V1 (§7.2).

/api/routes response shape (declared as a Pydantic RouteListResponse model; shown here as JSON for readability):

{
  run_id: "...",
  edge_types_used: "strict" | "loose" | "loose_fallback",
  routes: [
    {
      route_id: "...",
      total_cost: 3.42,
      hop_count: 5,
      mean_quality: 0.71,
      diffusion_distance: 1.88,
      via: ["topology optimization", "FEM"],
      hops: [ { doc_id, title, reason, ... }, ... ]
    },
    ...
  ]
}
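A Pydantic sketch of that response shape (class and field layout assumed from the JSON above; the real classes live in the backend schemas module and may differ in detail):

```python
from typing import List, Literal
from pydantic import BaseModel, ConfigDict

class Hop(BaseModel):
    model_config = ConfigDict(extra="allow")  # the "..." carries extra metadata
    doc_id: str
    title: str
    reason: str

class Route(BaseModel):
    route_id: str
    total_cost: float
    hop_count: int
    mean_quality: float
    diffusion_distance: float
    via: List[str]
    hops: List[Hop]

class RouteListResponse(BaseModel):
    model_config = ConfigDict(extra="forbid")  # unknown keys fail at the edge
    run_id: str
    edge_types_used: Literal["strict", "loose", "loose_fallback"]
    routes: List[Route]
```

FastAPI derives the OpenAPI entry for /api/routes from a model like this, which is what makes frontend type generation possible.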

/api/route-progress response shape:

{ next_hop, remaining_hops, seen_claim_overlap_to_next }

Everything the frontend needs is here; nothing is left as "implementation detail."

11.4 Theming

Dark theme throughout; cyan accent. All signals, textual and graphical alike, use clearly visible color contrast. Specifics (exact hex values, sidebar widths) are frontend implementation details and not frozen in this spec.


12. Storage Layout

All persistent state — SQLite DB (including FTS5 virtual tables), mapping Parquet + adjacency artifact + node index, Qdrant collections, LLM weights, crawl checkpoints, session state files, and the logs/llm_calls.jsonl debug log — lives on the host's fastest available disk, configured via storage.data_dir. RAM holds the active working set: Qdrant HNSW index, SQLite WAL buffer, diffusion-map coordinates for the active run, the lazily-constructed route graph, and the running LLM. A memory governor caps total usage; the cap is a config value so the same code runs on a constrained laptop and on a multi-GPU server. (The governor is optional; skip it during the MVP phase.)


13. Configuration

config.json is parsed into a Pydantic AppConfig(BaseModel) at startup with model_config = ConfigDict(extra="forbid"). Typos like max_hop vs. max_hops fail loudly at boot instead of silently defaulting to the wrong value; defaults are declared as Pydantic field defaults so "what keys exist and what do they mean" is answered by reading one file. The sub-models below are the authoritative list of keys.

  • crawl.concurrency, crawl.memory_threshold_percent, crawl.rate_limits, crawl.render_mode_override.
  • crawl.strategy_schedule — the BFS/DFS phase list.
  • llm.model (default qwen3-4b-instruct), llm.batch_size, llm.device, llm.max_retries (default 3), llm.constrained_decoding (outlines / instructor / none, default outlines when the backend supports it), llm.debug_log (bool, default true in dev / false in prod).
  • embedding.model, embedding.device, embedding.batch_size.
  • diffusion_map.alpha, diffusion_map.t, diffusion_map.k, diffusion_map.knn_k, diffusion_map.semantic_threshold, diffusion_map.claim_threshold, diffusion_map.eigensolver (cupy_eigsh / scipy_eigsh / lobpcg), diffusion_map.edges (inclusion list), diffusion_map.edge_weights.
  • diffusion_map.coloring.hue_eigenvector (default 2), diffusion_map.coloring.sat_eigenvector (default 3), diffusion_map.coloring.lightness_source (default quality_score), diffusion_map.coloring.colormap (default hsluv).
  • diffusion_map.labels.enabled (bool, default true), diffusion_map.labels.source (grid_tfidf default / hdbscan), diffusion_map.labels.window_size (grid bin count), diffusion_map.labels.min_bin_docs, diffusion_map.labels.refine_with_llm (bool).
  • diffusion_map.drift_threshold (default 0.15), diffusion_map.drift_sample_size (default 1000).
  • traversal.k, traversal.weights, traversal.primary_source_only.
  • routes.n_routes, routes.max_hops, routes.edge_types (strict / loose), routes.cache_capacity.
  • fts.passages_indexed (bool, default false) — when true, passages_fts is populated alongside documents_fts (§9).
  • storage.data_dir, storage.memory_cap_gb.
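The loading pattern can be sketched as below. Only the routes.* sub-model is shown, and the defaults are taken from the key list above; the full AppConfig nests one sub-model per key group.

```python
import json
from pydantic import BaseModel, ConfigDict

class RoutesConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")
    n_routes: int = 3
    max_hops: int = 8
    edge_types: str = "strict"      # "strict" / "loose"
    cache_capacity: int = 256

class AppConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")
    routes: RoutesConfig = RoutesConfig()

def load_config(text: str) -> AppConfig:
    """Parse config.json text; a typo'd key (max_hop vs. max_hops)
    raises at boot instead of silently defaulting."""
    return AppConfig.model_validate(json.loads(text))
```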

Removed from v5.0: meilisearch.passages_indexed → fts.passages_indexed (§9). Removed from v4.x: diffusion_map.clustering.* — discrete clustering is no longer a first-class step.

Defaults ship for two profiles (workstation, gpu_server); a profile selects a starting set, and individual keys can be overridden. The workstation profile targets a Mac Mini M4 with 24 GB RAM; the gpu_server profile targets a Debian server with 5× 3090 GPUs. Storage does not need to be monitored.


14. What Was Removed and Why

  • IPTC + OpenAlex taxonomy, SPARQL loader, taxonomy Qdrant collection, taxonomy_nodes / taxonomy_membership tables → data-driven regions in diffusion coordinates.
  • Discrete clustering on diffusion coordinates (HDBSCAN on dim_1..dim_k as a partition), clusters / cluster_membership as first-class schema, cluster_id as a document attribute → continuous eigenvector-based coloring + optional regions annotation table. HDBSCAN remains as an optional utility inside the labeling pass (§8.3), not a schema-level partition.
  • relations.hypergraph generic edge store → dedicated typed edge tables.
  • UMAP on raw embeddings → diffusion-map eigendecomposition.
  • Custom AsyncFetcher, ExtractPool, URL queue, Bloom filter dedup → crawl4ai.
  • relations.extract / trafilatura routing → crawl4ai directly.
  • feeds table and RSS polling → out of scope; seeds cover entry points.
  • Category theory, hypergraph formalism, flow matching → diffusion geometry + deterministic traversal engine.
  • Taxonomy-browser frontend view → atlas view's continuous coloring + regional labels.
  • Three-phase A/B/C pipeline taxonomy → two-step fill-database / render-UI loop.
  • difficulty_match in traversal scoring → redundant with diffusion distance.
  • cluster_id as a Meilisearch filter → polygon lasso / k-NN radius on the atlas.
  • Single Yen-only route output → shortest_simple_paths generator feeding greedy-diversified top-N + simulated forward-pass reasons.

v5.1 additional drops:

  • Meilisearch as an external FTS process → SQLite FTS5 virtual tables in the same DB file. One less process to supervise, no cross-store consistency to maintain, FTS queries compose as ordinary SQL joins.
  • Display-only and redundant documents columns: description, stance, main_arguments, key_entities, editorial_reason, reading_time_min, language, head_fingerprint, render_mode (kept on domains only), is_primary_source, publication_date (kept publication_year). Either decorative or derivable from columns that remain; 10 fewer columns and one fewer LLM-schema branch.
  • 14 per-field *_source provenance columns → single enrichment_model_version integer on documents. One row-level "which pass wrote this" signal covers every LLM-filled column at once; re-enrichment on model upgrade is WHERE enrichment_model_version < current.
  • authors and authorship tables → raw author string column on documents. Author disambiguation is a V2 concern once cross-document author analytics become necessary; until then the string is fine for display and grouping.
  • anchors table → dropped as a diffusion-map edge-weight input in V1. If revived, it will live as a read-only Parquet artifact alongside the Marginalia dump, not as a hot-path SQL table.
  • history and notes tables → deferred to V2. V1 is a single-session reading tool; stars alone covers the "come back to this later" primitive.
  • meta and pipeline_progress tables → Alembic tracks schema version; resume state is implicit in NULL fields and enrichment_model_version. Two fewer tables to keep consistent.
  • llm_calls as a SQL table → rotating logs/llm_calls.jsonl. No schema migration when the log shape evolves, no write contention with the main DB, and jq / pandas are better debugging tools than SQL for a raw-response log.
  • Llama-3-8B-class default enrichment model → Qwen 3 4B Instruct — the strongest open-weight model in the 3–4B range for structured extraction as of early 2026, with clear throughput wins on constrained hardware. The 8B slot is retained as a config option.
  • Hand-parsed LLM JSON and hand-maintained API schemas → Pydantic models at every boundary (LLM I/O via Outlines + instructor, HTTP API via FastAPI, config loading, CrawlResult → row normalization). Typos and shape mismatches fail loudly at the boundary instead of silently degrading downstream.

15. What Was Kept

  • SQLite with WAL mode as the single structural store.
  • bge-small-en-v1.5 as the embedding model.
  • Qdrant for vector search (per-doc, per-passage, per-claim collections).
  • SQLite FTS5 for full-text search — same DB file as the source rows; no separate process.
  • deck.gl (via pydeck) for GPU-accelerated map rendering.
  • A local LLM (Qwen 3 4B Instruct by default; 8B-class available via config) for enrichment, run as a separate consumer.
  • Stars as the sole V1 user-interaction primitive (history and notes deferred to V2).
  • enrichment_model_version as the single provenance signal on documents, replacing the 14-column *_source stack from v5.0.
  • Crash-safe checkpoints via crawl4ai resume state.
  • Pydantic at every I/O boundary — LLM structured output (Outlines for constrained decoding / instructor as prompt-level fallback), config loading with extra="forbid", FastAPI request/response schemas, and CrawlResult → DocumentRecord normalization.
  • SQLModel for application-layer DB access (ORM with type hints; same class validates and persists); raw executemany retained on the bulk-ingestion path.
  • Alembic for forward-only, idempotent schema migrations — no hand-written meta table.
  • Datasette + a Jupyter notebook + a doctor.py CLI as the DB inspection stack.
  • HDBSCAN, as an optional region-finding utility inside the labeling pass (§8.3), not a first-class partition. Available via diffusion_map.labels.source = "hdbscan".
  • Yen-style K-shortest-paths via networkx.shortest_simple_paths, as the candidate pool feeding greedy diversification (§10.6).