**Canonical source.** This page is a one-to-one mirror of `README.md` at the repository root. Edit that file, not this one.

# Infolake (Truth Atlas)
Infolake / Truth Atlas is an end-to-end pipeline for turning a seed list of URLs into a browsable 2-D map of the web, coloured by topic and connected by citation / semantic edges. It crawls, enriches with a local LLM, embeds, builds a diffusion-map layout, and serves everything through a FastAPI backend and a React + deck.gl frontend.
The system is designed for a single GPU-accelerated workstation (tested on 5× RTX 3090 with CUDA 12.8) and is fully containerised via Docker Compose. vLLM runs with all GPUs; Qdrant holds the vector caches; SQLite + FTS5 is the structural store.
## Topology
```mermaid
flowchart LR
  subgraph host [Host, GPUs 0..4]
    subgraph compose [docker compose]
      frontend[frontend<br/>nginx :8080]
      backend[backend<br/>FastAPI :8000]
      worker[worker<br/>enrich-loop]
      vllm["vllm<br/>OpenAI API :8765"]
      qdrant[qdrant :6333]
      prom[prometheus<br/>profile=observability]
      dcgm[dcgm-exporter<br/>profile=observability]
      graf[grafana :3000<br/>profile=observability]
    end
    gpus[["NVIDIA GPUs"]]
  end
  frontend -->|/api/*| backend
  backend -->|HTTP| qdrant
  backend -->|HTTP| vllm
  worker -->|HTTP| vllm
  worker -->|HTTP| qdrant
  worker -->|sqlite| vol[("./data bind mount")]
  backend --> vol
  vllm -->|"nvidia runtime"| gpus
  dcgm -->|NVML| gpus
  prom --> backend
  prom --> dcgm
  graf --> prom
```
- Frontend — React 18 + Vite + deck.gl atlas view, shipped as a static bundle served by nginx.
- Backend — FastAPI at `/api/*` (`/api/health`, `/api/ready`, `/api/map-data`, `/api/docs`, `/api/candidates`, `/api/routes`, `/api/stars`, …).
- Worker — persistent `infolake-services enrich-loop`; fills document summaries, passage claims, embeddings.
- vLLM — OpenAI-compatible endpoint; all LLM calls go through `Outlines` (constrained decoding) or `instructor` (fallback).
- Qdrant — vector store for document / passage / claim embeddings.
- Observability (optional) — Prometheus scrapes the backend's `/api/metrics` and `dcgm-exporter` for per-GPU util / VRAM / temperature; Grafana ships a pre-provisioned dashboard.
## Prerequisites

- Docker Engine + Compose v2 (`docker compose version` ≥ v2).
- NVIDIA Container Toolkit so `vllm` and `worker` can reserve GPUs (`driver: nvidia`, `capabilities: [gpu]`).
- A local copy of the LLM weights under `models/` (the compose file bind-mounts `./models:/models:ro`). Default path: `models/qwen3-4b-instruct`.
- For source-mode development: Python 3.12 (the conda env `infolake` works) and Node 20 for the frontend.
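A quick preflight for the first two items (a sketch; the plain `ubuntu` image is enough because the toolkit injects `nvidia-smi` into GPU-enabled containers):

```bash
docker compose version                        # should report Compose v2.x
docker run --rm --gpus all ubuntu nvidia-smi  # should print the GPU table
```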
## Quick start (containerised)

```bash
# 1. Build images (first time only; cached afterwards).
make build

# 2. Bring the full stack up.
make up       # qdrant + vllm + worker + backend + frontend
# or, with observability:
make up-obs   # also starts prometheus + grafana + dcgm-exporter

# 3. Sanity check.
make doctor   # JSON health report across config, DB, GPUs, services
make ps       # docker compose ps

# 4. Open the UI.
#    http://127.0.0.1:8080           (frontend, served by nginx)
#    http://127.0.0.1:8000/api/docs  (backend OpenAPI)
#    http://127.0.0.1:3000           (grafana; admin / $GRAFANA_ADMIN_PASSWORD, default "admin")

# Tail logs from every container (Ctrl-C to stop).
make logs

# Teardown; -v also wipes the Qdrant volume.
make down
make down-wipe
```
Per-service container operations use `infolake-services` (or its Makefile alias):

```bash
infolake-services up --only qdrant,vllm
infolake-services restart --only worker
infolake-services logs vllm --tail 200 -f
infolake-services status --json
infolake-services build --pull --no-cache
```
## Source install (no containers)

Useful for running tests, executing one stage at a time on a dev box, or iterating on the pipeline without rebuilding images.
```bash
# Create / refresh the conda env.
conda env update -f environment.yml --prune

# Install infolake + dev + GPU extras in editable mode.
conda run -n infolake pip install -e ".[dev,gpu,obs]"

# Run the CLIs (console scripts declared in pyproject.toml).
conda run -n infolake infolake-doctor --json
conda run -n infolake infolake-backend --host 127.0.0.1 --port 8000
conda run -n infolake infolake-services enrich-loop --poll-interval-seconds 10
```
All CLIs (`infolake-doctor`, `infolake-services`, `infolake-backend`, `infolake-pipeline`, `infolake-mapping`, `infolake-inspect`, and the dispatcher `infolake <verb>`) accept `--help`.
## Running each part

### Database (SQLite + FTS5 + Alembic)

```bash
# Initialize / verify schema + FTS virtual tables.
conda run -n infolake python -m infolake.cli.init_database

# Health report (config, db, services, GPUs, runtime).
conda run -n infolake infolake-doctor --json
conda run -n infolake infolake-doctor --watch 5   # live view during warmup

# Data-quality + schema inspection report.
conda run -n infolake infolake-inspect

# Forward-only Alembic migrations.
conda run -n infolake alembic upgrade head
```
### Crawling

```bash
# Single session (containerised path; runs inside the worker container).
docker compose exec worker infolake-services crawl \
  --session-id demo_$(date +%s) --seeds-path /app/seeds.txt

# Single session (source path).
conda run -n infolake infolake-services crawl \
  --session-id demo_$(date +%s) --seeds-path seeds.txt
```
Checkpoints land in `checkpoints/<session_id>.json`; a crawl can be resumed by re-running the command with the same `--session-id`. Deny/allow domains and URL-pattern filters live under `crawl.filters` in the config.
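As a sketch, such a block in `config/local.json` might look like this (the key names come from the configuration table below; the domains and patterns are placeholders):

```json
{
  "crawl": {
    "filters": {
      "deny_domains": ["facebook.com", "twitter.com"],
      "blacklist_domains": ["ads.example.net"],
      "exclude_patterns": ["\\?page=\\d+$"]
    }
  }
}
```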
### Enrichment

```bash
# One-shot document + passage pass.
conda run -n infolake infolake-services enrich

# Persistent loop (the worker container runs this as its CMD).
conda run -n infolake infolake-services enrich-loop --poll-interval-seconds 10

# Smoke-test a single URL/text pair against a temp DB.
conda run -n infolake infolake-services smoke \
  --sample-url "https://example.com" --sample-text "Sample content" --with-embeddings
```
Enrichment resume state is implicit: any row with `editorial_verdict IS NULL` or a stale `enrichment_model_version` gets re-picked. Bump `enrichment.model_version` in the config to force a re-enrichment pass.
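For example, a one-key override in `config/local.json` (a sketch, assuming host overrides merge over the base file):

```json
{
  "enrichment": {
    "model_version": 2
  }
}
```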
### End-to-end pipeline

```bash
# Runs the stages listed in pipeline.stages (from config) in order.
conda run -n infolake infolake-pipeline \
  --session-id nightly_$(date +%s) --seeds-path seeds.txt
```
The stage list is resolved through the `infolake.pipeline_stages` entry-point group (see Extending), so swapping one stage for another is a config edit, not a code edit.
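For example, a third-party package could register a hypothetical stage the same way the Extending section registers an LLM backend (the package and class names here are illustrative):

```toml
[project.entry-points."infolake.pipeline_stages"]
dedupe = "my_pkg.stages:DedupeStage"
```

After installing the package, adding `"dedupe"` to `pipeline.stages` in the config is all that's needed.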
### Mapping (diffusion-map layout)

```bash
# Full recompute from the current corpus.
conda run -n infolake infolake-mapping --full

# Incremental update for newly-added documents.
conda run -n infolake infolake-mapping --incremental
```
Artifacts (eigenvectors, UMAP coordinates, region labels) are written under `data/mapping/` and consumed by `/api/map-data`.
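To spot-check the output end to end once the mapping stage has run (assuming the backend is up on its default port):

```bash
curl -s http://127.0.0.1:8000/api/map-data | head -c 400
```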
### Backend (FastAPI)

```bash
# Dev server with autoreload.
conda run -n infolake infolake-backend --host 127.0.0.1 --port 8000 --reload

# Production (what the backend container runs):
uvicorn infolake.backend.app:app --host 0.0.0.0 --port 8000 \
  --workers 2 --loop uvloop --http httptools
```
`/api/health` is a liveness probe; `/api/ready` returns a distilled `SystemReport` (config + DB + per-service probes) and flips to 503 on any failure.
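Since readiness is surfaced as an HTTP status, `curl -f` can gate scripts on it directly; a minimal sketch, assuming the default port:

```bash
# Poll until the backend reports ready (curl -f exits non-zero on the 503).
until curl -fsS http://127.0.0.1:8000/api/ready > /dev/null; do
  echo "backend not ready yet..."
  sleep 5
done
```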
### Frontend (React + Vite)

```bash
# Dev with hot reload — Vite proxies /api/* to localhost:8000.
cd frontend
npm ci
npm run dev -- --host 0.0.0.0 --port 5173

# Production build (what the frontend container bakes).
npm run build
```
The containerised frontend is always the last-built bundle; for live UI work, enable the `dev` profile:
```bash
docker compose --profile dev up -d frontend-dev
# http://127.0.0.1:5173
```
## Configuration

Configuration is a single Pydantic-validated JSON file: `config/default.json`. Every key is declared in `src/infolake/core/schemas.py` with `model_config = ConfigDict(extra="forbid")`, so a typo or stale key fails loudly at boot.
- Base file: `config/default.json` (committed).
- Host override: `config/local.json` (gitignored; see `config/local.example.json`).
- Env-var override: `INFOLAKE_CONFIG_PATH=/path/to/config.json`.
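A minimal `config/local.json` might override just a couple of keys from the table below (illustrative values; any key must already exist in the schema because of `extra="forbid"`):

```json
{
  "storage": { "memory_cap_gb": 8 },
  "logging": { "level": "INFO" }
}
```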
### How to tweak the common knobs

| Section | Key | Default | When to change it |
|---|---|---|---|
| `storage` | `memory_cap_gb` | `16` | Lower on smaller hosts; applied as SQLite `cache_size`/`mmap_size`. |
| `database` | `sqlite_path` | `data/infolake.db` | Point at a different corpus or staging DB. |
| `crawl` | `strategy_schedule` | BFS 30/5000 → DFS 30/10000 | Shape the crawl frontier (BFS first for breadth, DFS for depth). |
| `crawl` | `memory_threshold_percent` | `90.0` | Drop if the crawler OOMs; it pauses dispatch above this RAM %. |
| `crawl` | `max_session_permit` | `32` | Concurrent in-flight pages; bounded by `crawl4ai.concurrency`. |
| `crawl.filters` | `deny_domains`, `blacklist_domains`, `exclude_patterns` | — | Keep the crawl off social/ad domains + pagination spam. |
| `crawl.rate_limits` | `default_domain_delay_seconds` | `1.0` | Be nicer to origin servers. |
| `crawl4ai` | `concurrency` | `32` | Raise for fast-network / small-page seed lists. |
| `crawl4ai` | `browser_type`, `headless`, `check_robots_txt` | `chromium`, `true`, `true` | Toggle for debugging render issues. |
| `llm` | `model_id`, `model_path` | `Qwen/Qwen3-4B-Instruct`, `models/qwen3-4b-instruct` | Swap model; path is read inside the vllm container. |
| `llm` | `max_input_tokens`, `max_new_tokens` | `3072`, `1024` | Must stay under `vllm.max_model_len`. Enrichment enforces the input side at prompt boundaries. |
| `llm` | `constrained_decoding` | `outlines` | `outlines` (grammar binding) → `instructor` (retry) → `none`. |
| `vllm` | `tensor_parallel_size` / `pipeline_parallel_size` / `data_parallel_size` | 1 / 1 / 5 | GPU layout. World size = TP × PP × DP; start with DP = n_gpus for throughput. |
| `vllm` | `max_model_len` | `32768` | Lower first if startup OOMs; raise only when workloads truly need longer context. |
| `vllm` | `max_num_batched_tokens` / `max_num_seqs` | 16384 / 32 | Throughput vs latency. Raise batched tokens when GPUs are idle; lower seqs for long-context traffic. |
| `vllm` | `gpu_memory_utilization` | `0.9` | Reduce if warmup OOMs; increase cautiously for more KV-cache capacity. |
| `enrichment` | `batch_size`, `llm_concurrency` | 32, 12 | Upstream throughput knobs for the worker. |
| `enrichment` | `model_version` | `1` | Bump to force re-enrichment across the corpus. |
| `embedding` | `model`, `device`, `batch_size` | BGE-small, `cpu`, 256 | Switch to `cuda` and a smaller batch if CPU becomes the bottleneck. |
| `qdrant` | `url` | `http://127.0.0.1:6333` | Point at an alternate Qdrant. In compose it becomes `http://qdrant:6333`. |
| `diffusion_map` | `alpha`, `t`, `k`, `knn_k` | 1.0, 1.0, 20, 20 | Diffusion-map hyperparameters; see docs/spec.md §8. |
| `diffusion_map` | `edges`, `edge_weights` | citations / co_citations / semantic / claim_overlap / domain_links | Turn off a source by removing its name from `edges`. |
| `diffusion_map.coloring` | `hue_eigenvector`, `sat_eigenvector`, `lightness_eigenvector` | 4, 5, 3 | Rebalance the colour space if clusters collapse. |
| `traversal` | `weights` | info_gain 0.40 / quality 0.25 / diffusion 0.20 / citation 0.15 | Rank candidate next-hops for the route service. |
| `routes` | `n_routes`, `max_hops`, `cache_capacity` | 3, 8, 256 | `/api/routes` behaviour. |
| `logging` | `level` | `WARNING` | Raise to INFO/DEBUG for noisier triage; env override: `INFOLAKE_LOG_LEVEL`. |
| `diagnostics.telemetry` | `enabled` | `false` | Flip on to mount `/api/metrics` and wire the OTel exporter (requires `[obs]` extras). |
| `pipeline` | `stages` | `["crawl","enrich_documents","enrich_passages","compute_mapping"]` | Reorder / drop / add stages. Names resolve through the `infolake.pipeline_stages` entry-point group. |
### vLLM throughput tuning, in order

- Set the GPU layout (`tensor_parallel_size` × `pipeline_parallel_size` × `data_parallel_size` must equal the number of available GPUs).
- Set `max_model_len` to the largest context you actually need.
- Push `max_num_batched_tokens` up while GPUs are underutilised; pull it back on latency spikes or OOM.
- Trim `max_num_seqs` for long-context traffic; raise it for many short requests.
- Nudge `gpu_memory_utilization` last.
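Put together, the table defaults describe the tested 5-GPU box (world size 1 × 1 × 5 = 5). As a config fragment, assuming the keys sit under a top-level `vllm` block:

```json
{
  "vllm": {
    "tensor_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "data_parallel_size": 5,
    "max_model_len": 32768,
    "max_num_batched_tokens": 16384,
    "max_num_seqs": 32,
    "gpu_memory_utilization": 0.9
  }
}
```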
## Diagnostics and observability

```bash
# Unified report: config + DB + GPUs (pynvml) + runtime (psutil) + service probes.
infolake-doctor --json
infolake-doctor --gpu-only
infolake-doctor --watch 5   # refresh loop, good for vLLM warmup

# Live Prometheus metrics (when diagnostics.telemetry.enabled = true).
curl -s http://127.0.0.1:8000/api/metrics | head

# Full observability stack (prometheus + grafana + dcgm-exporter).
make up-obs
#   http://127.0.0.1:9090   prometheus
#   http://127.0.0.1:3000   grafana (pre-provisioned Infolake dashboard)
#   http://127.0.0.1:9400   dcgm-exporter
```
Structured JSON logs are emitted on stdout when running inside a container (`INFOLAKE_LOG_FORMAT=json`); host CLI runs keep the human-readable format.
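Both knobs can be combined on a host run to mimic the container output while debugging (a sketch):

```bash
INFOLAKE_LOG_FORMAT=json INFOLAKE_LOG_LEVEL=DEBUG \
  conda run -n infolake infolake-backend --host 127.0.0.1 --port 8000
```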
## Project layout

```
src/infolake/
  core/         # config, schemas, DB handle, repositories/, llm/, graph, container
  backend/      # FastAPI app + /api routers
  pipelines/
    crawling/   # orchestrator + scheduler + fetcher + writer + stage adapter
    enrichment/ # consumer + per-stage adapters
    mapping/    # diffusion map + projection + coloring + labels + stage adapter
  services/     # local-process managers (VLLMManager, QdrantService, Worker)
  diagnostics/  # gpu, runtime, services, telemetry, report
  extensions/   # Protocol interfaces + entry-point Registry
  cli/          # console scripts (doctor, services, backend, pipeline, inspect, ...)
  db/           # bootstrap, fts, maintenance, alembic/
docker/         # Dockerfiles, nginx conf, prometheus.yml, grafana provisioning
compose.yml     # production topology (+ observability profile)
compose.override.yml  # dev: bind-mount src, --reload, optional Vite dev server
config/
  default.json        # committed base config
  local.example.json  # per-host override template (gitignored as config/local.json)
frontend/       # React + Vite app, delivered by nginx in compose
tests/          # pytest suite; conftest points at a temp fixture config
```
## Testing

```bash
# Everything.
make test

# Or, manually:
conda run -n infolake pytest --cov --cov-report=term

# Single file.
conda run -n infolake pytest tests/test_mapping_pipeline.py -v
```
`tests/conftest.py` exports `INFOLAKE_CONFIG_PATH` to a temp fixture JSON, so tests never touch the production DB or `config/default.json`.
## Extending

Every cross-cutting capability is a plugin seam. Third-party packages register implementations via `pyproject.toml` entry points; infolake discovers them through `importlib.metadata.entry_points` at first use.
| Group | Contract | Example (in-repo) |
|---|---|---|
| `infolake.llm_backends` | `LLMBackend` | `infolake.core.llm.outlines:OutlinesLLMClient` |
| `infolake.embedders` | `Embedder` | `infolake.core.embedding:TextEmbeddingClient` |
| `infolake.fetchers` | `Fetcher` | `infolake.core.crawl4ai_client:Crawl4AIClient` |
| `infolake.pipeline_stages` | `PipelineStage` | `infolake.pipelines.crawling.stage:CrawlStage` |
| `infolake.projections` | `Projector` | `infolake.pipelines.mapping.project:UMAPProjector` |
| `infolake.graph_edges` | `GraphEdgeSource` | (names listed in `diffusion_map.edges`) |
Example third-party registration:

```toml
[project.entry-points."infolake.llm_backends"]
vertex_ai = "my_pkg.llm:VertexAIBackend"
```
Then switch `llm.constrained_decoding = "vertex_ai"` (or wire it through the `Registry.get(...)` call) — no edits to infolake required.
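Discovery itself is plain stdlib; a minimal sketch of the lookup side (not the actual `Registry` code, which lives under `extensions/`):

```python
from importlib.metadata import entry_points

# Collect every backend registered under the group, then load one by name.
backends = {ep.name: ep for ep in entry_points(group="infolake.llm_backends")}
backend_cls = backends["vertex_ai"].load()  # resolves "my_pkg.llm:VertexAIBackend"
```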
## Developer loop

```bash
make lint       # ruff check
make fmt        # ruff format + autofix
make typecheck  # mypy on core + backend
make check      # lint + typecheck + test
make precommit  # run every pre-commit hook against all files
```
CI mirrors this (see `.github/workflows/ci.yml`): `lint-python`, `test-python`, `frontend`, `docker-smoke` (builds backend + frontend images, caches GPU images without running).
## Deploy-host sync workflow

This repo is developed on one machine and deployed from another (the Mac mini that fronts `compose.public.yml` + cloudflared). To keep `git pull` on the deploy host boring — no conflicts, no accidental pushes of host-only tweaks — split files into three buckets and use the rules already baked into `.gitignore`.
### File buckets

| Bucket | Examples | How it's handled |
|---|---|---|
| Shared, committed | `src/`, `frontend/`, `config/default.json`, `compose.yml`, `docker/backend.Dockerfile` | Edit on the dev machine; pulled read-only on the deploy host. |
| Deploy-only, never tracked | `docker/frontend.public.Dockerfile`, `docker/nginx/public.conf`, `config/public.example.json`, `compose.public.override.yml`, `.env.public` | Listed in `.gitignore`; live only on the deploy host. |
| Committed template, locally diverged | `compose.public.yml`, `config/config.json`, `.env.public.example` | Tracked upstream as a baseline; host tweaks stay local via `git update-index --skip-worktree <file>`. |
### One-time setup on the deploy host

Run once per fresh clone so local tweaks to tracked templates are invisible to `git status` / `git pull`:

```bash
git update-index --skip-worktree compose.public.yml config/config.json

# Optional: list everything currently skip-worktree'd
git ls-files -v | awk '/^S/ {print $2}'
```
To undo the flag (e.g. to pull a genuinely new upstream version of `compose.public.yml`), run `git update-index --no-skip-worktree <file>`, `git stash`, `git pull`, `git stash pop`, then re-set the flag.
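Spelled out for one file (the same commands, in order):

```bash
git update-index --no-skip-worktree compose.public.yml
git stash
git pull --ff-only origin main
git stash pop
git update-index --skip-worktree compose.public.yml
```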
### Routine pull (deploy host)

```bash
# 1. Sanity check: no stray commits on the deploy host.
git status
git log --oneline origin/main..HEAD   # should print nothing

# 2. Fetch + fast-forward.
git fetch origin
git pull --ff-only origin main

# 3. Rebuild only what moved.
make build   # (or: docker compose -f compose.public.yml build)
docker compose -f compose.public.yml up -d
```
If step 1 shows local commits on the deploy host, you probably edited a shared file by mistake. Move the change to a deploy-only file if possible; otherwise `git reset --soft origin/main` to move the commit back to the index, split out the deploy-only bits into gitignored files, and discard the rest.
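As a concrete sketch (the shared file name is a placeholder):

```bash
git reset --soft origin/main       # move the stray commit back to the index
git restore --staged compose.yml   # unstage the shared-file edit
# move the deploy-only bits into a gitignored override file, then drop the rest
git restore compose.yml
```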
### Routine push (dev machine)

```bash
# Never git add anything under the "deploy-only" bucket above.
git status            # should only list shared files
git add -u            # or explicit paths
git commit -m "..."
git push origin main
```
### If you ever need to track a new deploy-only file

- Create it under one of the already-ignored paths (e.g. `docker/*.public.Dockerfile`, `docker/nginx/public.conf`, `config/public.example.json`), or
- Add its path/pattern to the "Deploy-machine-only files" block in `.gitignore`, commit, push, and keep editing it locally.
### Emergency: diverged with a local commit you don't need

```bash
# Save any working-tree edits first.
git stash -u

# Drop the local commit, align with cloud.
git fetch origin
git reset --hard origin/main

# Re-apply edits you still want.
git stash pop
```
## Further reading

- `CHANGELOG.md` — versioned history, starting with the v5.2.0 refactor.
- `spec.md` — authoritative schema + pipeline + coloring spec.
- `tip.md` — rolling bugs-and-solutions log referenced by the cursor rules.
- `CLAUDE.md` — agent-facing project context.