
Canonical source

This page is a one-to-one mirror of README.md at the repository root. Edit that file, not this one.


Infolake (Truth Atlas)

Infolake / Truth Atlas is an end-to-end pipeline for turning a seed list of URLs into a browsable 2-D map of the web, coloured by topic and connected by citation / semantic edges. It crawls, enriches with a local LLM, embeds, builds a diffusion-map layout, and serves everything through a FastAPI backend and a React + deck.gl frontend.

The system is designed for a single GPU-accelerated workstation (tested on 5× RTX 3090 with CUDA 12.8) and is fully containerised via Docker Compose. vLLM runs with all GPUs; Qdrant holds the vector caches; SQLite + FTS5 is the structural store.


Topology

```mermaid
flowchart LR
  subgraph host [Host, GPUs 0..4]
    subgraph compose [docker compose]
      frontend[frontend<br/>nginx :8080]
      backend[backend<br/>FastAPI :8000]
      worker[worker<br/>enrich-loop]
      vllm["vllm<br/>OpenAI API :8765"]
      qdrant[qdrant :6333]
      prom[prometheus<br/>profile=observability]
      dcgm[dcgm-exporter<br/>profile=observability]
      graf[grafana :3000<br/>profile=observability]
    end
    gpus[["NVIDIA GPUs"]]
  end
  frontend -->|/api/*| backend
  backend  -->|HTTP| qdrant
  backend  -->|HTTP| vllm
  worker   -->|HTTP| vllm
  worker   -->|HTTP| qdrant
  worker   -->|sqlite| vol[("./data bind mount")]
  backend  --> vol
  vllm     -->|"nvidia runtime"| gpus
  dcgm     -->|NVML| gpus
  prom     --> backend
  prom     --> dcgm
  graf     --> prom
```
  • Frontend — React 18 + Vite + deck.gl atlas view, shipped as a static bundle served by nginx.
  • Backend — FastAPI at /api/* (/api/health, /api/ready, /api/map-data, /api/docs, /api/candidates, /api/routes, /api/stars, …).
  • Worker — persistent infolake-services enrich-loop; fills document summaries, passage claims, embeddings.
  • vLLM — OpenAI-compatible endpoint; all LLM calls go through Outlines (constrained decoding) or instructor (fallback).
  • Qdrant — vector store for document / passage / claim embeddings.
  • Observability (optional) — Prometheus scrapes backend /api/metrics and dcgm-exporter for per-GPU util / VRAM / temperature; Grafana ships a pre-provisioned dashboard.

Prerequisites

  • Docker Engine + Compose v2 (docker compose version ≥ v2).
  • NVIDIA Container Toolkit so vllm and worker can reserve GPUs (driver: nvidia, capabilities: [gpu]).
  • A local copy of the LLM weights under models/ (the compose file bind-mounts ./models:/models:ro). Default path: models/qwen3-4b-instruct. One way to fetch them is sketched after this list.
  • For source-mode development: Python 3.12 (conda env infolake works) and Node 20 for the frontend.
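
If the weights are not staged yet, one way to fetch the default model is via the Hugging Face CLI (a sketch assuming the weights live on the Hugging Face Hub under the default llm.model_id and that huggingface_hub is installed; adjust the repo id if you swap models):

```bash
# Assumption: weights come from the Hugging Face Hub and huggingface_hub's CLI
# is available (pip install -U "huggingface_hub[cli]").
huggingface-cli download Qwen/Qwen3-4B-Instruct \
  --local-dir models/qwen3-4b-instruct
```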

Quick start (containerised)

```bash
# 1. Build images (first time only; cached afterwards).
make build

# 2. Bring the full stack up...
make up       # qdrant + vllm + worker + backend + frontend

# ...or, with observability:
make up-obs   # also starts prometheus + grafana + dcgm-exporter

# 3. Sanity check.
make doctor   # JSON health report across config, DB, GPUs, services
make ps       # docker compose ps

# 4. Open the UI.
#    http://127.0.0.1:8080           frontend, served by nginx
#    http://127.0.0.1:8000/api/docs  backend OpenAPI
#    http://127.0.0.1:3000           grafana; admin / $GRAFANA_ADMIN_PASSWORD, default "admin"

# Tail logs from every container (Ctrl-C to stop).
make logs

# Teardown; down-wipe adds -v, which also wipes the Qdrant volume.
make down
make down-wipe
```

Per-service container operations use infolake-services (or its Makefile alias):

```bash
infolake-services up --only qdrant,vllm
infolake-services restart --only worker
infolake-services logs vllm --tail 200 -f
infolake-services status --json
infolake-services build --pull --no-cache
```


Source install (no containers)

Useful for running tests, running one stage at a time on a dev box, or iterating on the pipeline without rebuilding images.

```bash
# Create / refresh the conda env.
conda env update -f environment.yml --prune

# Install infolake + dev + GPU extras in editable mode.
conda run -n infolake pip install -e ".[dev,gpu,obs]"

# Run the CLIs (console scripts declared in pyproject.toml).
conda run -n infolake infolake-doctor --json
conda run -n infolake infolake-backend --host 127.0.0.1 --port 8000
conda run -n infolake infolake-services enrich-loop --poll-interval-seconds 10
```

All CLIs (infolake-doctor, infolake-services, infolake-backend, infolake-pipeline, infolake-mapping, infolake-inspect, and the dispatcher infolake <verb>) accept --help.


Running each part

Database (SQLite + FTS5 + Alembic)

```bash
# Initialize / verify schema + FTS virtual tables.
conda run -n infolake python -m infolake.cli.init_database

# Health report (config, db, services, GPUs, runtime).
conda run -n infolake infolake-doctor --json
conda run -n infolake infolake-doctor --watch 5   # live view during warmup

# Data-quality + schema inspection report.
conda run -n infolake infolake-inspect

# Forward-only Alembic migrations.
conda run -n infolake alembic upgrade head
```

Crawling

```bash
# Single session (containerised path; runs inside the worker container).
docker compose exec worker infolake-services crawl \
  --session-id demo_$(date +%s) --seeds-path /app/seeds.txt

# Single session (source path).
conda run -n infolake infolake-services crawl \
  --session-id demo_$(date +%s) --seeds-path seeds.txt
```

Checkpoints land in checkpoints/<session_id>.json; a crawl can be resumed by re-running the command with the same --session-id. Deny/allow domains and URL-pattern filters live under crawl.filters in the config.
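
Re-running the exact command after an interruption picks the session back up, for example (the session id is illustrative):

```bash
# First run is interrupted; the second resumes from checkpoints/demo_1700000000.json.
SESSION=demo_1700000000
conda run -n infolake infolake-services crawl \
  --session-id "$SESSION" --seeds-path seeds.txt   # interrupted mid-crawl
conda run -n infolake infolake-services crawl \
  --session-id "$SESSION" --seeds-path seeds.txt   # resumes from the checkpoint
```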

Enrichment

```bash
# One-shot document + passage pass.
conda run -n infolake infolake-services enrich

# Persistent loop (the worker container runs this as its CMD).
conda run -n infolake infolake-services enrich-loop --poll-interval-seconds 10

# Smoke-test a single URL/text pair against a temp DB.
conda run -n infolake infolake-services smoke \
  --sample-url "https://example.com" --sample-text "Sample content" --with-embeddings
```

Enrichment resume state is implicit: any row with editorial_verdict IS NULL or a stale enrichment_model_version gets re-picked. Bump enrichment.model_version in the config to force a re-enrichment pass.
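
To gauge how much work the next pass will pick up, you can query the DB directly. A sketch, assuming the rows live in a table named documents (the table name is a guess; the two columns come from the resume rule above, and 1 stands in for the current enrichment.model_version):

```bash
# Count rows the next enrichment pass would select. Table name "documents" is
# an assumption; adjust 1 to the configured enrichment.model_version.
sqlite3 data/infolake.db \
  "SELECT COUNT(*) FROM documents
   WHERE editorial_verdict IS NULL OR enrichment_model_version != 1;"
```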

End-to-end pipeline

```bash
# Runs the stages listed in pipeline.stages (from config) in order.
conda run -n infolake infolake-pipeline \
  --session-id nightly_$(date +%s) --seeds-path seeds.txt
```

The stage list is resolved through the infolake.pipeline_stages entry-point group (see Extending), so swapping one stage for another is a config edit, not a code edit.
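
For instance, a crawl-less re-run can be expressed entirely in config. A sketch, assuming config/local.json is merged over config/default.json as described under Configuration (note that cat > overwrites an existing override file):

```bash
# Drop the crawl stage and re-run enrichment + mapping only. Stage names must
# resolve through the infolake.pipeline_stages entry-point group.
cat > config/local.json <<'EOF'
{
  "pipeline": {
    "stages": ["enrich_documents", "enrich_passages", "compute_mapping"]
  }
}
EOF
```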

Mapping (diffusion-map layout)

```bash
# Full recompute from the current corpus.
conda run -n infolake infolake-mapping --full

# Incremental update for newly-added documents.
conda run -n infolake infolake-mapping --incremental
```

Artifacts (eigenvectors, UMAP coordinates, region labels) are written under data/mapping/ and consumed by /api/map-data.
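
A quick way to confirm the artifacts made it through to the backend (the response shape itself isn't documented here):

```bash
# Peek at the first few hundred bytes of the layout payload.
curl -s http://127.0.0.1:8000/api/map-data | head -c 400; echo
```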

Backend (FastAPI)

```bash
# Dev server with autoreload.
conda run -n infolake infolake-backend --host 127.0.0.1 --port 8000 --reload

# Production (what the backend container runs):
uvicorn infolake.backend.app:app --host 0.0.0.0 --port 8000 \
  --workers 2 --loop uvloop --http httptools
```

/api/health is a liveness probe; /api/ready returns a distilled SystemReport (config + DB + per-service probes) and flips to 503 on any failure.
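
In practice the two probes diverge during warmup, e.g. while vLLM is still loading weights:

```bash
# Liveness: 200 as soon as the backend process is up.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8000/api/health
# Readiness: 503 until every probe in the SystemReport passes.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8000/api/ready
```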

Frontend (React + Vite)

```bash
# Dev with hot reload — Vite proxies /api/* to localhost:8000.
cd frontend
npm ci
npm run dev -- --host 0.0.0.0 --port 5173

# Production build (what the frontend container bakes).
npm run build
```

The containerised frontend is always the last-built bundle; for live UI work, enable the dev profile:

```bash
docker compose --profile dev up -d frontend-dev
# http://127.0.0.1:5173
```


Configuration

Configuration is a single Pydantic-validated JSON file: config/default.json. Every key is declared in src/infolake/core/schemas.py with model_config = ConfigDict(extra="forbid"), so a typo or stale key fails loudly at boot.

  • Base file: config/default.json (committed).
  • Host override: config/local.json (gitignored; see config/local.example.json).
  • Env-var override: INFOLAKE_CONFIG_PATH=/path/to/config.json.
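
For one-off runs, the env-var override is the lightest touch, as in this sketch (the staging path is illustrative):

```bash
# Run the health report against an alternate config without touching
# config/local.json.
INFOLAKE_CONFIG_PATH=/tmp/staging.json conda run -n infolake infolake-doctor --json
```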

How to tweak the common knobs

| Section | Key | Default | When to change it |
|---|---|---|---|
| storage | memory_cap_gb | 16 | Lower on smaller hosts; applied as SQLite cache_size/mmap_size. |
| database | sqlite_path | data/infolake.db | Point at a different corpus or staging DB. |
| crawl | strategy_schedule | BFS 30/5000 → DFS 30/10000 | Shape the crawl frontier (BFS first for breadth, DFS for depth). |
| crawl | memory_threshold_percent | 90.0 | Drop if the crawler OOMs; it pauses dispatch above this RAM %. |
| crawl | max_session_permit | 32 | Concurrent in-flight pages; bounded by crawl4ai.concurrency. |
| crawl.filters | deny_domains, blacklist_domains, exclude_patterns | — | Keep the crawl off social/ad domains + pagination spam. |
| crawl.rate_limits | default_domain_delay_seconds | 1.0 | Be nicer to origin servers. |
| crawl4ai | concurrency | 32 | Raise for fast-network / small-page seed lists. |
| crawl4ai | browser_type, headless, check_robots_txt | chromium, true, true | Toggle for debugging render issues. |
| llm | model_id, model_path | Qwen/Qwen3-4B-Instruct, models/qwen3-4b-instruct | Swap models; the path is read inside the vllm container. |
| llm | max_input_tokens, max_new_tokens | 3072, 1024 | Must stay under vllm.max_model_len. Enrichment enforces the input side at prompt boundaries. |
| llm | constrained_decoding | outlines | outlines (grammar binding) → instructor (retry) → none. |
| vllm | tensor_parallel_size / pipeline_parallel_size / data_parallel_size | 1 / 1 / 5 | GPU layout. World size = TP × PP × DP; start with DP = n_gpus for throughput. |
| vllm | max_model_len | 32768 | Lower first if startup OOMs; raise only when workloads truly need longer context. |
| vllm | max_num_batched_tokens / max_num_seqs | 16384 / 32 | Throughput vs latency. Raise batched tokens when GPUs are idle; lower seqs for long-context traffic. |
| vllm | gpu_memory_utilization | 0.9 | Reduce if warmup OOMs; increase cautiously for more KV-cache capacity. |
| enrichment | batch_size, llm_concurrency | 32, 12 | Upstream throughput knobs for the worker. |
| enrichment | model_version | 1 | Bump to force re-enrichment across the corpus. |
| embedding | model, device, batch_size | BGE-small, cpu, 256 | Switch to cuda and a smaller batch if CPU becomes the bottleneck. |
| qdrant | url | http://127.0.0.1:6333 | Point at an alternate Qdrant. In compose it becomes http://qdrant:6333. |
| diffusion_map | alpha, t, k, knn_k | 1.0, 1.0, 20, 20 | Diffusion-map hyperparameters; see docs/spec.md §8. |
| diffusion_map | edges, edge_weights | citations / co_citations / semantic / claim_overlap / domain_links | Turn off a source by removing its name from edges. |
| diffusion_map.coloring | hue_eigenvector, sat_eigenvector, lightness_eigenvector | 4, 5, 3 | Rebalance the colour space if clusters collapse. |
| traversal | weights | info_gain 0.40 / quality 0.25 / diffusion 0.20 / citation 0.15 | Rank candidate next-hops for the route service. |
| routes | n_routes, max_hops, cache_capacity | 3, 8, 256 | /api/routes behaviour. |
| logging | level | WARNING | Raise to INFO/DEBUG for noisier triage; env override: INFOLAKE_LOG_LEVEL. |
| diagnostics.telemetry | enabled | false | Flip on to mount /api/metrics and wire the OTel exporter (requires [obs] extras). |
| pipeline | stages | ["crawl","enrich_documents","enrich_passages","compute_mapping"] | Reorder / drop / add stages. Names resolve through the infolake.pipeline_stages entry-point group. |

vLLM throughput tuning, in order

  1. Set the GPU layout (tensor_parallel_size × pipeline_parallel_size × data_parallel_size must equal the number of available GPUs).
  2. Set max_model_len to the largest context you actually need.
  3. Push max_num_batched_tokens up while GPUs are underutilised; pull it back on latency spikes or OOM.
  4. Trim max_num_seqs for long-context traffic; raise it for many short requests.
  5. Nudge gpu_memory_utilization last.
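
Put together, a hypothetical starting point for the 5-GPU default (values are illustrative, not recommendations, and cat > overwrites an existing config/local.json):

```bash
# Steps 1-5 above, expressed as one override: keep the DP=5 layout, trim the
# context window, and leave batching + memory at their defaults to tune later.
cat > config/local.json <<'EOF'
{
  "vllm": {
    "tensor_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "data_parallel_size": 5,
    "max_model_len": 16384,
    "max_num_batched_tokens": 16384,
    "max_num_seqs": 32,
    "gpu_memory_utilization": 0.9
  }
}
EOF
```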

Diagnostics and observability

```bash
# Unified report: config + DB + GPUs (pynvml) + runtime (psutil) + service probes.
infolake-doctor --json
infolake-doctor --gpu-only
infolake-doctor --watch 5   # refresh loop, good for vLLM warmup

# Live Prometheus metrics (when diagnostics.telemetry.enabled = true).
curl -s http://127.0.0.1:8000/api/metrics | head

# Full observability stack (prometheus + grafana + dcgm-exporter).
make up-obs
#   http://127.0.0.1:9090   prometheus
#   http://127.0.0.1:3000   grafana (pre-provisioned Infolake dashboard)
#   http://127.0.0.1:9400   dcgm-exporter
```

Structured JSON logs are emitted on stdout when running inside a container (INFOLAKE_LOG_FORMAT=json); host CLI runs keep the human-readable format.
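
The same JSON stream can be forced on the host; a sketch that pretty-prints each line with jq (assuming one JSON object per line):

```bash
# Non-JSON lines pass through untouched thanks to fromjson?.
INFOLAKE_LOG_FORMAT=json INFOLAKE_LOG_LEVEL=INFO \
  conda run -n infolake infolake-doctor 2>&1 | jq -R 'fromjson? // .'
```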


Project layout

```
src/infolake/
  core/         # config, schemas, DB handle, repositories/, llm/, graph, container
  backend/      # FastAPI app + /api routers
  pipelines/
    crawling/   # orchestrator + scheduler + fetcher + writer + stage adapter
    enrichment/ # consumer + per-stage adapters
    mapping/    # diffusion map + projection + coloring + labels + stage adapter
  services/     # local-process managers (VLLMManager, QdrantService, Worker)
  diagnostics/  # gpu, runtime, services, telemetry, report
  extensions/   # Protocol interfaces + entry-point Registry
  cli/          # console scripts (doctor, services, backend, pipeline, inspect, ...)
  db/           # bootstrap, fts, maintenance, alembic/

docker/               # Dockerfiles, nginx conf, prometheus.yml, grafana provisioning
compose.yml           # production topology (+ observability profile)
compose.override.yml  # dev: bind-mount src, --reload, optional Vite dev server

config/
  default.json        # committed base config
  local.example.json  # per-host override template (gitignored as config/local.json)

frontend/  # React + Vite app, delivered by nginx in compose
tests/     # pytest suite; conftest points at a temp fixture config
```


Testing

```bash
# Everything.
make test

# Or, manually:
conda run -n infolake pytest --cov --cov-report=term

# Single file.
conda run -n infolake pytest tests/test_mapping_pipeline.py -v
```

tests/conftest.py exports INFOLAKE_CONFIG_PATH to a temp fixture JSON, so tests never touch the production DB or config/default.json.


Extending

Every cross-cutting capability is a plugin seam. Third-party packages register implementations via pyproject.toml entry points; infolake discovers them through importlib.metadata.entry_points at first use.

| Group | Contract | Example (in-repo) |
|---|---|---|
| infolake.llm_backends | LLMBackend | infolake.core.llm.outlines:OutlinesLLMClient |
| infolake.embedders | Embedder | infolake.core.embedding:TextEmbeddingClient |
| infolake.fetchers | Fetcher | infolake.core.crawl4ai_client:Crawl4AIClient |
| infolake.pipeline_stages | PipelineStage | infolake.pipelines.crawling.stage:CrawlStage |
| infolake.projections | Projector | infolake.pipelines.mapping.project:UMAPProjector |
| infolake.graph_edges | GraphEdgeSource | (names listed in diffusion_map.edges) |

Example third-party registration:

```toml
[project.entry-points."infolake.llm_backends"]
vertex_ai = "my_pkg.llm:VertexAIBackend"
```

Then switch llm.constrained_decoding = "vertex_ai" (or wire it through the Registry.get(...) call) — no edits to infolake required.
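
To confirm the plugin is discoverable, list the group from the environment infolake runs in (importlib.metadata is the discovery mechanism named above):

```bash
# Should print the registered backend names, e.g. ['outlines', 'vertex_ai'].
python -c "from importlib.metadata import entry_points; print([ep.name for ep in entry_points(group='infolake.llm_backends')])"
```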


Developer loop

```bash
make lint       # ruff check
make fmt        # ruff format + autofix
make typecheck  # mypy on core + backend
make check      # lint + typecheck + test
make precommit  # run every pre-commit hook against all files
```

CI mirrors this (see .github/workflows/ci.yml): lint-python, test-python, frontend, docker-smoke (builds backend + frontend images, caches GPU images without running).


Deploy-host sync workflow

This repo is developed on one machine and deployed from another (the Mac mini that fronts compose.public.yml + cloudflared). To keep git pull on the deploy host boring — no conflicts, no accidental pushes of host-only tweaks — split files into three buckets and use the rules already baked into .gitignore.

File buckets

| Bucket | Examples | How it's handled |
|---|---|---|
| Shared, committed | src/, frontend/, config/default.json, compose.yml, docker/backend.Dockerfile | Edit on the dev machine; pulled read-only on the deploy host. |
| Deploy-only, never tracked | docker/frontend.public.Dockerfile, docker/nginx/public.conf, config/public.example.json, compose.public.override.yml, .env.public | Listed in .gitignore; live only on the deploy host. |
| Committed template, locally diverged | compose.public.yml, config/config.json, .env.public.example | Tracked upstream as a baseline; host tweaks stay local via git update-index --skip-worktree <file>. |

One-time setup on the deploy host

Run once per fresh clone so local tweaks to tracked templates are invisible to git status / git pull:

```bash
git update-index --skip-worktree compose.public.yml config/config.json

# Optional: list everything currently skip-worktree'd.
git ls-files -v | awk '/^S/ {print $2}'
```

To undo the flag (e.g. to pull a genuinely new upstream version of compose.public.yml), run git update-index --no-skip-worktree <file>, git stash, git pull, git stash pop, then re-set the flag.
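
Spelled out for compose.public.yml, that sequence is:

```bash
git update-index --no-skip-worktree compose.public.yml
git stash
git pull --ff-only origin main
git stash pop
git update-index --skip-worktree compose.public.yml
```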

Routine pull (deploy host)

```bash
# 1. Sanity check: no stray commits on the deploy host.
git status
git log --oneline origin/main..HEAD   # should print nothing

# 2. Fetch + fast-forward.
git fetch origin
git pull --ff-only origin main

# 3. Rebuild only what moved.
make build   # (or: docker compose -f compose.public.yml build)
docker compose -f compose.public.yml up -d
```

If step 1 shows local commits on the deploy host, you probably edited a shared file by mistake. Move the change to a deploy-only file if possible, otherwise git reset --soft origin/main to move the commit back to the index, split out the deploy-only bits into gitignored files, and discard the rest.
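
A sketch of that recovery, assuming the stray commit touched compose.yml (a shared file) plus some host-only tweaks:

```bash
git reset --soft origin/main      # move the commit back to the index
git restore --staged compose.yml  # un-stage the shared-file edit...
git restore compose.yml           # ...and drop it from the working tree
# Keep the host-only bits in a gitignored file, e.g. compose.public.override.yml,
# then verify the tree is clean relative to upstream:
git status
git log --oneline origin/main..HEAD   # should print nothing again
```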

Routine push (dev machine)

```bash
# Never git add anything under the "deploy-only" bucket above.
git status            # should only list shared files
git add -u            # or explicit paths
git commit -m "..."
git push origin main
```

If you ever need to track a new deploy-only file

  1. Create it under one of the ignored paths (e.g. docker/*.public.Dockerfile, docker/nginx/public.conf, config/public.example.json), or
  2. Add its path/pattern to the "Deploy-machine-only files" block in .gitignore, commit, push, and keep editing it locally.

Emergency: diverged with a local commit you don't need

```bash
# Save any working-tree edits first.
git stash -u

# Drop the local commit, align with cloud.
git fetch origin
git reset --hard origin/main

# Re-apply edits you still want.
git stash pop
```


Further reading

  • CHANGELOG.md — versioned history, starting with the v5.2.0 refactor.
  • spec.md — authoritative schema + pipeline + coloring spec.
  • tip.md — rolling bugs-and-solutions log referenced by the cursor rules.
  • CLAUDE.md — agent-facing project context.