**Canonical source.** This page is a one-to-one mirror of `README.md` at the repository root. Edit that file, not this one.

# Infolake (Truth Atlas)
Infolake / Truth Atlas is an end-to-end pipeline for turning a seed list of URLs into a browsable 2-D map of the web, coloured by topic and connected by citation / semantic edges. It crawls, enriches with a local LLM, embeds, builds a diffusion-map layout, and serves everything through a FastAPI backend and a React + deck.gl frontend.
The system is designed for a single GPU-accelerated workstation (tested on 5× RTX 3090 with CUDA 12.8) and is fully containerised via Docker Compose. vLLM runs with all GPUs; Qdrant holds the vector caches; SQLite + FTS5 is the structural store.
## Topology
```mermaid
flowchart LR
  subgraph host [Host, GPUs 0..4]
    subgraph compose [docker compose]
      frontend[frontend<br/>nginx :8080]
      backend[backend<br/>FastAPI :8000]
      worker[worker<br/>enrich-loop]
      vllm["vllm<br/>OpenAI API :8765"]
      qdrant[qdrant :6333]
      prom[prometheus<br/>profile=observability]
      dcgm[dcgm-exporter<br/>profile=observability]
      graf[grafana :3000<br/>profile=observability]
    end
    gpus[["NVIDIA GPUs"]]
  end
  frontend -->|/api/*| backend
  backend -->|HTTP| qdrant
  backend -->|HTTP| vllm
  worker -->|HTTP| vllm
  worker -->|HTTP| qdrant
  worker -->|sqlite| vol[("./data bind mount")]
  backend --> vol
  vllm -->|"nvidia runtime"| gpus
  dcgm -->|NVML| gpus
  prom --> backend
  prom --> dcgm
  graf --> prom
```
- Frontend — React 18 + Vite + deck.gl atlas view, shipped as a static bundle served by nginx.
- Backend — FastAPI at `/api/*` (`/api/health`, `/api/ready`, `/api/map-data`, `/api/docs`, `/api/candidates`, `/api/routes`, `/api/stars`, …).
- Worker — persistent `infolake-services enrich-loop`; fills document summaries, passage claims, embeddings.
- vLLM — OpenAI-compatible endpoint; all LLM calls go through `Outlines` (constrained decoding) or `instructor` (fallback).
- Qdrant — vector store for document / passage / claim embeddings.
- Observability (optional) — Prometheus scrapes the backend's `/api/metrics` and `dcgm-exporter` for per-GPU util / VRAM / temperature; Grafana ships a pre-provisioned dashboard.
## Prerequisites

- Docker Engine + Compose v2 (`docker compose version` ≥ v2).
- NVIDIA Container Toolkit so `vllm` and `worker` can reserve GPUs (`driver: nvidia`, `capabilities: [gpu]`).
- A local copy of the LLM weights under `models/` (the compose file bind-mounts `./models:/models:ro`). Default path: `models/qwen3-4b-instruct`.
- For source-mode development: Python 3.12 (the conda env `infolake` works) and Node 20 for the frontend.
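A quick preflight for the first two items (a sketch; the plain `ubuntu` image is enough because the toolkit injects `nvidia-smi` into GPU-enabled containers):

```bash
docker compose version                        # should report Compose v2.x
docker run --rm --gpus all ubuntu nvidia-smi  # should print the GPU table
```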
## Quick start (containerised)

```bash
# 1. Build images (first time only; cached afterwards).
make build

# 2. Bring the full stack up.
make up       # qdrant + vllm + worker + backend + frontend
# or, with observability:
make up-obs   # also starts prometheus + grafana + dcgm-exporter

# 3. Sanity check.
make doctor   # JSON health report across config, DB, GPUs, services
make ps       # docker compose ps

# 4. Open the UI.
#    http://127.0.0.1:8080           (frontend, served by nginx)
#    http://127.0.0.1:8000/api/docs  (backend OpenAPI)
#    http://127.0.0.1:3000           (grafana; admin / $GRAFANA_ADMIN_PASSWORD, default "admin")

# Tail logs from every container (Ctrl-C to stop).
make logs

# Teardown; -v also wipes the Qdrant volume.
make down
make down-wipe
```
Per-service container operations use `infolake-services` (or its Makefile alias):

```bash
infolake-services up --only qdrant,vllm
infolake-services restart --only worker
infolake-services logs vllm --tail 200 -f
infolake-services status --json
infolake-services build --pull --no-cache
```
## Source install (no containers)

Useful for running tests, executing one stage at a time on a dev box, or iterating on the pipeline without rebuilding images.
```bash
# Create / refresh the conda env.
conda env update -f environment.yml --prune

# Install infolake + dev + GPU extras in editable mode.
conda run -n infolake pip install -e ".[dev,gpu,obs]"

# Run the CLIs (console scripts declared in pyproject.toml).
conda run -n infolake infolake-doctor --json
conda run -n infolake infolake-backend --host 127.0.0.1 --port 8000
conda run -n infolake infolake-services enrich-loop --poll-interval-seconds 10
```
All CLIs (`infolake-doctor`, `infolake-services`, `infolake-backend`, `infolake-pipeline`, `infolake-mapping`, `infolake-inspect`, and the dispatcher `infolake <verb>`) accept `--help`.
## Running each part

### Database (SQLite + FTS5 + Alembic)

```bash
# Initialize / verify schema + FTS virtual tables.
conda run -n infolake python -m infolake.cli.init_database

# Health report (config, db, services, GPUs, runtime).
conda run -n infolake infolake-doctor --json
conda run -n infolake infolake-doctor --watch 5   # live view during warmup

# Data-quality + schema inspection report.
conda run -n infolake infolake-inspect

# Forward-only Alembic migrations.
conda run -n infolake alembic upgrade head
```
### Crawling

```bash
# Single session (containerised path; runs inside the worker container).
docker compose exec worker infolake-services crawl \
  --session-id demo_$(date +%s) --seeds-path /app/seeds.txt

# Single session (source path).
conda run -n infolake infolake-services crawl \
  --session-id demo_$(date +%s) --seeds-path seeds.txt
```
Checkpoints land in `checkpoints/<session_id>.json`; a crawl can be resumed by re-running the command with the same `--session-id`. Deny/allow domains and URL-pattern filters live under `crawl.filters` in the config.
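As a sketch, such a block in `config/local.json` might look like this (the key names come from the configuration table below; the domains and patterns are placeholders):

```json
{
  "crawl": {
    "filters": {
      "deny_domains": ["facebook.com", "twitter.com"],
      "blacklist_domains": ["ads.example.net"],
      "exclude_patterns": ["\\?page=\\d+$"]
    }
  }
}
```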
### Enrichment

```bash
# One-shot document + passage pass.
conda run -n infolake infolake-services enrich

# Persistent loop (the worker container runs this as its CMD).
conda run -n infolake infolake-services enrich-loop --poll-interval-seconds 10

# Smoke-test a single URL/text pair against a temp DB.
conda run -n infolake infolake-services smoke \
  --sample-url "https://example.com" --sample-text "Sample content" --with-embeddings
```
Enrichment resume state is implicit: any row with `editorial_verdict IS NULL` or a stale `enrichment_model_version` gets re-picked. Bump `enrichment.model_version` in the config to force a re-enrichment pass.
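For example, a one-key override in `config/local.json` (a sketch, assuming host overrides merge over the base file):

```json
{
  "enrichment": {
    "model_version": 2
  }
}
```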
### End-to-end pipeline

```bash
# Runs the stages listed in pipeline.stages (from config) in order.
conda run -n infolake infolake-pipeline \
  --session-id nightly_$(date +%s) --seeds-path seeds.txt
```
The stage list is resolved through the `infolake.pipeline_stages` entry-point group (see Extending), so swapping one stage for another is a config edit, not a code edit.
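For example, a third-party package could register a hypothetical stage the same way the Extending section registers an LLM backend (the package and class names here are illustrative):

```toml
[project.entry-points."infolake.pipeline_stages"]
dedupe = "my_pkg.stages:DedupeStage"
```

After installing the package, adding `"dedupe"` to `pipeline.stages` in the config is all that's needed.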
### Mapping (diffusion-map layout)

```bash
# Full recompute from the current corpus.
conda run -n infolake infolake-mapping --full

# Incremental update for newly-added documents.
conda run -n infolake infolake-mapping --incremental
```
Artifacts (eigenvectors, UMAP coordinates, region labels) are written under `data/mapping/` and consumed by `/api/map-data`.
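To spot-check the output end to end once the mapping stage has run (assuming the backend is up on its default port):

```bash
curl -s http://127.0.0.1:8000/api/map-data | head -c 400
```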
### Backend (FastAPI)

```bash
# Dev server with autoreload.
conda run -n infolake infolake-backend --host 127.0.0.1 --port 8000 --reload

# Production (what the backend container runs):
uvicorn infolake.backend.app:app --host 0.0.0.0 --port 8000 \
  --workers 2 --loop uvloop --http httptools
```
`/api/health` is a liveness probe; `/api/ready` returns a distilled `SystemReport` (config + DB + per-service probes) and flips to 503 on any failure.
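Since readiness is surfaced as an HTTP status, `curl -f` can gate scripts on it directly; a minimal sketch, assuming the default port:

```bash
# Poll until the backend reports ready (curl -f exits non-zero on the 503).
until curl -fsS http://127.0.0.1:8000/api/ready > /dev/null; do
  echo "backend not ready yet..."
  sleep 5
done
```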
### Frontend (React + Vite)

```bash
# Dev with hot reload — Vite proxies /api/* to localhost:8000.
cd frontend
npm ci
npm run dev -- --host 0.0.0.0 --port 5173

# Production build (what the frontend container bakes).
npm run build
```
The containerised frontend is always the last-built bundle; for live UI work, enable the `dev` profile:
```bash
docker compose --profile dev up -d frontend-dev
# http://127.0.0.1:5173
```
## Configuration

Configuration is a single Pydantic-validated JSON file: `config/default.json`. Every key is declared in `src/infolake/core/schemas.py` with `model_config = ConfigDict(extra="forbid")`, so a typo or stale key fails loudly at boot.
- Base file: `config/default.json` (committed).
- Host override: `config/local.json` (gitignored; see `config/local.example.json`).
- Env-var override: `INFOLAKE_CONFIG_PATH=/path/to/config.json`.
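A minimal `config/local.json` might override just a couple of keys from the table below (illustrative values; any key must already exist in the schema because of `extra="forbid"`):

```json
{
  "storage": { "memory_cap_gb": 8 },
  "logging": { "level": "INFO" }
}
```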
### How to tweak the common knobs

| Section | Key | Default | When to change it |
|---|---|---|---|
| `storage` | `memory_cap_gb` | `16` | Lower on smaller hosts; applied as SQLite `cache_size`/`mmap_size`. |
| `database` | `sqlite_path` | `data/infolake.db` | Point at a different corpus or staging DB. |
| `crawl` | `strategy_schedule` | BFS 30/5000 → DFS 30/10000 | Shape the crawl frontier (BFS first for breadth, DFS for depth). |
| `crawl` | `memory_threshold_percent` | `90.0` | Drop if the crawler OOMs; it pauses dispatch above this RAM %. |
| `crawl` | `max_session_permit` | `32` | Concurrent in-flight pages; bounded by `crawl4ai.concurrency`. |
| `crawl.filters` | `deny_domains`, `blacklist_domains`, `exclude_patterns` | — | Keep the crawl off social/ad domains + pagination spam. |
| `crawl.rate_limits` | `default_domain_delay_seconds` | `1.0` | Be nicer to origin servers. |
| `crawl4ai` | `concurrency` | `32` | Raise for fast-network / small-page seed lists. |
| `crawl4ai` | `browser_type`, `headless`, `check_robots_txt` | `chromium`, `true`, `true` | Toggle for debugging render issues. |
| `llm` | `model_id`, `model_path` | `Qwen/Qwen3-4B-Instruct`, `models/qwen3-4b-instruct` | Swap model; path is read inside the vllm container. |
| `llm` | `max_input_tokens`, `max_new_tokens` | `3072`, `1024` | Must stay under `vllm.max_model_len`. Enrichment enforces the input side at prompt boundaries. |
| `llm` | `constrained_decoding` | `outlines` | `outlines` (grammar binding) → `instructor` (retry) → `none`. |
| `vllm` | `tensor_parallel_size` / `pipeline_parallel_size` / `data_parallel_size` | 1 / 1 / 5 | GPU layout. World size = TP × PP × DP; start with DP = n_gpus for throughput. |
| `vllm` | `max_model_len` | `32768` | Lower first if startup OOMs; raise only when workloads truly need longer context. |
| `vllm` | `max_num_batched_tokens` / `max_num_seqs` | 16384 / 32 | Throughput vs latency. Raise batched tokens when GPUs are idle; lower seqs for long-context traffic. |
| `vllm` | `gpu_memory_utilization` | `0.9` | Reduce if warmup OOMs; increase cautiously for more KV-cache capacity. |
| `enrichment` | `batch_size`, `llm_concurrency` | 32, 12 | Upstream throughput knobs for the worker. |
| `enrichment` | `model_version` | `1` | Bump to force re-enrichment across the corpus. |
| `embedding` | `model`, `device`, `batch_size` | BGE-small, `cpu`, 256 | Switch to `cuda` and a smaller batch if CPU becomes the bottleneck. |
| `qdrant` | `url` | `http://127.0.0.1:6333` | Point at an alternate Qdrant. In compose it becomes `http://qdrant:6333`. |
| `diffusion_map` | `alpha`, `t`, `k`, `knn_k` | 1.0, 1.0, 20, 20 | Diffusion-map hyperparameters; see docs/spec.md §8. |
| `diffusion_map` | `edges`, `edge_weights` | citations / co_citations / semantic / claim_overlap / domain_links | Turn off a source by removing its name from `edges`. |
| `diffusion_map.coloring` | `hue_eigenvector`, `sat_eigenvector`, `lightness_eigenvector` | 4, 5, 3 | Rebalance the colour space if clusters collapse. |
| `traversal` | `weights` | info_gain 0.40 / quality 0.25 / diffusion 0.20 / citation 0.15 | Rank candidate next-hops for the route service. |
| `routes` | `n_routes`, `max_hops`, `cache_capacity` | 3, 8, 256 | `/api/routes` behaviour. |
| `logging` | `level` | `WARNING` | Raise to INFO/DEBUG for noisier triage; env override: `INFOLAKE_LOG_LEVEL`. |
| `diagnostics.telemetry` | `enabled` | `false` | Flip on to mount `/api/metrics` and wire the OTel exporter (requires `[obs]` extras). |
| `pipeline` | `stages` | `["crawl","enrich_documents","enrich_passages","compute_mapping"]` | Reorder / drop / add stages. Names resolve through the `infolake.pipeline_stages` entry-point group. |
### vLLM throughput tuning, in order

- Set the GPU layout (`tensor_parallel_size` × `pipeline_parallel_size` × `data_parallel_size` must equal the number of available GPUs).
- Set `max_model_len` to the largest context you actually need.
- Push `max_num_batched_tokens` up while GPUs are underutilised; pull it back on latency spikes or OOM.
- Trim `max_num_seqs` for long-context traffic; raise it for many short requests.
- Nudge `gpu_memory_utilization` last.
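Put together, the table defaults describe the tested 5-GPU box (world size 1 × 1 × 5 = 5). As a config fragment, assuming the keys sit under a top-level `vllm` block:

```json
{
  "vllm": {
    "tensor_parallel_size": 1,
    "pipeline_parallel_size": 1,
    "data_parallel_size": 5,
    "max_model_len": 32768,
    "max_num_batched_tokens": 16384,
    "max_num_seqs": 32,
    "gpu_memory_utilization": 0.9
  }
}
```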
## Diagnostics and observability

```bash
# Unified report: config + DB + GPUs (pynvml) + runtime (psutil) + service probes.
infolake-doctor --json
infolake-doctor --gpu-only
infolake-doctor --watch 5   # refresh loop, good for vLLM warmup

# Live Prometheus metrics (when diagnostics.telemetry.enabled = true).
curl -s http://127.0.0.1:8000/api/metrics | head

# Full observability stack (prometheus + grafana + dcgm-exporter).
make up-obs
#   http://127.0.0.1:9090   prometheus
#   http://127.0.0.1:3000   grafana (pre-provisioned Infolake dashboard)
#   http://127.0.0.1:9400   dcgm-exporter
```
Structured JSON logs are emitted on stdout when running inside a container (`INFOLAKE_LOG_FORMAT=json`); host CLI runs keep the human-readable format.
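Both knobs can be combined on a host run to mimic the container output while debugging (a sketch):

```bash
INFOLAKE_LOG_FORMAT=json INFOLAKE_LOG_LEVEL=DEBUG \
  conda run -n infolake infolake-backend --host 127.0.0.1 --port 8000
```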
## Project layout

```
src/infolake/
  core/         # config, schemas, DB handle, repositories/, llm/, graph, container
  backend/      # FastAPI app + /api routers
  pipelines/
    crawling/   # orchestrator + scheduler + fetcher + writer + stage adapter
    enrichment/ # consumer + per-stage adapters
    mapping/    # diffusion map + projection + coloring + labels + stage adapter
  services/     # local-process managers (VLLMManager, QdrantService, Worker)
  diagnostics/  # gpu, runtime, services, telemetry, report
  extensions/   # Protocol interfaces + entry-point Registry
  cli/          # console scripts (doctor, services, backend, pipeline, inspect, ...)
  db/           # bootstrap, fts, maintenance, alembic/
docker/         # Dockerfiles, nginx conf, prometheus.yml, grafana provisioning
compose.yml     # production topology (+ observability profile)
compose.override.yml  # dev: bind-mount src, --reload, optional Vite dev server
config/
  default.json        # committed base config
  local.example.json  # per-host override template (gitignored as config/local.json)
frontend/       # React + Vite app, delivered by nginx in compose
tests/          # pytest suite; conftest points at a temp fixture config
```
## Testing

```bash
# Everything.
make test

# Or, manually:
conda run -n infolake pytest --cov --cov-report=term

# Single file.
conda run -n infolake pytest tests/test_mapping_pipeline.py -v
```
`tests/conftest.py` exports `INFOLAKE_CONFIG_PATH` to a temp fixture JSON, so tests never touch the production DB or `config/default.json`.
## Extending

Every cross-cutting capability is a plugin seam. Third-party packages register implementations via `pyproject.toml` entry points; infolake discovers them through `importlib.metadata.entry_points` at first use.
| Group | Contract | Example (in-repo) |
|---|---|---|
| `infolake.llm_backends` | `LLMBackend` | `infolake.core.llm.outlines:OutlinesLLMClient` |
| `infolake.embedders` | `Embedder` | `infolake.core.embedding:TextEmbeddingClient` |
| `infolake.fetchers` | `Fetcher` | `infolake.core.crawl4ai_client:Crawl4AIClient` |
| `infolake.pipeline_stages` | `PipelineStage` | `infolake.pipelines.crawling.stage:CrawlStage` |
| `infolake.projections` | `Projector` | `infolake.pipelines.mapping.project:UMAPProjector` |
| `infolake.graph_edges` | `GraphEdgeSource` | (names listed in `diffusion_map.edges`) |
Example third-party registration:

```toml
[project.entry-points."infolake.llm_backends"]
vertex_ai = "my_pkg.llm:VertexAIBackend"
```
Then switch `llm.constrained_decoding = "vertex_ai"` (or wire it through the `Registry.get(...)` call) — no edits to infolake required.
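Discovery itself is plain stdlib; a minimal sketch of the lookup side (not the actual `Registry` code, which lives under `extensions/`):

```python
from importlib.metadata import entry_points

# Collect every backend registered under the group, then load one by name.
backends = {ep.name: ep for ep in entry_points(group="infolake.llm_backends")}
backend_cls = backends["vertex_ai"].load()  # resolves "my_pkg.llm:VertexAIBackend"
```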
## Developer loop

```bash
make lint       # ruff check
make fmt        # ruff format + autofix
make typecheck  # mypy on core + backend
make check      # lint + typecheck + test
make precommit  # run every pre-commit hook against all files
```
CI mirrors this (see `.github/workflows/ci.yml`): `lint-python`, `test-python`, `frontend`, `docker-smoke` (builds backend + frontend images, caches GPU images without running).
## Deploy-host sync workflow

This repo is developed on one machine and deployed from another (the Mac mini that fronts `compose.public.yml` + cloudflared). To keep `git pull` on the deploy host boring — no conflicts, no accidental pushes of host-only tweaks — split files into three buckets and use the rules already baked into `.gitignore`.
### File buckets

| Bucket | Examples | How it's handled |
|---|---|---|
| Shared, committed | `src/`, `frontend/`, `config/default.json`, `compose.yml`, `docker/backend.Dockerfile` | Edit on the dev machine; pulled read-only on the deploy host. |
| Deploy-only, never tracked | `docker/frontend.public.Dockerfile`, `docker/nginx/public.conf`, `config/public.example.json`, `compose.public.override.yml`, `.env.public` | Listed in `.gitignore`; live only on the deploy host. |
| Committed template, locally diverged | `compose.public.yml`, `config/config.json`, `.env.public.example` | Tracked upstream as a baseline; host tweaks stay local via `git update-index --skip-worktree <file>`. |
### One-time setup on the deploy host

Run once per fresh clone so local tweaks to tracked templates are invisible to `git status` / `git pull`:

```bash
git update-index --skip-worktree compose.public.yml config/config.json

# Optional: list everything currently skip-worktree'd
git ls-files -v | awk '/^S/ {print $2}'
```
To undo the flag (e.g. to pull a genuinely new upstream version of `compose.public.yml`), run `git update-index --no-skip-worktree <file>`, `git stash`, `git pull`, `git stash pop`, then re-set the flag.
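Spelled out for one file (the same commands, in order):

```bash
git update-index --no-skip-worktree compose.public.yml
git stash
git pull --ff-only origin main
git stash pop
git update-index --skip-worktree compose.public.yml
```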
### Routine pull (deploy host)

```bash
# 1. Sanity check: no stray commits on the deploy host.
git status
git log --oneline origin/main..HEAD   # should print nothing

# 2. Fetch + fast-forward.
git fetch origin
git pull --ff-only origin main

# 3. Rebuild only what moved.
make build   # (or: docker compose -f compose.public.yml build)
docker compose -f compose.public.yml up -d
```
If step 1 shows local commits on the deploy host, you probably edited a shared file by mistake. Move the change to a deploy-only file if possible; otherwise `git reset --soft origin/main` to move the commit back to the index, split out the deploy-only bits into gitignored files, and discard the rest.
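As a concrete sketch (the shared file name is a placeholder):

```bash
git reset --soft origin/main       # move the stray commit back to the index
git restore --staged compose.yml   # unstage the shared-file edit
# move the deploy-only bits into a gitignored override file, then drop the rest
git restore compose.yml
```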
### Routine push (dev machine)

```bash
# Never git add anything under the "deploy-only" bucket above.
git status            # should only list shared files
git add -u            # or explicit paths
git commit -m "..."
git push origin main
```
### If you ever need to track a new deploy-only file

- Create it under one of the already-ignored paths (e.g. `docker/*.public.Dockerfile`, `docker/nginx/public.conf`, `config/public.example.json`), or
- Add its path/pattern to the "Deploy-machine-only files" block in `.gitignore`, commit, push, and keep editing it locally.
### Emergency: diverged with a local commit you don't need

```bash
# Save any working-tree edits first.
git stash -u

# Drop the local commit, align with cloud.
git fetch origin
git reset --hard origin/main

# Re-apply edits you still want.
git stash pop
```
## Further reading

- `CHANGELOG.md` — versioned history, starting with the v5.2.0 refactor.
- `spec.md` — authoritative schema + pipeline + coloring spec.
- `tip.md` — rolling bugs-and-solutions log referenced by the cursor rules.
- `CLAUDE.md` — agent-facing project context.