What Infolake is
Infolake (internally: Truth Atlas) is an end-to-end pipeline that turns a seed list of URLs into a browsable 2-D map of the web, coloured by topic and connected by citation / semantic edges.
It is one system, not a framework: a specific set of choices about how to crawl, how to read, how to embed, how to lay out, and how to serve. Every choice is swappable through entry-point plugins, but opinionated defaults exist for all of them.
The five stages
- Crawl. A scheduler walks the seed list, picking BFS or DFS per the `strategy_schedule` knob, respecting robots.txt and domain rate limits. Output: HTML, metadata, extracted citations.
- Enrich — documents. Each page goes through a local LLM (Qwen via vLLM, constrained by Outlines or retried through `instructor`) to produce a summary, a quality verdict, and topic tags.
- Enrich — passages. Pages are chunked; each chunk gets a claim-extraction pass that yields short atomic claims with source spans.
- Embed. BGE-small embeddings are written to Qdrant for documents, passages, and claims. Structural metadata lives in SQLite + FTS5.
- Map. A diffusion map over citation, semantic, and claim-overlap edges produces 2-D coordinates plus a colouring that uses three eigenvectors for hue / saturation / lightness. Region labels come from per-cluster topic summarisation.
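The three-eigenvector colouring from the Map stage can be sketched as follows. Which eigenvector maps to which channel, and the exact saturation/lightness ranges, are assumptions for illustration; the inputs are assumed pre-normalised to [0, 1] per node.

```python
import colorsys


def eigs_to_rgb(e1: float, e2: float, e3: float) -> tuple:
    """Map three diffusion-map eigenvector values (each already
    normalised into [0, 1]) onto hue / saturation / lightness,
    then convert to RGB for rendering."""
    hue = e1                          # full hue wheel from the first eigenvector
    sat = 0.4 + 0.5 * e2              # keep colours readable, never fully grey
    lig = 0.35 + 0.3 * e3             # avoid near-black and near-white extremes
    return colorsys.hls_to_rgb(hue, lig, sat)  # note: colorsys order is H, L, S
```

Clamping saturation and lightness into a mid-band like this keeps neighbouring regions distinguishable while reserving hue, the most salient channel, for the dominant eigenvector.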
The runtime shape
- One FastAPI backend exposing `/api/health`, `/api/ready`, `/api/map-data`, `/api/docs`, `/api/candidates`, `/api/routes`, `/api/stars`.
- One React + Vite + deck.gl frontend, served by nginx in production and by Vite in dev.
- One worker container running the persistent `enrich-loop`.
- One vLLM container holding the model, bound to all GPUs.
- One Qdrant container for vectors.
- Optional observability stack: Prometheus, Grafana, dcgm-exporter.
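The container layout above can be sketched as a docker-compose fragment. All service names, images, build paths, and ports here are illustrative assumptions, not the project's actual compose file.

```yaml
services:
  api:                  # FastAPI backend
    build: ./backend
    ports: ["8000:8000"]
  frontend:             # React + Vite, behind nginx in production
    build: ./frontend
    ports: ["80:80"]
  worker:               # persistent enrich-loop
    build: ./worker
  vllm:                 # model server, bound to all GPUs
    image: vllm/vllm-openai
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  qdrant:               # vector store
    image: qdrant/qdrant
```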
All of this runs on one workstation. Tested on 5× RTX 3090 with CUDA 12.8. It is absolutely not designed for a Kubernetes cluster; it is designed to be a single, reproducible box.
What it is not
- Not a search engine. Ranking is secondary; the primary output is a spatial layout, not a relevance-ordered list.
- Not a knowledge graph. Claims are extracted, but no canonical entity resolution layer exists yet.
- Not VR or 3-D. The "map" is a 2-D scatter rendered with deck.gl. The spatial metaphor is the point; the third dimension would hurt legibility.
- Not a hosted product. You run it yourself, on your own GPUs, on your own seed list.
When to use it
- You want a topology-first view of a corpus you control (a reading list, a research area, a competitor landscape).
- You are comfortable running Docker + one GPU workstation.
- You want every component to be inspectable and replaceable.
When not to use it
- You need a managed, zero-ops search solution → use a hosted search API.
- You need fresh results every minute over the whole web → Infolake is a batch / soak system, not a real-time index.
- You have no GPUs and don't want to rent any → the enrichment stage will be very slow.
Next stop: Architecture.