What Infolake is
Infolake (internally: Truth Atlas) is an end-to-end pipeline that turns a seed list of URLs into a browsable 2-D map of the web, coloured by topic and connected by citation / semantic edges.
It is one system, not a framework: a specific set of choices about how to crawl, how to read, how to embed, how to lay out, and how to serve. Every choice is swappable through entry-point plugins, but opinionated defaults exist for all of them.
The five stages
- Crawl. A scheduler walks the seed list, picking BFS or DFS per the `strategy_schedule` knob, respecting robots.txt and domain rate limits. Output: HTML, metadata, extracted citations.
- Enrich — documents. Each page goes through a local LLM (Qwen via vLLM, constrained by Outlines or retried through `instructor`) to produce a summary, a quality verdict, and topic tags.
- Enrich — passages. Pages are chunked; each chunk gets a claim-extraction pass that yields short atomic claims with source spans.
- Embed. BGE-small embeddings are written to Qdrant for documents, passages, and claims. Structural metadata lives in SQLite + FTS5.
- Map. A diffusion map over citation, semantic, and claim-overlap edges produces 2-D coordinates plus a colouring that uses three eigenvectors for hue / saturation / lightness. Region labels come from per-cluster topic summarisation.
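The three-eigenvector colouring from the Map stage can be sketched as follows. Which eigenvector maps to which channel, and the exact saturation/lightness ranges, are assumptions for illustration; the inputs are assumed pre-normalised to [0, 1] per node.

```python
import colorsys


def eigs_to_rgb(e1: float, e2: float, e3: float) -> tuple:
    """Map three diffusion-map eigenvector values (each already
    normalised into [0, 1]) onto hue / saturation / lightness,
    then convert to RGB for rendering."""
    hue = e1                          # full hue wheel from the first eigenvector
    sat = 0.4 + 0.5 * e2              # keep colours readable, never fully grey
    lig = 0.35 + 0.3 * e3             # avoid near-black and near-white extremes
    return colorsys.hls_to_rgb(hue, lig, sat)  # note: colorsys order is H, L, S
```

Clamping saturation and lightness into a mid-band like this keeps neighbouring regions distinguishable while reserving hue, the most salient channel, for the dominant eigenvector.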
The runtime shape
- One FastAPI backend exposing `/api/health`, `/api/ready`, `/api/map-data`, `/api/docs`, `/api/candidates`, `/api/routes`, `/api/stars`.
- One React + Vite + deck.gl frontend, served by nginx in production and by Vite in dev.
- One worker container running the persistent `enrich-loop`.
- One vLLM container holding the model, bound to all GPUs.
- One Qdrant container for vectors.
- Optional observability stack: Prometheus, Grafana, dcgm-exporter.
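The container layout above can be sketched as a docker-compose fragment. All service names, images, build paths, and ports here are illustrative assumptions, not the project's actual compose file.

```yaml
services:
  api:                  # FastAPI backend
    build: ./backend
    ports: ["8000:8000"]
  frontend:             # React + Vite, behind nginx in production
    build: ./frontend
    ports: ["80:80"]
  worker:               # persistent enrich-loop
    build: ./worker
  vllm:                 # model server, bound to all GPUs
    image: vllm/vllm-openai
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  qdrant:               # vector store
    image: qdrant/qdrant
```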
All of this runs on one workstation. Tested on 5× RTX 3090 with CUDA 12.8. It is absolutely not designed for a Kubernetes cluster; it is designed to be a single, reproducible box.
What it is not
- Not a search engine. Ranking is secondary; the primary output is a spatial layout, not a relevance-ordered list.
- Not a knowledge graph. Claims are extracted, but no canonical entity resolution layer exists yet.
- Not VR or 3-D. The "map" is a 2-D scatter rendered with deck.gl. The spatial metaphor is the point; the third dimension would hurt legibility.
- Not a hosted product. You run it yourself, on your own GPUs, on your own seed list.
When to use it
- You want a topology-first view of a corpus you control (a reading list, a research area, a competitor landscape).
- You are comfortable running Docker + one GPU workstation.
- You want every component to be inspectable and replaceable.
When not to use it
- You need a managed, zero-ops search solution → use a hosted search API.
- You need fresh results every minute over the whole web → Infolake is a batch / soak system, not a real-time index.
- You have no GPUs and don't want to rent any → the enrichment stage will be very slow.
Next stop: Architecture.