What Infolake is

Infolake (internally: Truth Atlas) is an end-to-end pipeline that turns a seed list of URLs into a browsable 2-D map of the web, coloured by topic and connected by citation, semantic, and claim-overlap edges.

It is one system, not a framework: a specific set of choices about how to crawl, how to read, how to embed, how to lay out, and how to serve. Every choice is swappable through entry-point plugins, but opinionated defaults exist for all of them.

The five stages

  1. Crawl. A scheduler walks the seed list, picking BFS or DFS per the strategy_schedule knob, respecting robots.txt and domain rate limits. Output: HTML, metadata, extracted citations.
  2. Enrich — documents. Each page goes through a local LLM (Qwen via vLLM, constrained by Outlines or retried through instructor) to produce a summary, a quality verdict, and topic tags.
  3. Enrich — passages. Pages are chunked; each chunk gets a claim extraction pass that yields short atomic claims with source spans.
  4. Embed. BGE-small embeddings are written to Qdrant for documents, passages, and claims. Structural metadata lives in SQLite + FTS5.
  5. Map. A diffusion-map over citation, semantic, and claim-overlap edges produces 2-D coordinates plus a colouring that uses three eigenvectors for hue / saturation / lightness. Region labels come from per-cluster topic summarisation.
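Stage 5 can be sketched in miniature. Assuming the citation, semantic, and claim-overlap edges have already been combined into one symmetric affinity matrix (the combination weights and all names below are illustrative, not the project's actual code), a basic diffusion map looks like:

```python
import numpy as np


def diffusion_map(adj: np.ndarray, n_coords: int = 2, t: int = 1) -> np.ndarray:
    """Toy diffusion map over a symmetric, non-negative affinity matrix."""
    deg = adj.sum(axis=1)
    # Symmetrised transition operator S = D^-1/2 A D^-1/2 shares its
    # spectrum with the random walk P = D^-1 A but is stable under eigh.
    d_half = np.sqrt(deg)
    S = adj / np.outer(d_half, d_half)
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]           # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    # Recover right eigenvectors of P; skip the trivial constant one.
    psi = vecs / d_half[:, None]
    return psi[:, 1:1 + n_coords] * (vals[1:1 + n_coords] ** t)


# Tiny example: two triangles joined by one edge should separate
# cleanly along the first diffusion coordinate.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
xy = diffusion_map(A, n_coords=2)
```

The colouring described above would extend the same idea: take the next eigenvectors beyond the two used for coordinates and map them to hue, saturation, and lightness.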

The runtime shape

  • One FastAPI backend exposing /api/health, /api/ready, /api/map-data, /api/docs, /api/candidates, /api/routes, /api/stars.
  • One React + Vite + deck.gl frontend, served by nginx in production and by Vite in dev.
  • One worker container running the persistent enrich-loop.
  • One vLLM container holding the model, bound to all GPUs.
  • One Qdrant container for vectors.
  • Optional observability stack: Prometheus, Grafana, dcgm-exporter.
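The container list above translates naturally into a single Compose file. A hedged sketch of what that might look like (service names, build paths, and image tags are illustrative, not the project's actual configuration):

```yaml
services:
  backend:            # FastAPI app serving /api/*
    build: ./backend
    ports: ["8000:8000"]
  frontend:           # React + Vite + deck.gl, behind nginx in production
    build: ./frontend
    ports: ["80:80"]
  worker:             # persistent enrich-loop
    build: ./worker
  vllm:               # model server, bound to all GPUs on the box
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  qdrant:             # vector store for documents, passages, claims
    image: qdrant/qdrant:latest
    ports: ["6333:6333"]
```

The optional observability stack (Prometheus, Grafana, dcgm-exporter) would be additional services in the same file, typically behind a Compose profile so the base deployment stays minimal.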

All of this runs on one workstation. Tested on 5× RTX 3090 with CUDA 12.8. It is absolutely not designed for a Kubernetes cluster; it is designed to be a single, reproducible box.

What it is not

  • Not a search engine. Ranking is secondary; the primary output is a spatial layout, not a relevance-ordered list.
  • Not a knowledge graph. Claims are extracted, but no canonical entity resolution layer exists yet.
  • Not VR or 3-D. The "map" is a 2-D scatter rendered with deck.gl. The spatial metaphor is the point; the third dimension would hurt legibility.
  • Not a hosted product. You run it yourself, on your own GPUs, on your own seed list.

When to use it

  • You want a topology-first view of a corpus you control (a reading list, a research area, a competitor landscape).
  • You are comfortable running Docker + one GPU workstation.
  • You want every component to be inspectable and replaceable.

When not to use it

  • You need a managed, zero-ops search solution → use a hosted search API.
  • You need fresh results every minute over the whole web → Infolake is a batch / soak system, not a real-time index.
  • You have no GPUs and don't want to rent any → the enrichment stage will be very slow.

Next stop: Architecture.