Use case · Scrapers & crawlers

Scrape raw.
Feed lean.

LeanCTX turns scraped pages into model-ready context: ctx_url_read ingests HTML, PDF, RSS and YouTube transcripts, strips boilerplate, deduplicates across pages and extracts facts and quotes. Your crawler stays simple. The context layer shrinks the crawler's output by 60–90% before any model sees it.

The problem

What it costs you today.

01

Web sludge eats your token budget

Navigation, cookie banners, footers, ads: most of a scraped page is boilerplate you pay to embed and pay again to prompt.
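To make the boilerplate problem concrete, here is a minimal Python sketch of tag-based stripping using only the standard library. It is an illustration, not LeanCTX's actual implementation: the `SKIP_TAGS` set, the `MainTextExtractor` class and the `strip_boilerplate` function are hypothetical names chosen for this example, and real pages usually need smarter heuristics than a fixed tag list.

```python
from html.parser import HTMLParser

# Tags whose contents count as boilerplate in this sketch.
# This list is an assumption, not LeanCTX's actual rule set.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style", "noscript", "form"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def strip_boilerplate(html: str) -> str:
    """Return the visible text outside boilerplate elements, one chunk per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Feeding this a page with a nav bar, an article and a footer returns only the article text, which is the part worth embedding.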

02

Duplicates multiply silently

The same article appears on five URLs. Without content-aware deduplication you store and process it five times.
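Content-aware deduplication can be sketched in a few lines of Python: normalize the extracted text, hash it, and group URLs by hash. The `content_key` and `dedupe` names are hypothetical, and the normalization (whitespace and case only) is an assumption for illustration; how LeanCTX normalizes content before hashing is not stated here.

```python
import hashlib
import re


def content_key(text: str) -> str:
    """Hash the text after collapsing whitespace and case differences."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(pages: dict[str, str]) -> dict[str, list[str]]:
    """Group URLs by content hash; each group is a single stored document."""
    groups: dict[str, list[str]] = {}
    for url, text in pages.items():
        groups.setdefault(content_key(text), []).append(url)
    return groups
```

With this approach, five URLs that serve the same article end up in one group, so the article is stored and processed once. Exact hashing only catches copies that match after normalization; near-duplicates would need a fuzzier technique.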

03

Raw dumps are unsearchable

A folder of scraped HTML is not a knowledge base. Your agent needs ranked retrieval, not 10,000 files.
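For readers unfamiliar with ranked retrieval, the following Python sketch scores documents against a query with the standard Okapi BM25 formula. It covers only the BM25 part and makes no claim about how LeanCTX implements its index; `tokenize` and `bm25_rank` are hypothetical names, and the `k1` and `b` defaults are common textbook values.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[tuple[int, float]]:
    """Return (doc_index, score) pairs sorted best-first using Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n if n else 0.0
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents need more hits to score as high.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

An agent can then read only the top few results instead of walking the whole folder.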

Shipped today

The capabilities that do the work.

Everything below ships in the open-source binary today. No roadmap items, no waitlists.

Your tools → LeanCTX → Model

Universal intake

ctx_url_read handles HTML, PDF, RSS feeds and YouTube transcripts.

Facts & quotes modes

Pages collapse into attributable facts and verbatim quotes.

Deduplication

Content-hash dedup across pages, sessions and crawls.

BM25 + graph search

Everything ingested becomes locally searchable and rankable.

Local archive

Originals stay retrievable; compression is never a dead end.

Quickstart

From zero to first gain.

# ingest a page as attributable facts
$ ctx_url_read("https://example.com/article", mode="facts")
# ingest a feed; items arrive dated and deduplicated
$ ctx_url_read("https://news.site/feed.xml")
# search everything you ingested
$ lean-ctx grep "quarterly revenue"
# retrieve a full original when needed
$ ctx_retrieve(id)

FAQ

Questions teams ask before adopting.

Does LeanCTX replace my scraper?

No. It sits after your scraper. Keep Playwright, Scrapy or curl; pipe the output through LeanCTX and your models receive deduplicated, compressed, searchable context instead of raw HTML.

What input formats are supported?

HTML pages, PDFs, CSV, RSS/Atom feeds, plain text, email and YouTube transcripts. Each format has its own normalization strategy before compression.

Can I get the original page back?

Always. Every original is archived locally and retrievable via ctx_retrieve. Compression in LeanCTX is reversible by design.
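The general pattern behind this kind of reversibility can be sketched as follows: store the original under an id, and hand the model a short form that carries that id. This Python example is a simplified in-memory illustration under stated assumptions; `LocalArchive` and `compress_with_handle` are hypothetical names, truncation stands in for real compression, and none of this reflects how ctx_retrieve is actually implemented.

```python
import hashlib


class LocalArchive:
    """Hypothetical in-memory stand-in for a local archive of originals."""

    def __init__(self):
        self._store: dict[str, str] = {}

    def put(self, original: str) -> str:
        """Store the original and return a short id derived from its content."""
        doc_id = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self._store[doc_id] = original
        return doc_id

    def get(self, doc_id: str) -> str:
        """Return the full original for a previously issued id."""
        return self._store[doc_id]


def compress_with_handle(original: str, archive: LocalArchive, max_chars: int = 80) -> dict:
    """Return a lossy short form plus the id needed to recover the original."""
    doc_id = archive.put(original)
    # Truncation is a placeholder for whatever compression is actually applied.
    return {"id": doc_id, "summary": original[:max_chars]}
```

The short form alone is lossy, but because the id travels with it, the full original is always one lookup away.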

Take back control of your context.

Free for local use, forever. CI enforces it. One binary, ten minutes to the first measured gain.