Scrape raw.
Feed lean.
LeanCTX turns scraped pages into model-ready context: ctx_url_read ingests HTML, PDF, RSS and YouTube transcripts, strips boilerplate, deduplicates across pages and extracts facts and quotes. Your crawler stays simple. The context layer shrinks the crawler's output by 60–90% before any model sees it.
What it costs you today.
Web sludge eats your token budget
Navigation, cookie banners, footers, ads: most of a scraped page is boilerplate you pay to embed and pay again to prompt.
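To make the cost concrete, here is a minimal sketch of boilerplate stripping using only Python's standard library. It is not how LeanCTX does it internally; the tag list in SKIP_TAGS and the helper name extract_main_text are assumptions chosen for illustration, and real pages need far more careful heuristics.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch (an illustrative assumption,
# not LeanCTX's actual rule set).
SKIP_TAGS = {"nav", "footer", "header", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any tag in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many boilerplate tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Hypothetical helper: return the page text with boilerplate tags dropped."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><body><nav>Home | About</nav>"
        "<p>Real content here.</p>"
        "<footer>Copyright</footer></body></html>"
    )
    text = extract_main_text(page)
    print(f"{len(page)} characters in, {len(text)} characters out")
```

Even on this toy page, most of the characters are markup and navigation rather than content, which is the overhead you would otherwise pay for in every embedding and prompt.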
Duplicates multiply silently
The same article appears on five URLs. Without content-aware deduplication you store and process it five times.
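One simple way to picture content-aware deduplication is to fingerprint the normalized text instead of the URL. The sketch below assumes exact-match hashing after whitespace and case normalization; the function names are hypothetical, and this is not a description of LeanCTX's own algorithm, which may well handle near-duplicates that a plain hash misses.

```python
import hashlib
import re


def content_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages: dict[str, str]) -> dict[str, list[str]]:
    """Map each fingerprint to the list of URLs that share that content."""
    groups: dict[str, list[str]] = {}
    for url, text in pages.items():
        groups.setdefault(content_fingerprint(text), []).append(url)
    return groups


if __name__ == "__main__":
    # Example URLs and texts are made up for illustration.
    pages = {
        "https://example.com/post/1": "Breaking:  the same article.",
        "https://example.com/amp/post/1": "breaking: the same article.",
        "https://example.com/post/2": "A different article.",
    }
    for fingerprint, urls in group_duplicates(pages).items():
        print(fingerprint[:12], urls)
```

Storing one copy per fingerprint and keeping the URL list as metadata means the article is embedded and processed once, no matter how many addresses serve it.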
Raw dumps are unsearchable
A folder of scraped HTML is not a knowledge base. Your agent needs ranked retrieval, not 10,000 files.
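What "ranked retrieval" means in practice can be sketched with a tiny TF-IDF scorer. This is a generic textbook technique under stated assumptions (a naive tokenizer, smoothed IDF, in-memory documents), not LeanCTX's retrieval engine; the names tokenize and rank are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query: str, docs: dict[str, str]) -> list[str]:
    """Return ids of documents matching the query, best TF-IDF score first."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    total_docs = len(docs)
    scores: dict[str, float] = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            doc_freq = sum(1 for c in term_counts.values() if term in c)
            if doc_freq == 0:
                continue  # term appears nowhere in the corpus
            # Smoothed inverse document frequency: rarer terms weigh more.
            idf = math.log((total_docs + 1) / (doc_freq + 1)) + 1.0
            score += counts[term] * idf
        scores[doc_id] = score
    matching = [doc_id for doc_id, s in scores.items() if s > 0]
    return sorted(matching, key=lambda doc_id: scores[doc_id], reverse=True)


if __name__ == "__main__":
    docs = {
        "a": "cookie banner consent",
        "b": "token budget for prompts and token limits",
        "c": "rss feed parsing",
    }
    print(rank("token budget", docs))
```

The point of the sketch is the interface: the agent asks a question and gets back a short ordered list, rather than walking a directory of files.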
The capabilities that do the work.
Everything below ships in the open-source binary today. No roadmap items, no waitlists.
From zero to first gain.
One guide. Two journeys. Full reference.
Questions teams ask before adopting.
Does LeanCTX replace my scraper?
No. It sits after your scraper. Keep Playwright, Scrapy or curl; pipe the output through LeanCTX and your models receive deduplicated, compressed, searchable context instead of raw HTML.
What input formats are supported?
HTML pages, PDFs, CSV files, RSS/Atom feeds, plain text, emails and YouTube transcripts. Each format has its own normalization strategy before compression.
Can I get the original page back?
Always. Every original is archived locally and retrievable via ctx_retrieve. Compression in LeanCTX is reversible by design.
Take back control of your context.
Free for local use, forever. CI enforces it. One binary, ten minutes to the first measured gain.