[Diagram: raw input (100K tokens) → LeanCTX (ctx_read, compress, filter, dedupe; AST, cache, signal) → output (~5K tok)]

Use case · Scrapers & crawlers

Scrape raw.
Feed lean.

LeanCTX turns scraped pages into model-ready context: ctx_url_read ingests HTML, PDF, RSS and YouTube transcripts, strips boilerplate, deduplicates across pages and extracts facts and quotes. Your crawler stays simple. The context layer makes its output 60–90% smaller before any model sees it.

The problem

What it costs you today.

Web sludge eats your token budget

Navigation, cookie banners, footers, ads: most of a scraped page is boilerplate you pay to embed and pay again to prompt.
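To make the cost concrete, here is a minimal sketch of boilerplate stripping using only Python's standard library. It is an illustration of the idea, not LeanCTX's actual extractor: the `SKIP_TAGS` set, the `MainTextExtractor` class and the `strip_boilerplate` helper are all hypothetical names chosen for this example, and real pages need far more careful heuristics.

```python
from html.parser import HTMLParser

# Tags whose contents this sketch treats as boilerplate (an assumption,
# not LeanCTX's actual rule set).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def strip_boilerplate(html: str) -> str:
    """Return the visible text outside boilerplate tags, one chunk per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article><h1>Quarterly results</h1><p>Revenue rose 12% year over year.</p></article>
<footer>Cookie settings | Privacy</footer>
</body></html>
"""
lean = strip_boilerplate(page)
print(lean)
print(f"{len(page)} chars in, {len(lean)} chars out")
```

Even on this toy page, most of the characters belong to markup and navigation rather than to the article itself.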

Duplicates multiply silently

The same article appears on five URLs. Without content-aware deduplication you store and process it five times.
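A rough sketch of what content-aware deduplication can look like: hash the normalized text rather than the URL, and keep only the first page seen per hash. The `content_key` and `dedupe` helpers are hypothetical names for this example, and the normalization (collapse whitespace, lowercase) is an assumption; this page does not describe how LeanCTX computes its hashes.

```python
import hashlib
import re


def content_key(text: str) -> str:
    """Hash normalized text so whitespace and case differences do not matter."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(pages):
    """Keep the first page seen for each distinct content hash.

    `pages` is an iterable of (url, text) pairs. Returns (kept, aliases),
    where aliases maps each dropped URL to the URL it duplicates.
    """
    first_url_by_key = {}
    kept, aliases = [], {}
    for url, text in pages:
        key = content_key(text)
        if key in first_url_by_key:
            aliases[url] = first_url_by_key[key]
        else:
            first_url_by_key[key] = url
            kept.append((url, text))
    return kept, aliases


crawl = [
    ("https://example.com/article", "Revenue rose 12% year over year."),
    ("https://example.com/article?utm_source=feed", "Revenue rose 12%  year over year.\n"),
    ("https://example.com/amp/article", "revenue rose 12% year over year."),
    ("https://example.com/other", "A different story."),
]
kept, aliases = dedupe(crawl)
print(len(kept), "unique of", len(crawl))
```

Exact hashing like this only catches copies that are identical after normalization. Near-duplicates (a changed byline, an injected ad block) would need a fuzzier technique such as shingling, which this sketch does not attempt.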

Raw dumps are unsearchable

A folder of scraped HTML is not a knowledge base. Your agent needs ranked retrieval, not 10,000 files.

Shipped today

The capabilities that do the work.

Everything below ships in the open-source binary today. No roadmap items, no waitlists.

Your tools → LeanCTX → Model

Universal intake

ctx_url_read handles HTML, PDF, RSS feeds and YouTube transcripts.

Facts & quotes modes

Pages collapse into attributable facts and verbatim quotes.

Deduplication

Content-hash dedup across pages, sessions and crawls.

BM25 + graph search

Everything ingested becomes locally searchable and rankable.

Local archive

Originals stay retrievable; compression is never a dead end.
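For readers unfamiliar with BM25, the ranking function named in the list above, here is a small self-contained sketch of the standard Okapi BM25 formula. It illustrates the general technique only: the `tokenize` and `bm25_rank` helpers, the parameter defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant are choices made for this example, not a description of LeanCTX's index. The graph side of the search is not covered.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return (score, doc_index) pairs sorted best-first using Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))

    scored = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms weigh more; the +1 keeps the IDF non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, i))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = [
    "Quarterly revenue rose 12% on strong cloud demand.",
    "The company opened a new office in Lisbon.",
    "Analysts expect revenue growth to slow next quarter.",
]
ranking = bm25_rank("quarterly revenue", docs)
print([index for _, index in ranking])
```

The document matching both query terms ranks first, the one matching only "revenue" second, and the unrelated one last with a score of zero.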

Quickstart

From zero to first gain.

# ingest a page as attributable facts

$ ctx_url_read("https://example.com/article", mode="facts")

# ingest a feed; items arrive dated and deduplicated

$ ctx_url_read("https://news.site/feed.xml")

# search everything you ingested

$ lean-ctx grep "quarterly revenue"

# retrieve a full original when needed

$ ctx_retrieve(id)

Go deeper

One guide. Two journeys. Full reference.

guide · Data sources guide
Every intake format and how it is normalized.

journey · Beyond coding: web research
From URL to cited answer, end to end.

journey · Memory & knowledge
How ingested content persists and ranks.

reference · Read modes reference
facts, quotes, entropy and friends.

FAQ

Questions teams ask before adopting.

Does LeanCTX replace my scraper?

No. It sits after your scraper. Keep Playwright, Scrapy or curl; pipe the output through LeanCTX and your models receive deduplicated, compressed, searchable context instead of raw HTML.

What input formats are supported?

HTML pages, PDFs, CSV, RSS/Atom feeds, plain text, email and YouTube transcripts. Each format has its own normalization strategy before compression.

Can I get the original page back?

Always. Every original is archived locally and retrievable via ctx_retrieve. Compression in LeanCTX is reversible by design.
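One way to picture "compress, but keep the original retrievable" is the pattern below: archive the original first, then hand downstream consumers a short form plus an id. This is a hypothetical sketch, not LeanCTX's storage format; `LocalArchive`, `ingest`, the on-disk layout and the truncation used as a stand-in for real compression are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


class LocalArchive:
    """Stores originals on disk, keyed by content hash (hypothetical layout)."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, original: str) -> str:
        doc_id = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        (self.root / f"{doc_id}.txt").write_text(original, encoding="utf-8")
        return doc_id

    def get(self, doc_id: str) -> str:
        return (self.root / f"{doc_id}.txt").read_text(encoding="utf-8")


def ingest(original: str, archive: LocalArchive, max_chars: int = 80):
    """Archive the original first, then return its id and a lossy short form."""
    doc_id = archive.put(original)
    compact = original[:max_chars]  # stand-in for real compression
    return doc_id, compact


with tempfile.TemporaryDirectory() as tmp:
    archive = LocalArchive(Path(tmp) / "originals")
    page = "Revenue rose 12% year over year. " * 20
    doc_id, compact = ingest(page, archive)
    restored = archive.get(doc_id)
    print(len(compact), len(restored), restored == page)
```

The short form can be as lossy as you like, because the id always leads back to the full original.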

Take back control of your context.

Free for local use, forever. CI enforces it. One binary, ten minutes to the first measured gain.
