Core Concepts
Web & Research
ctx_url_read pulls web pages, PDFs and YouTube videos into context as compressed, citation-backed text — research, docs and transcripts without leaving the agent loop.
LeanCTX is not only for code. ctx_url_read is the web-research layer: it fetches a
public URL — an HTML page, a PDF, or a YouTube video — and returns compressed,
citation-backed text the model can read, quote and reason over. It is the web
counterpart of ctx_read: one tool call, one token budget, boilerplate stripped, the
source preserved for citation.
This covers the agent use-cases that live beyond the codebase: reading a changelog or an RFC, pulling an API spec, summarising a blog post, or extracting the claims from a documentation page — all without pasting raw HTML into the context window.
What it reads
- Web pages — HTML is parsed, the main article is extracted, navigation and ads are dropped, and the result is returned as clean Markdown or plain text.
- PDFs — remote PDFs are downloaded and converted to text, so specs, papers and datasheets become readable context.
- YouTube — a video URL is resolved to its transcript and flattened into compact, quotable text.
Distillation modes
The mode argument controls how the fetched content is distilled. The default,
auto, picks Markdown for pages and a transcript for videos.
| Mode | What you get |
|---|---|
| auto | Markdown for pages, transcript for videos (default). |
| markdown | Clean Markdown of the extracted article. |
| text | Plain text, no formatting. |
| links | The outbound links on the page, for crawling. |
| facts | Key claims, each with a confidence score and source URL. |
| quotes | Verbatim quotes relevant to your query, with their source. |
| transcript | The flattened video transcript. |
Citations & evidence
The facts and quotes modes do not just summarise — they return discrete
claims, each carrying a confidence score and the source URL it
came from. That makes web research auditable: the agent can attribute every statement to where it
read it, and you can verify it later. Pass a query to focus extraction on the part of
the page you actually care about.
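As an illustration of how an agent might consume such claims, here is a minimal sketch. The Claim class, its field names and the 0 to 1 confidence range are assumptions made for this example; the actual output format of the facts and quotes modes is not specified on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical shape of one extracted claim; field names are assumptions."""
    text: str
    confidence: float  # assumed to lie in [0.0, 1.0]
    source_url: str


def cite(claims, min_confidence=0.7):
    """Keep claims at or above the threshold and render each with its source."""
    kept = [c for c in claims if c.confidence >= min_confidence]
    # Most confident claims first.
    kept.sort(key=lambda c: c.confidence, reverse=True)
    return [f"{c.text} [source: {c.source_url}]" for c in kept]
```

Filtering on the confidence score before quoting keeps weakly supported claims out of the final answer, while the attached URL lets a reader check each remaining statement.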
Research compression
A single documentation page can overflow a context window. ctx_url_read distils the
fetched content down to a token budget (max_tokens, default 6000) using extractive,
relevance-ranked compression — keeping the parts that answer your query and dropping the rest.
For facts and quotes, max_items caps how many claims come
back (default 12). You get the signal, not the whole page.
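The idea of extractive, relevance-ranked compression can be sketched as follows. This is not LeanCTX's algorithm: the paragraph-level scoring, the word-count approximation of tokens and the function names are all assumptions chosen to keep the example small.

```python
import re


def tokenize(text):
    """Lowercased word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def compress(text, query, max_tokens=6000):
    """Keep the most query-relevant paragraphs that fit the budget, in source order."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    query_terms = set(tokenize(query))

    scored = []
    for index, paragraph in enumerate(paragraphs):
        tokens = tokenize(paragraph)
        overlap = sum(1 for t in tokens if t in query_terms)
        # Fraction of the paragraph's tokens that match a query term.
        score = overlap / len(tokens) if tokens else 0.0
        scored.append((score, index, len(tokens)))

    # Highest score first; ties go to the earlier paragraph.
    scored.sort(key=lambda item: (-item[0], item[1]))

    kept, used = [], 0
    for score, index, size in scored:
        if score == 0.0:
            break  # everything after this point never mentions the query
        if used + size <= max_tokens:
            kept.append(index)
            used += size

    # Extractive: the original paragraphs are returned unmodified.
    return "\n\n".join(paragraphs[i] for i in sorted(kept))
```

Because the sketch only selects existing paragraphs and never rewrites them, anything it returns can still be quoted verbatim against the source.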
Safety
Fetching is guarded against server-side request forgery (SSRF). Only http and https URLs are allowed, and
requests to private, loopback and link-local addresses are blocked, so an agent cannot be steered
into probing your internal network. Requests honour a timeout (timeout_secs, default
20, max 60).
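A guard of this kind can be sketched with Python's standard library. This is an illustration of the technique, not LeanCTX's implementation: the function names are hypothetical, and the extra checks for multicast, reserved and unspecified addresses go beyond the three categories named above.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(address):
    """True when the IP is not private, loopback, link-local or otherwise special."""
    ip = ipaddress.ip_address(address)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast      # extra check, not named in the text above
        or ip.is_reserved       # extra check, not named in the text above
        or ip.is_unspecified    # extra check, not named in the text above
    )


def resolve_host(host):
    """Resolve a hostname to the set of IP addresses it maps to."""
    infos = socket.getaddrinfo(host, None)
    return {info[4][0] for info in infos}


def is_url_allowed(url, resolve=resolve_host):
    """Allow only http(s) URLs whose host resolves solely to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = resolve(parts.hostname)
    except OSError:
        return False  # unresolvable hosts are rejected
    # One private address among several is enough to reject the URL.
    return bool(addresses) and all(is_public_address(a) for a in addresses)
```

A production guard would also have to connect to the exact address it validated and re-check every redirect; otherwise a hostname could resolve to a public address during the check and to an internal one during the fetch.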
Examples
Read an article in auto mode:
ctx_url_read url="https://example.com/post"
Extract claims relevant to a query, each with a source for citation:
ctx_url_read url="https://example.com/spec" mode="facts" query="rate limits"
Pull a YouTube transcript:
ctx_url_read url="https://youtu.be/VIDEO" mode="transcript"
Read a remote PDF as text within a 3000-token budget:
ctx_url_read url="https://example.com/paper.pdf" mode="text" max_tokens=3000
Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| url | string | (none) | Required. The http(s) URL of a page, PDF or YouTube video. |
| mode | string | auto | Distillation mode (see table above). |
| query | string | (none) | Optional focus query; boosts relevance in facts/quotes. |
| max_tokens | integer | 6000 | Token budget for the returned content. |
| max_items | integer | 12 | Max claims for facts/quotes. |
| timeout_secs | integer | 20 | Request timeout in seconds (max 60). |
Setup
ctx_url_read ships with the binary and is exposed automatically wherever LeanCTX runs
as an MCP server — no extra configuration. If your agent connects over the
standard MCP setup, the tool is already available; just call
it. See the Beyond Coding journey for an
end-to-end research workflow.
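For clients that talk to the MCP server directly, a call is a JSON-RPC tools/call request. The sketch below builds such a request from the documented arguments and defaults. The build_request helper is hypothetical, and its validation rules (a lower bound of 1 on the numeric arguments, sending max_items only for facts and quotes) are assumptions for this example, not documented server behaviour.

```python
import json

# Mode names taken from the distillation modes table.
MODES = {"auto", "markdown", "text", "links", "facts", "quotes", "transcript"}


def build_request(url, mode="auto", query=None, max_tokens=6000,
                  max_items=12, timeout_secs=20, request_id=1):
    """Check arguments against the documented limits and wrap them in a tools/call request."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be an http(s) URL")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if not 1 <= timeout_secs <= 60:  # lower bound of 1 is an assumption
        raise ValueError("timeout_secs must be between 1 and 60")
    if max_tokens < 1 or max_items < 1:  # assumed sanity check
        raise ValueError("max_tokens and max_items must be positive")

    arguments = {
        "url": url,
        "mode": mode,
        "max_tokens": max_tokens,
        "timeout_secs": timeout_secs,
    }
    if query is not None:
        arguments["query"] = query
    if mode in ("facts", "quotes"):
        arguments["max_items"] = max_items  # only meaningful for claim modes

    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "ctx_url_read", "arguments": arguments},
    }


if __name__ == "__main__":
    request = build_request("https://example.com/spec", mode="facts", query="rate limits")
    print(json.dumps(request, indent=2))
```

Most agents never build this payload by hand, since the MCP client library does it for them; the sketch only shows how the arguments from the table map onto the request.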