Web & Research

ctx_url_read pulls web pages, PDFs and YouTube videos into context as compressed, citation-backed text — research, docs and transcripts without leaving the agent loop.

LeanCTX is not only for code. ctx_url_read is the web-research layer: it fetches a public URL — an HTML page, a PDF, or a YouTube video — and returns compressed, citation-backed text the model can read, quote and reason over. It is the web counterpart of ctx_read: one tool call, one token budget, boilerplate stripped, the source preserved for citation.

This covers the agent use-cases that live beyond the codebase: reading a changelog or an RFC, pulling an API spec, summarising a blog post, or extracting the claims from a documentation page — all without pasting raw HTML into the context window.

What it reads

  • Web pages — HTML is parsed, the main article is extracted, navigation and ads are dropped, and the result is returned as clean Markdown or plain text.
  • PDFs — remote PDFs are downloaded and converted to text, so specs, papers and datasheets become readable context.
  • YouTube — a video URL is resolved to its transcript and flattened into compact, quotable text.
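The three source types above can be told apart, to a first approximation, from the URL alone. The sketch below is illustrative and does not describe how ctx_url_read actually detects content: the function name, the host list and the ".pdf" suffix check are all assumptions, and a real fetcher would more likely inspect the response's Content-Type.

```python
from urllib.parse import urlparse

# Hypothetical host list; the real tool may recognise other YouTube hostnames.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def guess_source_kind(url: str) -> str:
    """Return 'youtube', 'pdf' or 'page' based only on the shape of the URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in YOUTUBE_HOSTS:
        return "youtube"
    # Assumption: a path ending in .pdf is a PDF. Servers can serve PDFs
    # from any path, so this is only a heuristic.
    if parsed.path.lower().endswith(".pdf"):
        return "pdf"
    return "page"


for example in ("https://example.com/post",
                "https://example.com/paper.pdf",
                "https://youtu.be/VIDEO"):
    print(example, "->", guess_source_kind(example))
```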

Distillation modes

The mode argument controls how the fetched content is distilled. The default, auto, picks Markdown for pages and a transcript for videos.

| Mode | What you get |
| --- | --- |
| auto | Markdown for pages, transcript for videos (default). |
| markdown | Clean Markdown of the extracted article. |
| text | Plain text, no formatting. |
| links | The outbound links on the page, for crawling. |
| facts | Key claims, each with a confidence score and source URL. |
| quotes | Verbatim quotes relevant to your query, with their source. |
| transcript | The flattened video transcript. |

Citations & evidence

The facts and quotes modes do not just summarise — they return discrete claims, each carrying a confidence score and the source URL it came from. That makes web research auditable: the agent can attribute every statement to where it read it, and you can verify it later. Pass a query to focus extraction on the part of the page you actually care about.
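To make "auditable" concrete, here is a minimal sketch of how an agent might hold and filter such claims. The Claim type, its field names, the 0-to-1 confidence scale and the sample values are all assumptions made for illustration; the page does not specify the shape of the tool's output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical shape for one extracted claim."""
    text: str
    confidence: float  # assumed to lie in [0.0, 1.0]
    source_url: str


def cite(claims, min_confidence=0.7):
    """Keep confident claims, strongest first, each tagged with its source."""
    kept = [c for c in claims if c.confidence >= min_confidence]
    kept.sort(key=lambda c: c.confidence, reverse=True)
    return [f"{c.text} [{c.source_url}]" for c in kept]


# Invented sample data, not real tool output.
claims = [
    Claim("The limit is 100 requests per minute", 0.92, "https://example.com/spec"),
    Claim("Limits may change without notice", 0.40, "https://example.com/spec"),
    Claim("Excess requests receive HTTP 429", 0.75, "https://example.com/spec#errors"),
]
for line in cite(claims):
    print(line)
```

Keeping the source URL on every claim, rather than once per response, means a statement can still be traced after claims from several pages have been merged.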

Research compression

A single documentation page can overflow a context window. ctx_url_read distils the fetched content down to a token budget (max_tokens, default 6000) using extractive, relevance-ranked compression: it keeps the parts that answer your query and drops the rest. For facts and quotes, max_items caps how many claims come back (default 12). You get the signal, not the whole page.
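The general idea of extractive, relevance-ranked compression can be sketched in a few lines. This is a toy illustration under stated assumptions, not the tool's algorithm: it scores sentences by word overlap with the query, approximates tokens by word count, and greedily fills the budget. The function names are hypothetical.

```python
import re


def words(text):
    """Lowercased alphanumeric words; a crude stand-in for real tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def compress(text, query, max_tokens):
    """Keep the most query-relevant sentences that fit in max_tokens words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    query_terms = set(words(query))

    scored = []
    for index, sentence in enumerate(sentences):
        sentence_words = words(sentence)
        overlap = sum(1 for w in sentence_words if w in query_terms)
        scored.append((overlap, index, len(sentence_words)))

    # Most relevant first; ties keep their original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))

    budget = max_tokens
    chosen = []
    for _overlap, index, cost in scored:
        if cost <= budget:
            chosen.append(index)
            budget -= cost

    # Emit the survivors in document order so the result still reads naturally.
    return " ".join(sentences[i] for i in sorted(chosen))


page = ("Install the CLI first. Rate limits are 100 requests per minute. "
        "The logo is blue. Exceeding rate limits returns HTTP 429.")
print(compress(page, "rate limits", max_tokens=14))
```

Because the method is extractive, every sentence in the output appears verbatim in the source, which is what keeps quoting and citation possible after compression.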

Safety

Fetching is SSRF-guarded. Only http and https URLs are allowed, and requests to private, loopback and link-local addresses are blocked — so an agent cannot be steered into probing your internal network. Requests honour a timeout (timeout_secs, default 20, max 60).
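A guard of this kind can be sketched with Python's standard library. This is an assumption-laden illustration, not LeanCTX's implementation: the function names are hypothetical, and a production guard would also need to re-check redirect targets and pin the resolved address for the actual connection, which this sketch does not do.

```python
import ipaddress
import socket
from urllib.parse import urlparse

ALLOWED_SCHEMES = ("http", "https")


def is_blocked_address(ip_text: str) -> bool:
    """True for private, loopback and link-local addresses."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local


def assert_fetchable(url: str, resolve=socket.getaddrinfo) -> None:
    """Raise ValueError if the URL must not be fetched.

    `resolve` is injectable so the check can be exercised without DNS.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("URL has no host")
    # Check every address the host resolves to, not just the first one.
    for info in resolve(parsed.hostname, None):
        address = info[4][0].split("%")[0]  # drop any IPv6 zone suffix
        if is_blocked_address(address):
            raise ValueError(f"blocked address: {address}")
```

Checking the resolved addresses rather than the hostname string matters because a public-looking name can resolve to an internal address.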

Examples

Read an article in auto mode:

ctx_url_read url="https://example.com/post"

Extract claims relevant to a query, each with a source for citation:

ctx_url_read url="https://example.com/spec" mode="facts" query="rate limits"

Pull a YouTube transcript:

ctx_url_read url="https://youtu.be/VIDEO" mode="transcript"

Read a remote PDF as text within a 3000-token budget:

ctx_url_read url="https://example.com/paper.pdf" mode="text" max_tokens=3000
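When LeanCTX runs as an MCP server, calls like the ones above travel as JSON-RPC tools/call requests. The sketch below assumes the standard MCP request shape; the helper name is hypothetical, and in practice the agent's MCP client builds this message for you.

```python
import json


def build_tool_call(request_id, **arguments):
    """Build an MCP tools/call request for ctx_url_read (assumed wire shape)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "ctx_url_read", "arguments": arguments},
    }


request = build_tool_call(
    1,
    url="https://example.com/spec",
    mode="facts",
    query="rate limits",
)
print(json.dumps(request, indent=2))
```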

Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | | Required. The http(s) URL of a page, PDF or YouTube video. |
| mode | string | auto | Distillation mode (see table above). |
| query | string | | Optional focus query; boosts relevance in facts/quotes. |
| max_tokens | integer | 6000 | Token budget for the returned content. |
| max_items | integer | 12 | Max claims for facts/quotes. |
| timeout_secs | integer | 20 | Request timeout in seconds (max 60). |
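A caller that assembles arguments programmatically may want to validate them before sending. The sketch below mirrors the defaults and limits listed above; the class name is hypothetical, and rejecting an out-of-range timeout_secs is an assumption, since the page does not say whether the tool rejects or clamps such values.

```python
from dataclasses import dataclass
from typing import Optional

MODES = {"auto", "markdown", "text", "links", "facts", "quotes", "transcript"}


@dataclass
class UrlReadArgs:
    """Client-side mirror of the ctx_url_read arguments (illustrative only)."""
    url: str
    mode: str = "auto"
    query: Optional[str] = None
    max_tokens: int = 6000
    max_items: int = 12
    timeout_secs: int = 20

    def __post_init__(self):
        if not self.url.startswith(("http://", "https://")):
            raise ValueError("url must be an http(s) URL")
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")
        if self.max_tokens <= 0 or self.max_items <= 0:
            raise ValueError("max_tokens and max_items must be positive")
        # Assumption: values outside 1..60 are rejected rather than clamped.
        if not 1 <= self.timeout_secs <= 60:
            raise ValueError("timeout_secs must be between 1 and 60")


args = UrlReadArgs(url="https://example.com/paper.pdf", mode="text", max_tokens=3000)
print(args)
```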

Setup

ctx_url_read ships with the binary and is exposed automatically wherever LeanCTX runs as an MCP server — no extra configuration. If your agent connects over the standard MCP setup, the tool is already available; just call it. See the Beyond Coding journey for an end-to-end research workflow.