Core Concepts

Web & Research

ctx_url_read pulls web pages, PDFs and YouTube videos into context as compressed, citation-backed text — research, docs and transcripts without leaving the agent loop.

LeanCTX also reads the web. ctx_url_read is the research layer: it fetches a public URL (an HTML page, a PDF, or a YouTube video) and returns compressed, citation-backed text the model can read, quote and reason over. It is the web counterpart of ctx_read: one tool call, one token budget, boilerplate stripped, the source preserved for citation.

This covers the agent use cases that live beyond the codebase: reading a changelog or an RFC, pulling an API spec, summarising a blog post, or extracting the claims from a documentation page, all without pasting raw HTML into the context window.

What it reads

Web pages — HTML is parsed, the main article is extracted, navigation and ads are dropped, and the result is returned as clean Markdown or plain text. Tables are preserved as GitHub-Flavored Markdown, so tabular data survives the trip into context instead of collapsing into a wall of text.
PDFs — remote PDFs are downloaded and converted to text, so specs, papers and datasheets become readable context.
RSS / Atom feeds — a feed URL is parsed into a dated list of items (title, link, summary) instead of raw XML, so news, release and blog feeds drop straight into context.
YouTube — a video URL is resolved to its transcript and flattened into compact, quotable text.
GitHub files — github.com/…/blob/… (and raw) URLs auto-resolve to the underlying raw file, so you get the source, not the rendered page chrome.
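
The blob-to-raw resolution described in the GitHub files item can be pictured with a small sketch. This is an illustration, not LeanCTX's implementation: the function name `github_blob_to_raw` is hypothetical, and it assumes the branch or tag is a single path segment (a ref such as `feature/x` would need extra handling).

```python
from urllib.parse import urlparse


def github_blob_to_raw(url: str) -> str:
    """Map a github.com blob URL to its raw.githubusercontent.com form.

    Any URL that does not look like a blob link is returned unchanged.
    Assumption: the ref (branch or tag) is exactly one path segment.
    """
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    # Expected shape: owner/repo/blob/<ref>/<path...>
    is_blob = (
        parsed.netloc.lower() == "github.com"
        and len(parts) >= 5
        and parts[2] == "blob"
    )
    if not is_blob:
        return url
    owner, repo, _, ref, *path = parts
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{'/'.join(path)}"
```

Called on `https://github.com/owner/repo/blob/main/src/main.rs`, the sketch yields the matching `raw.githubusercontent.com` URL; other URLs pass through untouched.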

Distillation modes

The mode argument controls how the fetched content is distilled. The default, auto, picks Markdown for pages and a transcript for videos.

Mode	What you get
`auto`	Markdown for pages, transcript for videos (default).
`markdown`	Clean Markdown of the extracted article.
`text`	Plain text, no formatting.
`links`	The outbound links on the page, for crawling.
`facts`	Key claims, each with a confidence score and source URL.
`quotes`	Verbatim quotes relevant to your query, with their source.
`transcript`	The flattened video transcript.
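
As a rough sketch of how `auto` could resolve to a concrete mode, the snippet below validates the mode name against the table above and picks `transcript` for video hosts and `markdown` otherwise. The `resolve_mode` helper and the `VIDEO_HOSTS` list are assumptions made for illustration; the real detection logic may differ.

```python
from urllib.parse import urlparse

# Mode names taken from the table above.
MODES = {"auto", "markdown", "text", "links", "facts", "quotes", "transcript"}

# Assumed host list, for illustration only.
VIDEO_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def resolve_mode(url: str, mode: str = "auto") -> str:
    """Return the concrete mode to use for this URL."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode != "auto":
        return mode  # an explicit mode always wins
    host = (urlparse(url).hostname or "").lower()
    return "transcript" if host in VIDEO_HOSTS else "markdown"
```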

Citations & evidence

The facts and quotes modes go further than a summary: they return discrete claims, each carrying a confidence score and the source URL it came from. That makes web research auditable: the agent can attribute every statement to where it read it, and you can verify it later. Pass a query to focus extraction on the part of the page you actually care about.
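
To make the idea of auditable claims concrete, here is a minimal sketch of how an agent might hold and cite them. The `Claim` type, its field names, the 0 to 1 confidence range and the `cite` helper are all hypothetical; the actual shape of the tool's output is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    confidence: float  # assumed to lie in [0, 1]
    source_url: str


def cite(claims, min_confidence=0.5):
    """Drop low-confidence claims and render the rest as citation lines,
    highest confidence first."""
    kept = sorted(
        (c for c in claims if c.confidence >= min_confidence),
        key=lambda c: c.confidence,
        reverse=True,
    )
    return [f"{c.text} [{c.source_url}]" for c in kept]
```

Keeping the source URL attached to each claim is what lets a later reader check the statement against the page it came from.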

Research compression

A single documentation page can blow a context window. ctx_url_read distils the fetched content down to a token budget (max_tokens, default 6000) using extractive, relevance-ranked compression, keeping the parts that answer your query and dropping the rest. For facts and quotes, max_items caps how many claims come back (default 12). You get the signal, not the whole page.
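
The general technique of extractive, relevance-ranked compression can be sketched in a few lines. This is a toy version under stated assumptions, not the LeanCTX pipeline: tokens are approximated by whitespace-separated words, relevance is plain query-term overlap, and the `compress` and `count_tokens` names are invented for the example.

```python
import re


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def compress(text: str, query: str, max_tokens: int = 6000) -> str:
    """Keep the paragraphs most relevant to the query within the budget."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    terms = set(re.findall(r"\w+", query.lower()))

    def score(paragraph: str) -> int:
        # Number of words in the paragraph that match a query term.
        return sum(1 for w in re.findall(r"\w+", paragraph.lower()) if w in terms)

    # Rank by relevance; sorted() is stable, so ties keep document order.
    ranked = sorted(range(len(paragraphs)), key=lambda i: -score(paragraphs[i]))

    chosen, used = [], 0
    for i in ranked:
        cost = count_tokens(paragraphs[i])
        if used + cost <= max_tokens:
            chosen.append(i)
            used += cost

    # Emit the surviving paragraphs verbatim, in their original order.
    return "\n\n".join(paragraphs[i] for i in sorted(chosen))
```

Because the selected paragraphs are copied verbatim rather than paraphrased, anything that survives the cut can still be quoted exactly.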

The active persona shapes this pipeline: flowing-text modes (auto, markdown, text, transcript) run through the persona's compressor, and budget cuts land on the persona chunker's boundaries — the research persona trims at paragraph breaks instead of mid-sentence. Extractive modes (facts, quotes, links) always stay verbatim so citations survive untouched.
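
Trimming at paragraph breaks instead of mid-sentence amounts to cutting only on chunk boundaries. The sketch below shows that idea in isolation; `trim_at_paragraphs` is a hypothetical helper, it uses word counts as a rough token proxy, and it makes no claim about how the persona chunker is actually built.

```python
def trim_at_paragraphs(text: str, max_tokens: int) -> str:
    """Keep the leading whole paragraphs that fit the budget.

    A paragraph that would overflow the budget is dropped entirely,
    so the output never ends mid-sentence.
    """
    kept, used = [], 0
    for paragraph in text.split("\n\n"):
        cost = len(paragraph.split())  # word count as a rough token proxy
        if used + cost > max_tokens:
            break
        kept.append(paragraph)
        used += cost
    return "\n\n".join(kept)
```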

Safety

Fetching is SSRF-guarded. Only http and https URLs are allowed, and requests to private, loopback and link-local addresses are blocked, so an agent cannot be steered into probing your internal network. Requests honour a timeout (timeout_secs, default 20, max 60).
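
A guard of this kind is often built along the following lines. This is a minimal sketch using Python's standard library, not the LeanCTX code: `is_blocked_ip`, `check_url` and the injectable `resolve` parameter are names chosen for the example.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_blocked_ip(ip: str) -> bool:
    """True for private, loopback and link-local addresses."""
    addr = ipaddress.ip_address(ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local


def _resolve(host: str):
    # All addresses the name resolves to; each one must pass the check.
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def check_url(url: str, resolve=None) -> None:
    """Raise ValueError if the URL should not be fetched."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"scheme not allowed: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("URL has no host")
    for ip in (resolve or _resolve)(parsed.hostname):
        if is_blocked_ip(ip):
            raise ValueError(f"blocked address: {ip}")
```

A production guard typically also connects to the exact address it validated and re-checks every redirect target, since a hostname can resolve differently between the check and the fetch.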

Examples

Read an article in auto mode:

ctx_url_read url="https://example.com/post"

Extract claims relevant to a query, each with a source for citation:

ctx_url_read url="https://example.com/spec" mode="facts" query="rate limits"

Pull a YouTube transcript:

ctx_url_read url="https://youtu.be/VIDEO" mode="transcript"

Read a remote PDF as text within a 3000-token budget:

ctx_url_read url="https://example.com/paper.pdf" mode="text" max_tokens=3000

Arguments

Argument	Type	Default	Description
`url`	string	—	Required. The http(s) URL of a page, PDF or YouTube video.
`mode`	string	`auto`	Distillation mode (see table above).
`query`	string	—	Optional focus query; boosts relevance in `facts`/`quotes`.
`max_tokens`	integer	6000	Token budget for the returned content.
`max_items`	integer	12	Max claims for `facts`/`quotes`.
`timeout_secs`	integer	20	Request timeout in seconds (max 60).
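
For callers that want to validate arguments before making the call, the table translates naturally into a small typed structure. The `UrlReadArgs` class below is a hypothetical client-side helper; in particular, rejecting an out-of-range `timeout_secs` (rather than clamping it) is an assumption, since the table only states the maximum.

```python
from dataclasses import dataclass
from typing import Optional

MODES = ("auto", "markdown", "text", "links", "facts", "quotes", "transcript")


@dataclass
class UrlReadArgs:
    url: str
    mode: str = "auto"
    query: Optional[str] = None
    max_tokens: int = 6000
    max_items: int = 12
    timeout_secs: int = 20

    def __post_init__(self) -> None:
        if not self.url.startswith(("http://", "https://")):
            raise ValueError("url must be an http(s) URL")
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")
        if self.max_tokens <= 0 or self.max_items <= 0:
            raise ValueError("max_tokens and max_items must be positive")
        # Assumption: out-of-range timeouts are rejected, not clamped.
        if not 1 <= self.timeout_secs <= 60:
            raise ValueError("timeout_secs must be between 1 and 60")
```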

Reading whole repositories — `ctx_git_read`

For source code, one file is rarely enough and a blob page is mostly navigation chrome. ctx_git_read reads a remote git repository the right way: it performs a cached, shallow (--depth 1) clone and lets the agent browse the file tree, read a file, or grep across the repo, returning the real source within a token budget instead of scraped HTML. Point it at a public GitHub, GitLab or Bitbucket URL; an optional blob/tree link carries the branch and path. Like ctx_url_read, it is SSRF-guarded to public https repositories, and the clone is reused across calls so repeated reads stay cheap.

ctx_git_read url="https://github.com/owner/repo"                      # tree + README
ctx_git_read url="https://github.com/owner/repo" path="src/main.rs"   # one file
ctx_git_read url="https://github.com/owner/repo" mode="grep" query="fn main"
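
The cached shallow clone can be pictured as a plan made of a `git clone --depth 1` command and a cache directory derived from the repository URL. The sketch below only builds that plan and does not run git. Its function names, the `/tmp/git-cache` location and the hashing scheme are assumptions, and it handles GitHub-style `blob`/`tree` links with a single-segment branch only.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def parse_repo_link(url: str):
    """Split a repo, blob or tree link into (clone_url, ref, path)."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    if len(parts) < 2 or not all(parts[:2]):
        raise ValueError("expected a URL of the form https://host/owner/repo")
    clone_url = f"https://{parsed.netloc}/{parts[0]}/{parts[1]}"
    ref, path = None, None
    if len(parts) >= 4 and parts[2] in ("blob", "tree"):
        ref = parts[3]
        path = "/".join(parts[4:]) or None
    return clone_url, ref, path


def clone_plan(url: str, cache_root: str = "/tmp/git-cache"):
    """Return (command, cache_dir, path) for a shallow clone of the repo."""
    clone_url, ref, path = parse_repo_link(url)
    # Same repo and ref map to the same directory, so a clone can be reused.
    key = hashlib.sha256(f"{clone_url}@{ref}".encode()).hexdigest()[:16]
    dest = Path(cache_root) / key
    cmd = ["git", "clone", "--depth", "1"]
    if ref:
        cmd += ["--branch", ref]
    cmd += [clone_url, str(dest)]
    return cmd, dest, path
```

A caller would run the command only when the cache directory does not exist yet, which is what keeps repeated reads cheap.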

Setup

ctx_url_read and ctx_git_read ship with the binary and are exposed automatically wherever LeanCTX runs as an MCP server, with no extra configuration. If your agent connects over the standard MCP setup, the tools are already available; just call them. See the Beyond Coding journey for an end-to-end research workflow.
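
Most agents never build the request by hand, but it can help to see what a tool call looks like on the wire. Assuming the standard MCP JSON-RPC framing, a `tools/call` request for ctx_url_read might be assembled as below; the `tools_call` helper is hypothetical and the transport (stdio or HTTP) is left out.

```python
import json


def tools_call(name: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": name, "arguments": arguments},
        }
    )


message = tools_call(
    "ctx_url_read",
    {"url": "https://example.com/spec", "mode": "facts", "query": "rate limits"},
)
```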