Core Concepts

Request Compression (the Proxy)

LeanCTX works on two planes. The read path compresses what your agent reads; the wire path is an optional local proxy that compresses every request to the model — system prompt, full history and tool results — before it is sent, with prompt-cache safety and measured spend. You do not need a separate request-compression proxy on top.

LeanCTX works on two planes. Most of the docs describe the read path (agent → data): file reads, code search and shell output, compressed before your agent ever sees them. This page is about the other plane — the wire path (agent → model): an optional local proxy that compresses every request before it is sent to the provider. Read-side savings trim what enters the context window; wire-side savings trim every request you pay for, turn after turn.

What the proxy is

A local reverse proxy that sits between your AI tool and the model provider — Anthropic (/v1/messages), OpenAI (/v1/chat/completions and /v1/responses), Gemini, plus a Codex WebSocket bridge. It binds to 127.0.0.1 only, authenticates every request, and forwards your tool's own provider key verbatim — it never injects credentials of its own. It is compiled into the default binary; you just turn it on.

What it compresses

On every request, the proxy compresses the parts that grow without bound: the system prompt, the full conversation history, and tool results. This is the request-side counterpart to read-side compression: the same engine that shrinks a file read also shrinks the payload leaving for the model. After 10–15 turns a chat can carry 100K+ tokens of history with every new request — the proxy is what keeps that in check automatically, instead of asking you to start fresh chats.

Prompt-cache safe by design

Naive request compression breaks provider prompt caching (Anthropic, OpenAI), which would cost more than it saves. The proxy avoids that: history pruning and a cold-prefix repack only rewrite the parts of the request that are not a live cache prefix, so the cache stays warm and the discount is preserved. Compression that would invalidate a hot cache is skipped.

Active prompt-cache breakpoints v3.8.14

Beyond preserving a cache the client set up, the proxy can create one a raw API client left on the table. With proxy.cache_breakpoint enabled (env LEAN_CTX_PROXY_CACHE_BREAKPOINT, off by default), it adds a single ephemeral cache_control marker to the system field of an Anthropic request — but only when the client set none of its own. A large, stable system prompt then bills later turns at the cached rate instead of full price every turn.

It is Anthropic-only by construction (OpenAI and Gemini cache prefixes automatically and ignore the marker, so those paths stay byte-unchanged), deterministic (a pure function of the body, so the prefix it creates is itself byte-stable), never adds a second breakpoint, and is skipped below Anthropic's minimum cacheable size so it never churns bytes for no cache. Every injection is counted on the breakpoints_injected gauge in /status — a pure win signal, never against the cache-safe ratio.

A companion measurement, proxy.cache_aligner (on by default — it is measurement-only and strictly cache-safe), scans each unanchored system prompt for cache-busting volatile tokens — today's date, a fresh UUID, a git SHA — and reports how many it found (volatile_fields_detected) so you can quantify how much prompt-cache your prompt leaks. The request body is never mutated; set it to false to drop the per-request scan.

proxy.cache_policy — cache-economics, also on by default — adds the diagnosis layer: every anchored turn is classified by why it missed the prompt-cache (cold start, TTL lapse, or prefix change) and surfaced as cumulative /status gauges, paired with a net-cost gate that refuses to re-seed a prefix too small to ever be cached. Both halves are strictly safe — the telemetry never touches the body and the gate only makes a repack more conservative — so they ship on out of the box.

Deterministic structural crushing v3.8.14

Tool results are often array-heavy JSON — API responses, kubectl get -o json, DB dumps, RAG chunks — that repeat the same keys and values on every row. The json_crush engine factors that redundancy out: every key shared by all items is hoisted to a _defaults block and only per-row deviations are kept, so the result is exactly reconstructible and never inflates. The output is a pure function of the input (candidate keys and value frequencies walk ordered sets), so it is byte-stable and never leaks hash-map order — the determinism that keeps the prompt-cache prefix warm.

On the read and shell path the same core powers the generic JSON fallback and an opt-in mode for otherwise-verbatim data commands (gh api, jq, curl JSON) via crush_verbatim_json, firing only when it at least halves the payload. A dropped high-entropy column (timestamps, UUIDs) stays recoverable through a content-addressed ctx_expand handle, so a datum is never lost.

The same columnar core now reaches beyond JSON. A CSV/TSV crusher hoists every column that repeats a single value across all rows into a _const block and keeps the rest positionally; a YAML crusher folds YAML into the compact JSON form and factors its shared structure. Both are deterministic and never inflate, and on their lossy path any dropped high-entropy column stays recoverable through the same ctx_expand handle. The ctx_compare tool previews any pipeline before you trust it — original vs the exact bytes lean-ctx would emit, with token counts and a line diff.

Adaptive aggressiveness v3.8.14

Compression that is too aggressive shows itself: the agent keeps pulling back the originals it dropped. LeanCTX treats ctx_expand/ctx_retrieve re-fetches as exactly that signal and, combined with read/run correction loops, dials a session's compression down to Lite (3+ signals) then Off (5+), recovering only when the pressure clears. The level is server state that feeds future decisions — never part of any tool-output body, so output stays deterministic.

Cross-provider effort control — cache-safe by design

The proxy can pin one reasoning-effort level across every provider with a single setting (proxy.effort = minimal | low | medium | high). LeanCTX translates that one level into each provider's native parameter — OpenAI reasoning_effort / reasoning.effort, Anthropic output_config.effort, and Gemini thinkingConfig (thinkingLevel on 3.x, thinkingBudget on 2.5) — so you dial reasoning depth, and the reasoning-token bill that comes with it, once instead of per tool and per model.

Crucially the level is a constant for the whole conversation, not a per-turn decision — and that is deliberate. Providers list a change in reasoning effort as a cache-invalidation cause (OpenAI), and Anthropic breaks its message-cache breakpoints when the thinking configuration changes between turns. So "effort routing" that flips the level turn-by-turn quietly throws away the 50–90% prompt-cache discount — usually costing more than the reasoning tokens it saves. LeanCTX keeps the level byte-stable across turns, so the cached prefix stays warm and only the model's reasoning depth changes.

It is opt-in and conservative: off (the default) is a strict no-op; it never overrides an effort the client set itself; it only ever touches models that accept the parameter, so it can never turn a working request into a 400. On Anthropic it dials only a request that already asked for adaptive thinking, and on Gemini it skips 2.5 flash-lite (thinking off by default) and never sends both thinking fields — so it never adds reasoning cost you didn't request. lean-ctx proxy status reports the active level and how many requests were steered per provider.

Turn it on

lean-ctx proxy enable      # config flag + autostart service + endpoint wiring (port 4444)
lean-ctx proxy status      # requests, compression ratio, tokens saved, measured USD spend
lean-ctx proxy disable     # restore the original endpoint

enable wires your AI tool's ANTHROPIC_BASE_URL / OPENAI_BASE_URL to the local proxy and installs the autostart service. The proxy is opt-in: until you enable it, LeanCTX only touches the read path. A Claude Pro/Max subscription authenticates against api.anthropic.com directly and is deliberately not routed; set ANTHROPIC_API_KEY to route Claude through the proxy.

Measured, not estimated, spend

Because every response flows back through the proxy, it reads the real billed tokens from upstream — including cache reads/writes and reasoning tokens — and reports your actual provider bill per model. Token savings shown are request-side (tokens removed before forwarding); lean-ctx proxy status is the single source of truth for "where is my traffic going, and what did it cost".

The same engine, callable from your app

The proxy exposes POST /v1/compress — a deterministic, prompt-cache-friendly messages-in / messages-out contract. It is the same compression you get from the published SDK, so you can compress a chat array yourself before it reaches any model:

# pip install lean-ctx-sdk
from lean_ctx import compress
messages = compress(messages, model="claude-sonnet-4")

Do I need a separate request-compression proxy?

No. A standalone request-compression proxy and the LeanCTX proxy occupy the same plane — the wire path — and LeanCTX already ships it, prompt-cache-safe and metered. Run lean-ctx proxy enable and you have read-side and wire-side compression from one local binary, with one verifiable savings ledger across both.

Where to go next

Token Savings — read-side vs. wire-side, and where the tokens go.
Configuration → Proxy — enable/disable, port and per-provider upstreams.
API Reference — /v1/compress and the SDK surface.
CLI Reference — lean-ctx proxy subcommands.