Core Concepts
Request Compression (the Proxy)
LeanCTX works on two planes. The read path compresses what your agent reads; the wire path is an optional local proxy that compresses every request to the model — system prompt, full history and tool results — before it is sent, with prompt-cache safety and measured spend. You do not need a separate request-compression proxy on top.
Most of the docs describe the
read path (agent → data): file reads, code search and shell output,
compressed before your agent ever sees them. This page is about the other plane — the
wire path (agent → model): an optional local proxy that compresses
every request before it is sent to the provider. Read-side savings trim what enters the
context window; wire-side savings trim every request you pay for, turn after turn.
What the proxy is
A local reverse proxy that sits between your AI tool and the model provider — Anthropic
(/v1/messages), OpenAI (/v1/chat/completions and
/v1/responses), Gemini, plus a Codex WebSocket bridge. It binds to
127.0.0.1 only, authenticates every request, and forwards your tool's own provider
key verbatim — it never injects credentials of its own. It is compiled into the default binary;
you just turn it on.
What it compresses
On every request, the proxy compresses the parts that grow without bound: the system prompt, the full conversation history, and tool results. This is the request-side counterpart to read-side compression: the same engine that shrinks a file read also shrinks the payload leaving for the model. After 10–15 turns a chat can carry 100K+ tokens of history with every new request — the proxy is what keeps that in check automatically, instead of asking you to start fresh chats.
Prompt-cache safe by design
Naive request compression breaks provider prompt caching (Anthropic, OpenAI), which would cost more than it saves. The proxy avoids that: history pruning and a cold-prefix repack only rewrite the parts of the request that are not a live cache prefix, so the cache stays warm and the discount is preserved. Compression that would invalidate a hot cache is skipped.
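As a rough illustration of the rule above, the sketch below leaves everything up to the last cache marker untouched and only rewrites the messages after it. It assumes Anthropic-style message blocks and treats an explicit cache_control marker as the boundary of the live prefix; the function names and the boundary rule are illustrative assumptions, not LeanCTX's actual implementation.

```python
from typing import Callable


def last_breakpoint_index(messages: list) -> int:
    """Return the index of the last message carrying a cache_control marker, or -1."""
    last = -1
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if isinstance(content, list) and any(
            isinstance(block, dict) and "cache_control" in block for block in content
        ):
            last = i
    return last


def shrink_message(msg: dict, shrink: Callable[[str], str]) -> dict:
    """Return a copy of msg with every text payload passed through shrink."""
    content = msg.get("content")
    if isinstance(content, str):
        return {**msg, "content": shrink(content)}
    if isinstance(content, list):
        blocks = [
            {**b, "text": shrink(b["text"])}
            if isinstance(b, dict) and isinstance(b.get("text"), str)
            else b
            for b in content
        ]
        return {**msg, "content": blocks}
    return msg


def rewrite_tail(messages: list, shrink: Callable[[str], str]) -> list:
    """Keep the (assumed) cached prefix as-is and shrink only what follows it."""
    cut = last_breakpoint_index(messages) + 1
    return messages[:cut] + [shrink_message(m, shrink) for m in messages[cut:]]
```

The shrink callable has to be deterministic for this to stay cache-friendly: the same tail must compress to the same bytes on the next turn, otherwise the prefix a later request builds on would change.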
Active prompt-cache breakpoints (v3.8.14)
Beyond preserving a cache the client set up, the proxy can create one a
raw API client left on the table. With proxy.cache_breakpoint enabled (env
LEAN_CTX_PROXY_CACHE_BREAKPOINT, off by default), it adds a single ephemeral
cache_control marker to the system field of an Anthropic request —
but only when the client set none of its own. A large, stable system prompt then bills
later turns at the cached rate instead of full price every turn.
It is Anthropic-only by construction (OpenAI and Gemini cache prefixes automatically and ignore
the marker, so those paths stay byte-unchanged), deterministic (a pure function of the body, so
the prefix it creates is itself byte-stable), never adds a second breakpoint, and is skipped below
Anthropic's minimum cacheable size so it never churns bytes for no cache. Every injection is
counted on the breakpoints_injected gauge in /status — a pure win signal,
never against the cache-safe ratio.
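The injection rule described above can be sketched as a pure function over the request body. The cache_control shape ({"type": "ephemeral"} on a system content block) follows Anthropic's Messages API; the character threshold is a hypothetical stand-in for Anthropic's minimum cacheable size, and the function name is illustrative rather than LeanCTX's own.

```python
import copy

# Hypothetical stand-in: the real minimum is defined in tokens and varies by model.
MIN_CACHEABLE_CHARS = 4096


def _has_cache_control(node) -> bool:
    """True if any dict anywhere in the body already carries a cache_control key."""
    if isinstance(node, dict):
        return "cache_control" in node or any(_has_cache_control(v) for v in node.values())
    if isinstance(node, list):
        return any(_has_cache_control(v) for v in node)
    return False


def inject_breakpoint(body: dict) -> dict:
    """Add one ephemeral marker to the system field, or return the body unchanged."""
    system = body.get("system")
    if not system or _has_cache_control(body):
        return body  # nothing to cache, or the client already set a breakpoint
    if isinstance(system, str):
        blocks = [{"type": "text", "text": system}]
    else:
        blocks = copy.deepcopy(system)
    if sum(len(b.get("text", "")) for b in blocks) < MIN_CACHEABLE_CHARS:
        return body  # too small to be cached, so do not churn bytes
    blocks[-1]["cache_control"] = {"type": "ephemeral"}
    return {**body, "system": blocks}
```

Because the output depends only on the input body, two identical requests produce identical bytes, and running the function on its own output is a no-op, which is one way to guarantee a second breakpoint is never added.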
A companion measurement, proxy.cache_aligner (on by default, since it is
measurement-only and strictly cache-safe), scans each unanchored system prompt for
cache-busting volatile tokens such as today's date, a fresh UUID, or a git SHA, and reports how many it
found (volatile_fields_detected) so you can quantify how much prompt-cache reuse your prompt
gives up. The request body is never mutated; set proxy.cache_aligner to false to drop the per-request scan.
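A read-only scan of this kind might look like the sketch below. The three patterns (ISO date, canonical UUID, full 40-character git SHA) are illustrative assumptions; the source does not list the exact token classes or regexes the proxy uses.

```python
import re

# Illustrative patterns only; the real detector's rules are not documented here.
VOLATILE_PATTERNS = {
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
    "git_sha": re.compile(r"\b[0-9a-f]{40}\b"),
}


def count_volatile_fields(system_prompt: str) -> dict:
    """Count matches per pattern without modifying the prompt."""
    return {
        name: len(pattern.findall(system_prompt))
        for name, pattern in VOLATILE_PATTERNS.items()
    }
```

Summing the returned counts gives a single number comparable in spirit to volatile_fields_detected.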
proxy.cache_policy — cache-economics, also on by default — adds the
diagnosis layer: every anchored turn is classified by why it missed the prompt-cache (cold
start, TTL lapse, or prefix change) and surfaced as cumulative /status gauges, paired
with a net-cost gate that refuses to re-seed a prefix too small to ever be cached. Both halves are
strictly safe — the telemetry never touches the body and the gate only makes a repack more
conservative — so they ship on out of the box.
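The two halves of this policy can be sketched as a classifier plus a gate. Everything here is an assumption made for illustration: the function names, the order in which the three causes are checked, the 300-second TTL (the default lifetime of an Anthropic ephemeral cache entry) and the 1024-token minimum, which varies by model.

```python
import hashlib
from typing import Optional


def prefix_hash(prefix_bytes: bytes) -> str:
    """Stable fingerprint of the anchored prefix."""
    return hashlib.sha256(prefix_bytes).hexdigest()


def classify_miss(
    prev_hash: Optional[str],
    cur_hash: str,
    idle_seconds: float,
    ttl_seconds: float = 300.0,  # assumed TTL
) -> str:
    """Label why an anchored turn missed the prompt cache."""
    if prev_hash is None:
        return "cold_start"      # first anchored turn of the session
    if prev_hash != cur_hash:
        return "prefix_change"   # the cached bytes no longer match
    if idle_seconds > ttl_seconds:
        return "ttl_lapse"       # same prefix, but the entry expired
    return "unknown"


def should_reseed(prefix_tokens: int, min_cacheable_tokens: int = 1024) -> bool:
    """Net-cost gate: never re-seed a prefix too small to be cached at all."""
    return prefix_tokens >= min_cacheable_tokens
```

Neither function touches the request body, which mirrors the property that makes the feature safe to leave on.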
Deterministic structural crushing (v3.8.14)
Tool results are often array-heavy JSON — API responses, kubectl get -o json, DB
dumps, RAG chunks — that repeat the same keys and values on every row. The json_crush
engine factors that redundancy out: every key shared by all items is hoisted to a
_defaults block and only per-row deviations are kept, so the result is
exactly reconstructible and never inflates. The output is a pure function of the
input (candidate keys and value frequencies walk ordered sets), so it is byte-stable and never
leaks hash-map order — the determinism that keeps the prompt-cache prefix warm.
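To make the hoisting idea concrete, here is a minimal sketch, not the json_crush engine itself. It assumes the input is a list of JSON objects, uses a "rows" key of its own invention next to _defaults, and picks the most frequent value per shared key with a lexical tie-break so the result does not depend on dictionary order.

```python
import json
from collections import Counter


def crush(rows: list):
    """Hoist values shared across all rows into _defaults; keep per-row deviations."""
    if not rows or not all(isinstance(r, dict) for r in rows):
        return rows
    shared = sorted(set.intersection(*(set(r) for r in rows)))
    defaults, default_json = {}, {}
    for key in shared:
        counts = Counter(json.dumps(r[key], sort_keys=True) for r in rows)
        # Most frequent value wins; ties break on the serialized text.
        best, n = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        if n > 1:
            defaults[key] = json.loads(best)
            default_json[key] = best
    slim = [
        {
            k: v
            for k, v in r.items()
            if default_json.get(k) != json.dumps(v, sort_keys=True)
        }
        for r in rows
    ]
    crushed = {"_defaults": defaults, "rows": slim}
    # Never inflate: fall back to the original when nothing is gained.
    if len(json.dumps(crushed, sort_keys=True)) >= len(json.dumps(rows, sort_keys=True)):
        return rows
    return crushed


def expand(crushed):
    """Exact inverse of crush."""
    if isinstance(crushed, list):
        return crushed
    return [{**crushed["_defaults"], **row} for row in crushed["rows"]]
```

Values are compared by their serialized form rather than with ==, because Python treats True and 1 as equal and a naive comparison would break exact reconstruction.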
On the read and shell path the same core powers the generic JSON fallback and an opt-in mode for
otherwise-verbatim data commands (gh api, jq, curl JSON) via
crush_verbatim_json, firing only when it at least halves the payload. A dropped
high-entropy column (timestamps, UUIDs) stays recoverable through a content-addressed
ctx_expand handle, so a datum is never lost.
The same columnar core now reaches beyond JSON. A CSV/TSV crusher hoists every
column that repeats a single value across all rows into a _const block and keeps the
rest positionally; a YAML crusher folds YAML into the compact JSON form and factors
its shared structure. Both are deterministic and never inflate, and on their lossy path any dropped
high-entropy column stays recoverable through the same ctx_expand handle. The
ctx_compare tool previews any pipeline before you trust it — original vs the exact bytes
lean-ctx would emit, with token counts and a line diff.
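The constant-column hoist for CSV can be sketched in the same spirit. This is an illustrative sketch only: it assumes a non-empty, rectangular table with a header row and unique column names, stores the header under a "header" key of its own invention, and omits both the never-inflate guard and the lossy path.

```python
import csv
import io


def crush_csv(text: str, delimiter: str = ",") -> dict:
    """Move columns that hold one value in every row into _const."""
    header, *body = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    const = {}
    for i, name in enumerate(header):
        values = {row[i] for row in body}
        if len(values) == 1:
            const[name] = values.pop()
    keep = [i for i, name in enumerate(header) if name not in const]
    return {
        "header": header,
        "_const": const,
        "rows": [[row[i] for i in keep] for row in body],  # positional remainder
    }


def expand_csv(crushed: dict) -> list:
    """Rebuild the full table, header first."""
    table = [crushed["header"]]
    for row in crushed["rows"]:
        cells = iter(row)
        table.append(
            [
                crushed["_const"][name] if name in crushed["_const"] else next(cells)
                for name in crushed["header"]
            ]
        )
    return table
```

Passing delimiter="\t" handles TSV with the same code.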
Adaptive aggressiveness (v3.8.14)
Compression that is too aggressive shows itself: the agent keeps pulling back the originals it
dropped. LeanCTX treats ctx_expand/ctx_retrieve re-fetches as exactly
that signal and, combined with read/run correction loops, dials a session's compression down to
Lite (3+ signals) then Off (5+), recovering only when the pressure
clears. The level is server state that feeds future decisions — never part of any tool-output
body, so output stays deterministic.
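The thresholds above map naturally onto a small piece of session state. The sketch below uses the 3+ and 5+ cut-offs from the text; the class, the name "full" for the default level, and the all-at-once reset are assumptions, since the source does not say how pressure decays.

```python
from enum import Enum


class Level(Enum):
    FULL = "full"  # assumed name for the default level
    LITE = "lite"
    OFF = "off"


def level_for(signals: int) -> Level:
    """Map a count of re-fetch and correction signals to a compression level."""
    if signals >= 5:
        return Level.OFF
    if signals >= 3:
        return Level.LITE
    return Level.FULL


class SessionPressure:
    """Server-side state only; it never appears in any tool output."""

    def __init__(self) -> None:
        self.signals = 0

    def record_signal(self) -> Level:
        """Call on each ctx_expand/ctx_retrieve re-fetch or correction loop."""
        self.signals += 1
        return level_for(self.signals)

    def clear(self) -> Level:
        """Call when the pressure has cleared (simplified to a full reset)."""
        self.signals = 0
        return level_for(self.signals)
```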
Cross-provider effort control — cache-safe by design
The proxy can pin one reasoning-effort level across every provider with a single
setting (proxy.effort = minimal | low | medium | high). LeanCTX translates that one
level into each provider's native parameter — OpenAI reasoning_effort /
reasoning.effort, Anthropic output_config.effort, and Gemini
thinkingConfig (thinkingLevel on 3.x, thinkingBudget on 2.5)
— so you dial reasoning depth, and the reasoning-token bill that comes with it, once instead of
per tool and per model.
Crucially, the level is a constant for the whole conversation, not a per-turn decision, and that is deliberate. OpenAI lists a change in reasoning effort as a cache-invalidation cause, and Anthropic breaks its message-cache breakpoints when the thinking configuration changes between turns. So "effort routing" that flips the level turn by turn quietly throws away the 50–90% prompt-cache discount, usually costing more than the reasoning tokens it saves. LeanCTX keeps the level byte-stable across turns, so the cached prefix stays warm and only the model's reasoning depth changes.
It is opt-in and conservative: off (the default) is a strict no-op;
it never overrides an effort the client set itself; it only ever touches models that accept the
parameter, so it can never turn a working request into a 400. On Anthropic it dials
only a request that already asked for adaptive thinking, and on Gemini it skips 2.5
flash-lite (thinking off by default) and never sends both thinking fields — so it never adds
reasoning cost you didn't request. lean-ctx proxy status reports the active level and
how many requests were steered per provider.
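The conservative rules in this section (off is a no-op, never override the client, only dial Anthropic requests that already asked for adaptive thinking) can be sketched as follows. The field names come from the text above; the provider labels, the function name and the pass-through of the level string are assumptions. Per-provider value translation, the model capability check and the Gemini mapping are deliberately left out because the source does not specify them.

```python
EFFORT_LEVELS = ("minimal", "low", "medium", "high")


def apply_effort(provider: str, body: dict, effort: str) -> dict:
    """Return the body with the pinned effort applied, or the same body untouched."""
    if effort == "off":
        return body  # strict no-op
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort!r}")

    if provider == "openai_chat":  # /v1/chat/completions
        if "reasoning_effort" in body:
            return body  # the client chose its own level
        return {**body, "reasoning_effort": effort}

    if provider == "openai_responses":  # /v1/responses
        reasoning = body.get("reasoning") or {}
        if "effort" in reasoning:
            return body
        return {**body, "reasoning": {**reasoning, "effort": effort}}

    if provider == "anthropic":  # /v1/messages
        thinking = body.get("thinking") or {}
        if thinking.get("type") != "adaptive":
            return body  # never add reasoning the request did not ask for
        output_config = body.get("output_config") or {}
        if "effort" in output_config:
            return body
        return {**body, "output_config": {**output_config, "effort": effort}}

    return body  # unknown provider: leave the request alone
```

Since the level is an argument fixed for the whole conversation rather than something computed per turn, the injected field is identical on every request and does not disturb the cached prefix.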
Turn it on
lean-ctx proxy enable # config flag + autostart service + endpoint wiring (port 4444)
lean-ctx proxy status # requests, compression ratio, tokens saved, measured USD spend
lean-ctx proxy disable # restore the original endpoint
The enable subcommand wires your AI tool's ANTHROPIC_BASE_URL /
OPENAI_BASE_URL to the local proxy and installs the autostart service. The proxy is
opt-in: until you enable it, LeanCTX only touches the read path. A Claude
Pro/Max subscription authenticates against api.anthropic.com directly and is
deliberately not routed; set ANTHROPIC_API_KEY to route Claude through the
proxy.
Measured, not estimated, spend
Because every response flows back through the proxy, it reads the real billed tokens
from upstream — including cache reads/writes and reasoning tokens — and reports your actual
provider bill per model. Token savings shown are request-side (tokens removed before forwarding);
lean-ctx proxy status is the single source of truth for "where is my traffic going,
and what did it cost".
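Reading billed tokens from responses, rather than estimating them, amounts to summing the usage block each response carries. The sketch below does this for Anthropic-style responses, whose usage object reports input_tokens, output_tokens, cache_creation_input_tokens and cache_read_input_tokens. The function name is illustrative, and converting tokens to USD is left out because it needs per-model prices that are not given here.

```python
from collections import defaultdict

USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def tally_usage(responses: list) -> dict:
    """Sum the upstream-reported usage counters per model."""
    totals = defaultdict(lambda: dict.fromkeys(USAGE_FIELDS, 0))
    for resp in responses:
        usage = resp.get("usage") or {}
        model = resp.get("model", "unknown")
        for field in USAGE_FIELDS:
            # Missing or null counters count as zero.
            totals[model][field] += usage.get(field) or 0
    return {model: dict(counts) for model, counts in totals.items()}
```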
The same engine, callable from your app
The proxy exposes POST /v1/compress — a deterministic, prompt-cache-friendly
messages-in / messages-out contract. It is the same compression you get
from the published SDK, so you can compress a chat array yourself before it reaches any model:
# pip install lean-ctx-sdk
from lean_ctx import compress
messages = compress(messages, model="claude-sonnet-4")
Do I need a separate request-compression proxy?
No. A standalone request-compression proxy and the LeanCTX proxy occupy the same plane — the wire
path — and LeanCTX already ships it, prompt-cache-safe and metered. Run
lean-ctx proxy enable and you have read-side and wire-side compression from one local
binary, with one verifiable savings ledger across both.
Where to go next
- Token Savings: read-side vs. wire-side, and where the tokens go.
- Configuration → Proxy: enable/disable, port and per-provider upstreams.
- API Reference: /v1/compress and the SDK surface.
- CLI Reference: lean-ctx proxy subcommands.