Journey: Prove It: Savings, Quality & Benchmarks
Every serious adopter asks three questions: how much did it save, is the output still as good, and can I reproduce your numbers? One journey answers all three — a signed savings receipt from the tamper-evident ledger, a deterministic with/without A/B eval, and a self-verifying benchmark scorecard. All Ed25519-signed, all verifiable offline.
You are proving the payoff to a lead, client or procurement
savings sign · savings verify-batch · savings roi · eval init · eval ab · eval verify · --gate · benchmark scorecard · determinism digest
Every serious adopter asks three questions, in this order: how much did it actually save? — is the output still as good? — and can I reproduce your numbers myself? LeanCTX answers all three with the same discipline: deterministic measurements, objective scorers and Ed25519-signed artifacts anyone can verify offline. Savings are the receipt; this journey is the full evidence kit.
The three proofs at a glance
| Question | Proof | Command |
|---|---|---|
| How much did it save? | Signed savings receipt from a tamper-evident ledger | lean-ctx savings sign |
| Is quality the same or better? | Deterministic with/without A/B eval, signed verdict | lean-ctx eval ab |
| Are the headline numbers real? | Self-verifying benchmark scorecard | lean-ctx benchmark scorecard |
All three artifacts share the same trust model: an append-only SHA-256 chain or determinism digest for integrity, and an Ed25519 signature for origin. Verification needs no network, no ledger access and no LeanCTX history — just the file.
1. The savings receipt — sign the ledger
Every compression event is appended to the savings ledger at ~/.lean-ctx/savings/. Each entry commits the hash of the previous one, so editing, reordering, inserting or deleting any past event breaks the chain:
lean-ctx savings verify # is the local chain intact?
lean-ctx savings summary # net tokens, USD, by-model / by-tool breakdown
Verified Savings Ledger (local, auditable)
Net saved: 12.8M tokens (~$32.41) over 1,240 events
By model: claude-opus 8.1M · gpt-4o 3.0M · …
By tool: ctx_read 6.4M · ctx_search 3.1M · ctx_shell 2.0M · …
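To make the chaining idea concrete, here is a minimal sketch of a hash-chained log. It is not LeanCTX's actual ledger format: the entry layout, the `GENESIS` placeholder and the helper names are assumptions chosen for illustration. It only shows why committing the previous hash makes edits, deletions and reordering detectable.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash, payload):
    """SHA-256 over the previous hash plus a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain, payload):
    """Append an entry that commits to the hash of the entry before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})


def verify(chain):
    """Walk the chain from the start and recompute every link."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


chain = []
append(chain, {"tool": "ctx_read", "tokens_saved": 1200})
append(chain, {"tool": "ctx_search", "tokens_saved": 800})
append(chain, {"tool": "ctx_shell", "tokens_saved": 300})
print(verify(chain))  # True

chain[1]["payload"]["tokens_saved"] = 8000  # inflate a past event
print(verify(chain))  # False
```

Because each hash depends on everything before it, the last hash (the chain head) acts as a compact commitment to the whole history.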
When someone — a lead, a client, a finance team — wants the payoff on paper, sign the aggregate totals plus the chain head with your machine’s persistent Ed25519 key:
lean-ctx savings sign --out ./sprint-savings.json
The recipient verifies it on their own machine, offline:
lean-ctx savings verify-batch ./sprint-savings.json
Signed savings batch: VALID
Signed by: 7b1e90…c4d2
Net saved: 12.8M tokens (~$32.41) over 1,240 event(s)
Chain head: 9f2c4b…e1a7
Tamper with anything — inflate the tokens, swap the public key, rewrite the chain head — and verification fails.
What’s shared — and what never is
The artifact carries only what an auditor needs. Signing is aggregate-only by construction: the payload is a dedicated struct, so a private field cannot accidentally be serialized into a shared file.
| Included | Never included |
|---|---|
| Net tokens, $ saved, event count | Raw events |
| Top by-model / by-tool rows | File names or paths |
| Chain head (last_entry_hash) | Source code or prompts |
| created_at, lean_ctx_version | Command contents |
| Ed25519 public key + signature | Per-event timestamps |
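The "aggregate-only by construction" idea can be illustrated with a small sketch. The types and field names below (`Event`, `SharedPayload`, `net_tokens`, and so on) are hypothetical, apart from `last_entry_hash`, which appears in the table above; the chain head value is a made-up placeholder. The point is only that the shared type has no slot for private data, so serializing it cannot leak any.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Event:
    """Private per-event record (hypothetical shape); stays on the local machine."""
    path: str
    command: str
    tokens_saved: int
    usd_saved: float


@dataclass(frozen=True)
class SharedPayload:
    """The only type that is ever serialized for sharing: aggregates plus chain head."""
    net_tokens: int
    usd_saved: float
    event_count: int
    last_entry_hash: str


def to_payload(events, chain_head):
    """Reduce raw events to totals; paths and commands are dropped here."""
    return SharedPayload(
        net_tokens=sum(e.tokens_saved for e in events),
        usd_saved=round(sum(e.usd_saved for e in events), 4),
        event_count=len(events),
        last_entry_hash=chain_head,
    )


events = [
    Event("src/billing/secret_keys.rs", "cat src/billing/secret_keys.rs", 1200, 0.003),
    Event("docs/plan.md", "grep roadmap docs/plan.md", 800, 0.002),
]
shared = json.dumps(asdict(to_payload(events, "placeholder-chain-head")), sort_keys=True)
print(shared)
```

Signing would then operate on the serialized `SharedPayload` bytes only; that step is omitted here.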
For a team-wide, procurement-grade number, lean-ctx savings roi distills the same ledger into net tokens, USD and top tools — signed the same way. Every command and field is documented in the Savings Ledger concept doc.
2. The quality proof — with vs. without, deterministically
“Does the agent answer better with LeanCTX, or just cheaper?” Token savings are easy to measure; output quality is the part people assume can never be pinned down. The A/B eval answers it head-on: run the same tasks through the same pinned model under two context conditions — raw vs LeanCTX — score objectively, and emit a signed, reproducible verdict.
The trick: separate the two sources of variance
| Layer | Status in the eval | How it’s controlled |
|---|---|---|
| Context | Fully deterministic | Both windows are assembled byte-for-byte the same way and digested (SHA-256) |
| Model | The only stochastic part | Pinned (temperature = 0, fixed seed) and, for CI, replayed from recorded real responses |
Every task runs twice under the same token budget, so any quality difference is about what went into the window, not how much:
- Baseline (“without”) — raw files in deterministic path order, packed until the budget is full. The naive “dump the repo into the prompt” approach.
- LeanCTX (“with”) — the task query drives BM25 relevance ranking, then each file is compressed so far more relevant signal fits in the identical budget.
This is the core claim made testable: with the same number of tokens, does retrieving + compressing beat dumping? The eval doesn’t assert it — it measures it.
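The two conditions can be sketched side by side. This is a simplified illustration, not the eval's real assembler: token counts are approximated by whitespace splitting, the compression step is left out entirely, and every function name is an assumption. It shows path-order packing versus BM25-ranked packing under one budget, with a SHA-256 digest over each assembled window.

```python
import hashlib
import math
from collections import Counter


def tokens(text):
    return text.lower().split()  # crude stand-in for a real tokenizer


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document (path -> text) against the query."""
    tokenized = {path: tokens(text) for path, text in docs.items()}
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / n
    df = Counter()
    for toks in tokenized.values():
        df.update(set(toks))
    scores = {}
    for path, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in sorted(set(tokens(query))):  # sorted: fixed summation order
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[path] = score
    return scores


def pack(paths, docs, budget):
    """Add files in the given order, skipping any that would exceed the budget."""
    window, used = [], 0
    for path in paths:
        cost = len(tokens(docs[path]))
        if used + cost <= budget:
            window.append(path)
            used += cost
    return window


def window_digest(window, docs):
    """SHA-256 over the exact bytes that went into the window, in order."""
    h = hashlib.sha256()
    for path in window:
        h.update(path.encode("utf-8") + b"\0" + docs[path].encode("utf-8") + b"\0")
    return h.hexdigest()


docs = {
    "a/auth.py": "login token session cookie verify password",
    "b/billing.py": "invoice tax total currency round amount",
    "c/cache.py": "cache evict ttl key store lru",
    "z/retry.py": "retry backoff jitter timeout retry attempt",
}
query, budget = "retry backoff timeout", 12

baseline = pack(sorted(docs), docs, budget)
scores = bm25_scores(query, docs)
ranked = pack(sorted(docs, key=lambda p: (-scores[p], p)), docs, budget)

print(baseline)  # ['a/auth.py', 'b/billing.py']
print(ranked)    # ['z/retry.py', 'a/auth.py']
print(window_digest(baseline, docs) != window_digest(ranked, docs))  # True
```

With the same 12-token budget, path order never reaches the one relevant file, while the ranked window puts it first. Ties are broken by path so the ordering stays deterministic.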
Objective scorers, no vibes
No LLM-as-judge. Code tasks are scored by running the task’s unit tests in a sandbox (exit 0 = pass). RAG/QA tasks use SQuAD-style exact-match, token-overlap F1 and containment against gold answers — the same number every time for the same answer.
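For readers unfamiliar with the QA scorers, the following sketch shows SQuAD-style normalization with exact match, token-overlap F1 and a containment check. The function names and the word-boundary containment rule are assumptions; LeanCTX's own scorers may differ in detail.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))


def f1(pred, gold):
    """Harmonic mean of token precision and recall after normalization."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)  # both empty counts as a match
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def containment(pred, gold):
    """1.0 if the normalized gold answer appears as whole words in the prediction."""
    return float(f" {normalize(gold)} " in f" {normalize(pred)} ")


pred, gold = "The answer is Ed25519.", "Ed25519"
print(exact_match(pred, gold))  # 0.0
print(f1(pred, gold))           # 0.5
print(containment(pred, gold))  # 1.0
```

All three are pure functions of the two strings, which is what makes the score repeatable for a given answer.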
Run it in three commands
lean-ctx eval init eval-suite # scaffold a runnable starter suite
export LEAN_CTX_EVAL_MODEL_URL="https://api.openai.com/v1"
export LEAN_CTX_EVAL_MODEL="gpt-4o-mini"
export LEAN_CTX_EVAL_MODEL_KEY="sk-…"
lean-ctx eval ab \
--suite eval-suite/suite.ndjson \
--record eval-suite/recording.json \
--out ab-report.json
Mean score baseline=0.250 lean-ctx=0.875 Δ=+0.625
Pass rate baseline=0% lean-ctx=100%
Δ 95% CI [+0.375, +0.812] (2000 bootstrap, seed 0x5eed5eed5eed5eed)
Win/Tie/Loss 2 / 0 / 0
VERDICT: IMPROVED
determinism digest: 9f2c…
The verdict and the CI gate
A deterministic bootstrap (fixed seed → byte-identical CI everywhere) produces a 95% confidence interval on the mean delta, which collapses to IMPROVED, NO REGRESSION or REGRESSED. Add --gate and the command exits non-zero on a regression — a quality non-regression gate, the symmetric twin of the savings story.
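A seeded bootstrap and a three-way verdict might look like the sketch below. The mapping from interval to verdict (lower bound above zero means IMPROVED, upper bound below zero means REGRESSED) is an assumption about how the collapse works, and Python's random generator is not the one LeanCTX uses, so this will not reproduce the interval shown in the sample output.

```python
import random


def bootstrap_ci(deltas, resamples=2000, seed=0x5EED5EED5EED5EED, alpha=0.05):
    """Percentile bootstrap CI on the mean of per-task score deltas."""
    rng = random.Random(seed)  # fixed seed: the same resamples on every run
    n = len(deltas)
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(resamples))
    lo_idx = int(resamples * alpha / 2)
    hi_idx = resamples - 1 - lo_idx
    return means[lo_idx], means[hi_idx]


def verdict(lo, hi):
    """Assumed rule: the whole interval must sit on one side of zero."""
    if lo > 0:
        return "IMPROVED"
    if hi < 0:
        return "REGRESSED"
    return "NO REGRESSION"


deltas = [0.5, 0.75]  # per-task (lean-ctx score minus baseline score), illustrative
lo, hi = bootstrap_ci(deltas)
print(verdict(lo, hi))                                # IMPROVED
print(bootstrap_ci(deltas) == bootstrap_ci(deltas))  # True
```

A gate in CI would then reduce to exiting non-zero when the verdict is REGRESSED.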
For CI you don’t re-roll a paid API on every push: capture the model’s answers once (--record), commit the recording, then replay byte-for-byte, offline, no secrets:
lean-ctx eval ab --suite eval-suite/suite.ndjson --replay eval-suite/recording.json --gate
A missing recorded response is a hard error, never a silent fallback — that’s what guarantees the run (and its digest) is identical on every machine. And like every proof here, the report verifies offline:
lean-ctx eval verify ab-report.json
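The record-then-replay behaviour described above, including the hard error on a missing response, can be sketched as follows. The recording layout (a mapping keyed by a hash of model and prompt) and all names are assumptions for illustration, not the actual recording.json format.

```python
import hashlib
import json


class MissingRecording(Exception):
    """Raised in replay mode when no recorded response exists for a request."""


def request_key(model, prompt):
    """Stable key for a request: SHA-256 of the canonical (model, prompt) pair."""
    raw = json.dumps([model, prompt], separators=(",", ":"))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def record(recording, model, prompt, call_model):
    """Call the live model once and store its answer under the request key."""
    answer = call_model(model, prompt)
    recording[request_key(model, prompt)] = answer
    return answer


def replay(recording, model, prompt):
    """Return the stored answer; never fall back to a live call."""
    key = request_key(model, prompt)
    if key not in recording:
        raise MissingRecording(f"no recorded response for request {key[:12]}")
    return recording[key]


def fake_model(model, prompt):
    # Stand-in for a paid API call
    return "answer to: " + prompt


recording = {}
record(recording, "demo-model", "What signs the receipt?", fake_model)
print(replay(recording, "demo-model", "What signs the receipt?"))

try:
    replay(recording, "demo-model", "An unrecorded question")
except MissingRecording as err:
    print("hard error:", err)
```

Failing loudly keeps a replayed run from silently mixing recorded and fresh answers, which would change the digest.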
3. The scorecard — reproduce the headline numbers
Marketing numbers don’t survive procurement. The scorecard turns compression savings and retrieval quality into a measurement you can re-run and get the same answer, on your laptop and in CI:
lean-ctx benchmark scorecard # human-readable
lean-ctx benchmark scorecard --json # machine-readable artifact
You get per-mode compression savings, retrieval recall@5 / recall@10 / MRR and latency over a fixed scenario matrix, plus a determinism_digest:
{
"schema_version": 1,
"determinism_digest": "…", // fingerprint of the latency-free metrics
"scenarios": [ /* per-scenario savings + recall + mrr */ ],
"aggregate": { "avg_savings_pct": …, "avg_recall_at_5": …, "avg_mrr": … }
}
The corpus is generated deterministically and retrieval is pure BM25, so the quality metrics are identical run-to-run and machine-to-machine. Latency is reported but deliberately excluded from the digest (it’s wall-clock). Two runs of the same code anywhere produce the same digest — trust by construction, not by claim.
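As a rough illustration of how a latency-free digest can work, the sketch below computes recall@k and reciprocal rank for one scenario, then hashes a canonical JSON encoding of the metrics with the latency field removed. The field names (`recall_at_5`, `mrr`, `latency_ms`) and the canonicalization are assumptions, not the scorecard's real schema.

```python
import hashlib
import json


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents found in the top k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def determinism_digest(scenarios, excluded=("latency_ms",)):
    """SHA-256 of the scenario metrics with wall-clock fields stripped."""
    stable = [{k: v for k, v in s.items() if k not in excluded} for s in scenarios]
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}
metrics = {
    "name": "demo-scenario",
    "recall_at_5": recall_at_k(ranked, relevant, 5),  # 1.0
    "mrr": reciprocal_rank(ranked, relevant),         # 0.5
}

run_a = [dict(metrics, latency_ms=12.4)]  # same metrics, different wall-clock
run_b = [dict(metrics, latency_ms=48.9)]
print(determinism_digest(run_a) == determinism_digest(run_b))  # True
```

Sorting keys and fixing separators matters here: the digest is only stable if the same metrics always serialize to the same bytes.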
When to reach for which proof
- Justify the tool to a lead or finance: sign the ledger (§1); a signed dollar figure beats an unverifiable claim.
- Convince an engineering lead quality won’t drop: run the A/B eval (§2) and put --gate in CI.
- Answer procurement or research diligence: hand over the scorecard (§3) and let them reproduce the digest.
- Personal record: snapshot a signed receipt each quarter and keep a verifiable savings history.