
Journey: Prove It: Savings, Quality & Benchmarks

Every serious adopter asks three questions: how much did it save, is the output still as good, and can I reproduce your numbers? One journey answers all three — a signed savings receipt from the tamper-evident ledger, a deterministic with/without A/B eval, and a self-verifying benchmark scorecard. All Ed25519-signed, all verifiable offline.

Journey 22 · Operate & Govern · Govern · Perceive

You are proving the payoff to a lead, client or procurement

Covers

savings sign
savings verify-batch
savings roi
eval init
eval ab
eval verify
--gate
benchmark scorecard
determinism digest

Every serious adopter asks three questions, in this order: how much did it actually save? — is the output still as good? — and can I reproduce your numbers myself? LeanCTX answers all three with the same discipline: deterministic measurements, objective scorers and Ed25519-signed artifacts anyone can verify offline. Then a fourth question arrives with procurement — can you hand an auditor one signed, offline-verifiable bundle? — and §4 answers that too. Savings are the receipt; this journey is the full evidence kit.

The proofs at a glance

Question	Proof	Command
How much did it save?	Signed savings receipt from a tamper-evident ledger	`lean-ctx savings sign`
Is quality the same or better?	Deterministic with/without A/B eval, signed verdict	`lean-ctx eval ab`
Are the headline numbers real?	Self-verifying benchmark scorecard	`lean-ctx benchmark scorecard`
Can an auditor verify it end-to-end?	Signed Evidence Bundle, replayed offline by a standalone verifier	`lean-ctx audit evidence`

All four artifacts share the same trust model: an append-only SHA-256 chain or determinism digest for integrity, and an Ed25519 signature for origin. Verification needs no network, no ledger access and no LeanCTX history — just the file.

1. The savings receipt — sign the ledger

Every compression event is appended to the savings ledger at ~/.local/share/lean-ctx/savings/. Each entry commits to the hash of the previous one, so editing, reordering, inserting or deleting any past event breaks the chain:

lean-ctx savings verify          # is the local chain intact?
lean-ctx savings summary         # net tokens, USD, by-model / by-tool breakdown

Verified Savings Ledger (local, auditable)
  Net saved:  12.8M tokens  (~$32.41)  over 1,240 events
  By model:   claude-opus 8.1M · gpt-4o 3.0M · …
  By tool:    ctx_read 6.4M · ctx_search 3.1M · ctx_shell 2.0M · …
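The chaining rule described above can be sketched in a few lines of Python. This is an illustration of the general hash-chain technique, not LeanCTX's actual on-disk format: the field names (`prev`, `payload`, `hash`), the all-zero genesis value and the JSON canonicalization are assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical payload encoding."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Append an event that commits to the hash of the entry before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})


def verify(chain: list) -> bool:
    """Recompute every link; an edit, reorder, insert or mid-chain delete breaks one."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


chain = []
append(chain, {"tool": "ctx_read", "tokens_saved": 1200})
append(chain, {"tool": "ctx_search", "tokens_saved": 800})
append(chain, {"tool": "ctx_shell", "tokens_saved": 300})
intact = verify(chain)

# Inflate one past event without recomputing the hashes that follow it.
tampered = [dict(e, payload=dict(e["payload"])) for e in chain]
tampered[0]["payload"]["tokens_saved"] = 9999
tampered_ok = verify(tampered)

print(intact, tampered_ok)  # True False
```

Truncating the tail of a chain like this leaves every remaining link valid, which is one reason a signed receipt would also pin the chain head.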

When someone — a lead, a client, a finance team — wants the payoff on paper, sign the aggregate totals plus the chain head with your machine’s persistent Ed25519 key:

lean-ctx savings sign --out ./sprint-savings.json

The recipient verifies it on their own machine, offline:

lean-ctx savings verify-batch ./sprint-savings.json

Signed savings batch: VALID
  Signed by:  7b1e90…c4d2
  Net saved:  12.8M tokens (~$32.41) over 1,240 event(s)
  Chain head: 9f2c4b…e1a7

Tamper with anything — inflate the tokens, swap the public key, rewrite the chain head — and verification fails.

What’s shared — and what never is

The artifact carries only what an auditor needs. Signing is aggregate-only by construction: the payload is a dedicated struct, so a private field cannot accidentally be serialized into a shared file.

Included	Never included
Net tokens, $ saved, event count	Raw events
Top by-model / by-tool rows	File names or paths
Chain head (`last_entry_hash`)	Source code or prompts
`created_at`, `lean_ctx_version`	Command contents
Ed25519 public key + signature	Per-event timestamps
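The "aggregate-only by construction" idea can be pictured with a small Python sketch. It is not LeanCTX's code: the type names (`LedgerEvent`, `SavingsSummary`) and field names other than `last_entry_hash` are hypothetical, and the signing step is left out. The point is only that the serializer is handed a type that has no private fields to leak.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LedgerEvent:
    """Hypothetical raw ledger event: stays on the local machine."""
    timestamp: str
    file_path: str
    tool: str
    tokens_saved: int


@dataclass(frozen=True)
class SavingsSummary:
    """Hypothetical shareable payload: aggregates plus the chain head only."""
    net_tokens: int
    event_count: int
    by_tool: dict
    last_entry_hash: str


def summarize(events, last_entry_hash):
    """Fold raw events into totals; nothing per-event survives this step."""
    by_tool = {}
    for event in events:
        by_tool[event.tool] = by_tool.get(event.tool, 0) + event.tokens_saved
    return SavingsSummary(
        net_tokens=sum(event.tokens_saved for event in events),
        event_count=len(events),
        by_tool=by_tool,
        last_entry_hash=last_entry_hash,
    )


events = [
    LedgerEvent("2026-01-05T10:00:00Z", "src/auth.rs", "ctx_read", 1200),
    LedgerEvent("2026-01-05T10:02:00Z", "src/db.rs", "ctx_search", 800),
]
# Only the summary type is ever passed to the serializer.
payload = json.dumps(asdict(summarize(events, "placeholder-chain-head")), sort_keys=True)
print(payload)
```

Because paths and timestamps exist only on the event type, adding a new private field to events cannot change what ends up in the shared file.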

For a team-wide, procurement-grade number, lean-ctx savings roi distills the same ledger into net tokens, USD and top tools — signed the same way. Every command and field is documented in the Savings Ledger concept doc.

2. The quality proof — with vs. without, deterministically

“Does the agent answer better with LeanCTX, or just cheaper?” Token savings are easy to measure; output quality is the part people assume can never be pinned down. The A/B eval answers it head-on: run the same tasks through the same pinned model under two context conditions — raw vs LeanCTX — score objectively, and emit a signed, reproducible verdict.

The trick: separate the two sources of variance

Layer	Status in the eval	How it’s controlled
Context	Fully deterministic	Both windows are assembled byte-for-byte the same way and digested (SHA-256)
Model	The only stochastic part	Pinned (`temperature = 0`, fixed `seed`) and, for CI, replayed from recorded real responses

Every task runs twice under the same token budget, so any quality difference is about what went into the window, not how much:

Baseline (“without”) — raw files in deterministic path order, packed until the budget is full. The naive “dump the repo into the prompt” approach.
LeanCTX (“with”) — the task query drives BM25 relevance ranking, then each file is compressed so far more relevant signal fits in the identical budget.

This is the core claim made testable: with the same number of tokens, does retrieving + compressing beat dumping? The eval doesn’t assert it — it measures it.
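The two conditions can be contrasted in a toy Python sketch. Everything here is a stand-in chosen for brevity: tokens are counted as whitespace-separated words, relevance is a simple query-term count rather than BM25, and "compression" just keeps lines that mention a query term. File names and the query are invented for the example.

```python
def count_tokens(text):
    return len(text.split())  # stand-in tokenizer: whitespace-separated words


def pack(docs, budget):
    """Add (path, text) pairs in order until the next one would exceed the budget."""
    window, used = [], 0
    for path, text in docs:
        cost = count_tokens(text)
        if used + cost > budget:
            break
        window.append(path)
        used += cost
    return window


def baseline_window(files, budget):
    """'Without': raw files in deterministic path order."""
    return pack(sorted(files.items()), budget)


def relevance(query, text):
    """Stand-in score (NOT BM25): occurrences of query terms in the text."""
    terms = set(query.lower().split())
    return sum(1 for word in text.lower().split() if word in terms)


def compress(query, text):
    """Stand-in compressor: keep only lines that mention a query term."""
    terms = set(query.lower().split())
    kept = [line for line in text.splitlines() if terms & set(line.lower().split())]
    return "\n".join(kept)


def ranked_window(files, query, budget):
    """'With': rank by relevance, compress each file, then pack the same budget."""
    ranked = sorted(files.items(), key=lambda kv: (-relevance(query, kv[1]), kv[0]))
    compressed = [(path, compress(query, text)) for path, text in ranked]
    return pack([(path, text) for path, text in compressed if text], budget)


files = {
    "a_changelog.md": "release notes for version one\nminor fixes and cleanup",
    "b_readme.md": "project overview and install steps\nlicense information here",
    "z_auth.py": "def refresh token session\nimport os and sys\nreturn new token value",
}
print(baseline_window(files, budget=10))                  # ['a_changelog.md']
print(ranked_window(files, "refresh token", budget=10))   # ['z_auth.py']
```

With the same 10-token budget, path order spends the window on the changelog, while ranking plus compression fits the lines that actually concern the task.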

Objective scorers, no vibes

No LLM-as-judge. Code tasks are scored by running the task’s unit tests in a sandbox (exit 0 = pass). RAG/QA tasks use SQuAD-style exact-match, token-overlap F1 and containment against gold answers — the same number every time for the same answer.
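For readers unfamiliar with the QA metrics named above, here is a minimal Python sketch of SQuAD-style scoring. The normalization (lowercase, strip punctuation, drop English articles, collapse whitespace) follows the widely used SQuAD evaluation convention; whether LeanCTX normalizes identically is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred = normalize(prediction).split()
    ref = normalize(gold).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def containment(prediction: str, gold: str) -> float:
    """1.0 if the normalized gold answer appears as a whole-token run in the prediction."""
    return float(f" {normalize(gold)} " in f" {normalize(prediction)} ")


answer = "The cache is stored in the Redis cluster."
gold = "Redis cluster"
print(exact_match(answer, gold), token_f1(answer, gold), containment(answer, gold))
# 0.0 0.5 1.0
```

All three functions are pure string operations, so the same answer always yields the same score.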

Run it in three commands

lean-ctx eval init eval-suite      # scaffold a runnable starter suite

export LEAN_CTX_EVAL_MODEL_URL="https://api.openai.com/v1"
export LEAN_CTX_EVAL_MODEL="gpt-4o-mini"
export LEAN_CTX_EVAL_MODEL_KEY="sk-…"

lean-ctx eval ab \
  --suite eval-suite/suite.ndjson \
  --record eval-suite/recording.json \
  --out ab-report.json

Mean score   baseline=0.250  lean-ctx=0.875  Δ=+0.625
Pass rate    baseline=0%     lean-ctx=100%
Δ 95% CI     [+0.375, +0.812]  (2000 bootstrap, seed 0x5eed5eed5eed5eed)
Win/Tie/Loss 2 / 0 / 0

VERDICT: IMPROVED
determinism digest: 9f2c…

The verdict and the CI gate

A deterministic bootstrap (fixed seed → byte-identical CI everywhere) produces a 95% confidence interval on the mean delta, which collapses to IMPROVED, NO REGRESSION or REGRESSED. Add --gate and the command exits non-zero on a regression — a quality non-regression gate, the symmetric twin of the savings story.
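A seeded percentile bootstrap shows why a fixed seed makes the interval reproducible. The resample count and seed below are taken from the sample output above; the percentile method, the per-task delta values and the exact verdict thresholds (interval entirely above zero, entirely below zero, or spanning zero) are assumptions for illustration, not a statement of how LeanCTX computes its verdict.

```python
import random


def bootstrap_ci(deltas, resamples=2000, seed=0x5EED5EED5EED5EED, level=0.95):
    """Percentile bootstrap of the mean delta, driven by a fixed-seed RNG."""
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    tail = (1.0 - level) / 2.0
    lo = means[int(tail * (resamples - 1))]
    hi = means[int((1.0 - tail) * (resamples - 1))]
    return lo, hi


def verdict(lo, hi):
    """Assumed mapping from the interval to the three verdict labels."""
    if lo > 0:
        return "IMPROVED"
    if hi < 0:
        return "REGRESSED"
    return "NO REGRESSION"


# Hypothetical per-task deltas (lean-ctx score minus baseline score).
deltas = [0.5, 0.75, 0.25, 1.0]
lo, hi = bootstrap_ci(deltas)
print(verdict(lo, hi))  # IMPROVED
```

A gate in this sketch would simply exit non-zero when `verdict` returns `"REGRESSED"`; because the RNG is seeded, two runs over the same deltas return the same `(lo, hi)` pair.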

For CI you don’t re-roll a paid API on every push: capture the model’s answers once (--record), commit the recording, then replay byte-for-byte, offline, no secrets:

lean-ctx eval ab --suite eval-suite/suite.ndjson --replay eval-suite/recording.json --gate

A missing recorded response is a hard error, never a silent fallback — that’s what guarantees the run (and its digest) is identical on every machine. And like every proof here, the report verifies offline:

lean-ctx eval verify ab-report.json
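The replay rule described above (a missing recorded response is a hard error, never a fallback) can be sketched as follows. The recording layout, the key derivation and the class names are hypothetical; the real recording.json format is not shown in this guide.

```python
import hashlib


class MissingRecording(KeyError):
    """Raised when replay is asked for a response that was never recorded."""


def request_key(model: str, prompt: str) -> str:
    """Assumed lookup key: a digest of the pinned model name and the exact prompt."""
    return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()


class ReplayModel:
    """Serves recorded responses only; it never calls a live API."""

    def __init__(self, recording: dict):
        self._recording = recording

    def complete(self, model: str, prompt: str) -> str:
        key = request_key(model, prompt)
        if key not in self._recording:
            # Hard error: falling back to a live call would make runs diverge.
            raise MissingRecording(key)
        return self._recording[key]


recording = {request_key("gpt-4o-mini", "What is 2 + 2?"): "4"}
model = ReplayModel(recording)
print(model.complete("gpt-4o-mini", "What is 2 + 2?"))  # 4
```

Keying on the exact prompt bytes means any change to the assembled context window produces a miss, which surfaces as an error instead of a silently different run.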

3. The scorecard — reproduce the headline numbers

Marketing numbers don’t survive procurement. The scorecard turns compression savings and retrieval quality into a measurement you can re-run and get the same answer, on your laptop and in CI:

lean-ctx benchmark scorecard          # human-readable
lean-ctx benchmark scorecard --json   # machine-readable artifact

You get per-mode compression savings, retrieval recall@5 / recall@10 / MRR and latency over a fixed scenario matrix, plus a determinism_digest:

{
  "schema_version": 1,
  "determinism_digest": "…",   // fingerprint of the latency-free metrics
  "scenarios": [ /* per-scenario savings + recall + mrr */ ],
  "aggregate": { "avg_savings_pct": …, "avg_recall_at_5": …, "avg_mrr": … }
}

The corpus is generated deterministically and retrieval is pure BM25, so the quality metrics are identical run-to-run and machine-to-machine. Latency is reported but deliberately excluded from the digest (it’s wall-clock). Two runs of the same code anywhere produce the same digest — trust by construction, not by claim.
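The retrieval metrics and the latency-free digest can be illustrated with a short Python sketch. The metric definitions are the standard ones; the scenario field names (`recall_at_5`, `mrr`, `latency_ms`) and the canonical-JSON hashing scheme are assumptions, not the scorecard's documented schema.

```python
import hashlib
import json


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def determinism_digest(scenarios):
    """SHA-256 over canonical JSON of the metrics, with wall-clock latency removed."""
    stable = [{k: v for k, v in s.items() if k != "latency_ms"} for s in scenarios]
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


ranked = ["a.rs", "b.rs", "c.rs", "d.rs", "e.rs", "f.rs"]
relevant = {"c.rs", "f.rs"}
metrics = {
    "recall_at_5": recall_at_k(ranked, relevant, 5),    # 0.5
    "recall_at_10": recall_at_k(ranked, relevant, 10),  # 1.0
    "mrr": reciprocal_rank(ranked, relevant),           # first hit at rank 3
}

run_a = [dict(metrics, latency_ms=12.4)]
run_b = [dict(metrics, latency_ms=48.9)]  # slower machine, same quality metrics
print(determinism_digest(run_a) == determinism_digest(run_b))  # True
```

Mean reciprocal rank (MRR) is the average of `reciprocal_rank` over all queries in a scenario; any change to a quality metric changes the digest, while latency differences do not.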

4. The evidence bundle — everything an auditor needs

The three proofs above each answer one question. When an auditor, a customer’s security team or an EU AI Act reviewer wants all of it in one place, export a single signed bundle for a time window:

lean-ctx audit evidence --from 2026-01-01 --to 2026-03-31 --out evidence.zip          # deterministic ZIP
lean-ctx audit evidence --from 2026-01-01 --to 2026-03-31 --framework eu-ai-act --out evidence.zip

The deterministic ZIP (Evidence Bundle v1) carries the audit-chain segment for the period, the resolved context policy pack, the Context Governance Benchmark (CGB) and framework-coverage reports (EU AI Act / ISO 42001 / SOC 2), and an Ed25519-signed manifest binding it all together.

Verification does not require a LeanCTX installation: a standalone verifier (leanctx-verify, no engine code, no network) replays the hash chain and checks every signature:

leanctx-verify ./evidence.zip

Evidence Bundle v1 — VERIFICATION
  [1/5] manifest signature ........ PASS
  [2/5] audit hash chain .......... PASS  (1,240 entries)
  [3/5] policy pack digest ........ PASS
  [4/5] framework coverage ........ PASS  (EU AI Act, ISO 42001, SOC 2)
  [5/5] bundle integrity .......... PASS

VERDICT: VALID

Five PASS/FAIL checks that anyone can run offline: the same trust-by-construction model as the receipt, the eval and the scorecard, scaled to a full compliance record. The buyer-facing overview lives on the compliance page and the Context Governance Benchmark page.

When to reach for which proof

Justify the tool to a lead or finance — sign the ledger (§1); a signed dollar figure beats an unverifiable claim.
Convince an engineering lead quality won’t drop — run the A/B eval (§2) and put --gate in CI.
Answer procurement or research diligence — hand over the scorecard (§3) and let them reproduce the digest.
Hand an auditor the whole record — export the signed evidence bundle (§4); leanctx-verify validates it offline, no LeanCTX required.
Personal record — snapshot a signed receipt each quarter and keep a verifiable savings history.