Journey: Prove It: Savings, Quality & Benchmarks

Every serious adopter asks three questions, in this order: how much did it actually save? is the output still as good? and can I reproduce your numbers myself? LeanCTX answers all three with the same discipline: deterministic measurements, objective scorers, and Ed25519-signed artifacts anyone can verify offline. Savings are the receipt; this journey is the full evidence kit.

The three proofs at a glance

Question | Proof | Command
How much did it save? | Signed savings receipt from a tamper-evident ledger | lean-ctx savings sign
Is quality the same or better? | Deterministic with/without A/B eval, signed verdict | lean-ctx eval ab
Are the headline numbers real? | Self-verifying benchmark scorecard | lean-ctx benchmark scorecard

All three artifacts share the same trust model: an append-only SHA-256 chain or determinism digest for integrity, and an Ed25519 signature for origin. Verification needs no network, no ledger access and no LeanCTX history — just the file.

1. The savings receipt — sign the ledger

Every compression event is appended to the savings ledger at ~/.lean-ctx/savings/. Each entry commits to the hash of the previous one, so editing, reordering, inserting or deleting any past event breaks the chain:

lean-ctx savings verify          # is the local chain intact?
lean-ctx savings summary         # net tokens, USD, by-model / by-tool breakdown

Verified Savings Ledger (local, auditable)
  Net saved:  12.8M tokens  (~$32.41)  over 1,240 events
  By model:   claude-opus 8.1M · gpt-4o 3.0M · …
  By tool:    ctx_read 6.4M · ctx_search 3.1M · ctx_shell 2.0M · …
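The chaining idea described above can be illustrated with a minimal sketch. It assumes a hypothetical entry layout (`prev`, `event`, `hash`) and a hypothetical all-zero genesis value; the real on-disk ledger format is not specified here and may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical placeholder for "no previous entry"


def entry_hash(prev_hash: str, event: dict) -> str:
    """Hash an event together with the hash of the entry before it."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, event: dict) -> None:
    """Append an event, committing to the current chain head."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "event": event, "hash": entry_hash(prev, event)})


def verify(chain: list) -> bool:
    """Recompute every link; any edit, reorder, insert or delete shows up as a mismatch."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


chain = []
append(chain, {"tool": "ctx_read", "saved_tokens": 1200})
append(chain, {"tool": "ctx_search", "saved_tokens": 800})
append(chain, {"tool": "ctx_shell", "saved_tokens": 300})
print(verify(chain))  # True: the untouched chain is intact

chain[1]["event"]["saved_tokens"] = 80000  # inflate a past event
print(verify(chain))  # False: the stored hash no longer matches
```

The last hash in the list plays the role of the chain head: a verifier who holds it can detect any change to earlier entries without trusting whoever stores the ledger.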

When someone — a lead, a client, a finance team — wants the payoff on paper, sign the aggregate totals plus the chain head with your machine’s persistent Ed25519 key:

lean-ctx savings sign --out ./sprint-savings.json

The recipient verifies it on their own machine, offline:

lean-ctx savings verify-batch ./sprint-savings.json

Signed savings batch: VALID
  Signed by:  7b1e90…c4d2
  Net saved:  12.8M tokens (~$32.41) over 1,240 event(s)
  Chain head: 9f2c4b…e1a7

Tamper with anything — inflate the tokens, swap the public key, rewrite the chain head — and verification fails.

What’s shared — and what never is

The artifact carries only what an auditor needs. Signing is aggregate-only by construction: the payload is a dedicated struct, so a private field cannot accidentally be serialized into a shared file.

Included | Never included
Net tokens, $ saved, event count | Raw events
Top by-model / by-tool rows | File names or paths
Chain head (last_entry_hash) | Source code or prompts
created_at, lean_ctx_version | Command contents
Ed25519 public key + signature | Per-event timestamps
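To make "aggregate-only by construction" concrete, here is a small sketch of the pattern: the shareable payload is its own type, so only its fields can ever be serialized. The class names, the `LedgerEvent` shape and the sample values are hypothetical; only `last_entry_hash`, `created_at` and `lean_ctx_version` are field names taken from the table above, and signing is left out.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LedgerEvent:
    """Private per-event record (hypothetical shape); never written to a shared file."""
    timestamp: str
    tool: str
    file_path: str
    saved_tokens: int


@dataclass(frozen=True)
class SavingsBatch:
    """Shareable payload: aggregate fields only."""
    net_tokens: int
    event_count: int
    last_entry_hash: str
    created_at: str
    lean_ctx_version: str


def summarize(events, last_entry_hash, created_at, version) -> SavingsBatch:
    """Collapse raw events into totals; nothing per-event survives this step."""
    return SavingsBatch(
        net_tokens=sum(e.saved_tokens for e in events),
        event_count=len(events),
        last_entry_hash=last_entry_hash,
        created_at=created_at,
        lean_ctx_version=version,
    )


events = [
    LedgerEvent("2024-01-01T10:00:00Z", "ctx_read", "src/secret.rs", 1200),
    LedgerEvent("2024-01-01T10:05:00Z", "ctx_search", "src/billing.rs", 300),
]
batch = summarize(events, "abc123", "2024-01-02T00:00:00Z", "0.0.0")
payload = json.dumps(asdict(batch), sort_keys=True)
print(payload)
```

Because the serializer only ever sees `SavingsBatch`, adding a new private field to `LedgerEvent` cannot leak into the shared artifact unless someone deliberately adds it to the payload type.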

For a team-wide, procurement-grade number, lean-ctx savings roi distills the same ledger into net tokens, USD and top tools — signed the same way. Every command and field is documented in the Savings Ledger concept doc.

2. The quality proof — with vs. without, deterministically

“Does the agent answer better with LeanCTX, or just cheaper?” Token savings are easy to measure; output quality is the part people assume can never be pinned down. The A/B eval answers it head-on: run the same tasks through the same pinned model under two context conditions — raw vs LeanCTX — score objectively, and emit a signed, reproducible verdict.

The trick: separate the two sources of variance

Layer | Status in the eval | How it’s controlled
Context | Fully deterministic | Both windows are assembled byte-for-byte the same way and digested (SHA-256)
Model | The only stochastic part | Pinned (temperature = 0, fixed seed) and, for CI, replayed from recorded real responses

Every task runs twice under the same token budget, so any quality difference is about what went into the window, not how much:

  • Baseline (“without”) — raw files in deterministic path order, packed until the budget is full. The naive “dump the repo into the prompt” approach.
  • LeanCTX (“with”) — the task query drives BM25 relevance ranking, then each file is compressed so far more relevant signal fits in the identical budget.

This is the core claim made testable: with the same number of tokens, does retrieving + compressing beat dumping? The eval doesn’t assert it — it measures it.
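The two conditions can be sketched side by side. This is an illustration under stated assumptions, not the eval's implementation: it counts whitespace-separated words as tokens, uses the textbook Okapi BM25 formula with common default parameters, packs whole files only, and omits the compression step entirely. The file names and contents are made up.

```python
import math
from collections import Counter


def tokens(text: str) -> list:
    """Crude tokenizer (an assumption): lowercase, split on whitespace."""
    return text.lower().split()


def bm25_scores(query: str, docs: dict, k1: float = 1.5, b: float = 0.75) -> dict:
    """Score each document against the query with the standard Okapi BM25 formula."""
    tokenized = {path: tokens(body) for path, body in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs
    doc_freq = Counter(term for t in tokenized.values() for term in set(t))
    scores = {}
    for path, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in set(tokens(query)):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[path] = score
    return scores


def pack(paths: list, docs: dict, budget: int) -> list:
    """Add whole files in the given order until the next one would exceed the budget."""
    chosen, used = [], 0
    for path in paths:
        cost = len(tokens(docs[path]))
        if used + cost > budget:
            break
        chosen.append(path)
        used += cost
    return chosen


docs = {
    "a/cache.py": "cache eviction policy lru cache entries expire",
    "b/logging.py": "logging setup handlers formatters levels rotate files",
    "c/retry.py": "retry backoff jitter network errors retry limit",
    "z/auth.py": "token refresh flow auth token expiry refresh handler",
}
query = "auth token refresh"
budget = 16

baseline = pack(sorted(docs), docs, budget)  # deterministic path order
scores = bm25_scores(query, docs)
ranked_paths = sorted(docs, key=lambda p: (-scores[p], p))  # ties broken by path
with_ranking = pack(ranked_paths, docs, budget)

print(baseline)      # ['a/cache.py', 'b/logging.py']
print(with_ranking)  # ['z/auth.py', 'a/cache.py']
```

Both windows spend the same budget, yet only the ranked one contains the file the task is actually about, which is exactly the difference the eval is designed to measure.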

Objective scorers, no vibes

No LLM-as-judge. Code tasks are scored by running the task’s unit tests in a sandbox (exit 0 = pass). RAG/QA tasks use SQuAD-style exact-match, token-overlap F1 and containment against gold answers — the same number every time for the same answer.
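For the RAG/QA side, the three metrics can be sketched as below. The normalization follows the widely used SQuAD convention (lowercase, strip punctuation and English articles, collapse whitespace); reading "containment" as "the normalized gold answer appears inside the normalized prediction" is an assumption about the scorer, not a documented definition.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles, squeeze spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall over the multiset overlap."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def containment(pred: str, gold: str) -> float:
    """Assumed meaning: 1.0 if the gold answer occurs verbatim in the prediction."""
    return float(normalize(gold) in normalize(pred))


pred = "The answer is Ed25519."
gold = "Ed25519"
print(exact_match(pred, gold), token_f1(pred, gold), containment(pred, gold))
```

Each function is a pure function of two strings, so the same answer always receives the same score, with no judge model in the loop.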

Run it in three commands

lean-ctx eval init eval-suite      # scaffold a runnable starter suite

export LEAN_CTX_EVAL_MODEL_URL="https://api.openai.com/v1"
export LEAN_CTX_EVAL_MODEL="gpt-4o-mini"
export LEAN_CTX_EVAL_MODEL_KEY="sk-…"

lean-ctx eval ab \
  --suite eval-suite/suite.ndjson \
  --record eval-suite/recording.json \
  --out ab-report.json

Mean score   baseline=0.250  lean-ctx=0.875  Δ=+0.625
Pass rate    baseline=0%     lean-ctx=100%
Δ 95% CI     [+0.375, +0.812]  (2000 bootstrap, seed 0x5eed5eed5eed5eed)
Win/Tie/Loss 2 / 0 / 0

VERDICT: IMPROVED
determinism digest: 9f2c…

The verdict and the CI gate

A deterministic bootstrap (fixed seed → byte-identical CI everywhere) produces a 95% confidence interval on the mean delta, which collapses to IMPROVED, NO REGRESSION or REGRESSED. Add --gate and the command exits non-zero on a regression — a quality non-regression gate, the symmetric twin of the savings story.
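A seeded percentile bootstrap is straightforward to sketch. The resample count and seed below mirror the sample output earlier in this section; the percentile method, the mapping from interval to verdict (interval fully above zero, fully below zero, or straddling it) and the `gate_exit_code` helper are assumptions for illustration, not the tool's documented rules.

```python
import random


def bootstrap_ci(deltas, resamples=2000, seed=0x5EED5EED5EED5EED, level=0.95):
    """Percentile bootstrap CI on the mean of per-task score deltas (with minus without)."""
    rng = random.Random(seed)  # fixed seed: the same inputs give the same interval
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n for _ in range(resamples)
    )
    lo_idx = int((1 - level) / 2 * resamples)
    hi_idx = int((1 + level) / 2 * resamples) - 1
    return means[lo_idx], means[hi_idx]


def verdict(ci_low: float, ci_high: float) -> str:
    """Assumed rule: the verdict depends on where the interval sits relative to zero."""
    if ci_low > 0:
        return "IMPROVED"
    if ci_high < 0:
        return "REGRESSED"
    return "NO REGRESSION"


def gate_exit_code(result: str) -> int:
    """Hypothetical gate: non-zero exit status only on a regression."""
    return 1 if result == "REGRESSED" else 0


# Hypothetical per-task deltas: lean-ctx score minus baseline score.
deltas = [0.5, 0.75, 0.25, 1.0, 0.5, 0.0, 0.75, 0.5]
low, high = bootstrap_ci(deltas)
result = verdict(low, high)
print(f"CI=[{low:+.3f}, {high:+.3f}] verdict={result} exit={gate_exit_code(result)}")
```

Because the random generator is seeded rather than drawn from system entropy, rerunning the sketch on the same deltas with the same Python version reproduces the interval exactly, which is what makes a CI gate on it stable.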

For CI you don’t re-roll a paid API on every push: capture the model’s answers once (--record), commit the recording, then replay byte-for-byte, offline, no secrets:

lean-ctx eval ab --suite eval-suite/suite.ndjson --replay eval-suite/recording.json --gate

A missing recorded response is a hard error, never a silent fallback — that’s what guarantees the run (and its digest) is identical on every machine. And like every proof here, the report verifies offline:

lean-ctx eval verify ab-report.json
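The record-then-replay guarantee described above, where a missing response is a hard error, can be sketched as follows. The key derivation, the class names and the in-memory recording are hypothetical; the actual recording.json layout is not documented here.

```python
import hashlib
import json


class MissingRecording(KeyError):
    """Raised when replay is asked for a response that was never recorded."""


def request_key(model: str, prompt: str) -> str:
    """Stable key for a request (hypothetical scheme): SHA-256 of model plus prompt."""
    return hashlib.sha256(json.dumps([model, prompt]).encode("utf-8")).hexdigest()


class Recorder:
    """Calls the real model once and remembers every answer."""

    def __init__(self, call_model):
        self.call_model = call_model
        self.responses = {}

    def complete(self, model: str, prompt: str) -> str:
        answer = self.call_model(model, prompt)
        self.responses[request_key(model, prompt)] = answer
        return answer


class Replayer:
    """Serves answers only from the recording; never falls back to a live call."""

    def __init__(self, responses: dict):
        self.responses = responses

    def complete(self, model: str, prompt: str) -> str:
        key = request_key(model, prompt)
        if key not in self.responses:
            raise MissingRecording(f"no recorded response for request {key[:12]}")
        return self.responses[key]


def fake_model(model: str, prompt: str) -> str:
    """Stand-in for a paid API call."""
    return f"answer to: {prompt}"


recorder = Recorder(fake_model)
recorder.complete("gpt-4o-mini", "What signs the receipt?")

replayer = Replayer(recorder.responses)
print(replayer.complete("gpt-4o-mini", "What signs the receipt?"))
```

Failing loudly on a miss is the design choice that matters: a silent fallback to a live call would reintroduce model variance and make the digest differ between machines.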

3. The scorecard — reproduce the headline numbers

Marketing numbers don’t survive procurement. The scorecard turns compression savings and retrieval quality into a measurement you can re-run and get the same answer, on your laptop and in CI:

lean-ctx benchmark scorecard          # human-readable
lean-ctx benchmark scorecard --json   # machine-readable artifact

You get per-mode compression savings, retrieval recall@5 / recall@10 / MRR and latency over a fixed scenario matrix, plus a determinism_digest:

{
  "schema_version": 1,
  "determinism_digest": "…",   // fingerprint of the latency-free metrics
  "scenarios": [ /* per-scenario savings + recall + mrr */ ],
  "aggregate": { "avg_savings_pct": …, "avg_recall_at_5": …, "avg_mrr": … }
}
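For readers less familiar with the retrieval metrics in that artifact, here is a sketch using their standard definitions. Whether the scorecard computes them in exactly this way is an assumption; the document IDs are made up.

```python
def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: list, relevant: set) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: list) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


queries = [
    (["d3", "d1", "d7", "d2", "d9", "d4"], {"d1", "d4"}),
    (["d5", "d6", "d8"], {"d8"}),
]
ranked, relevant = queries[0]
print(recall_at_k(ranked, relevant, 5))   # 0.5: only d1 is in the top 5
print(recall_at_k(ranked, relevant, 10))  # 1.0: both are within the top 10
print(mean_reciprocal_rank(queries))
```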

The corpus is generated deterministically and retrieval is pure BM25, so the quality metrics are identical run-to-run and machine-to-machine. Latency is reported but deliberately excluded from the digest (it’s wall-clock). Two runs of the same code anywhere produce the same digest — trust by construction, not by claim.
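One way to build such a digest is to hash a canonical serialization of everything except the wall-clock fields. The sketch below assumes hypothetical per-scenario field names (`latency_ms`, `savings_pct`, `recall_at_5`, `mrr`) and a sorted-key JSON canonicalization; the real scorecard may canonicalize differently.

```python
import hashlib
import json

WALL_CLOCK_FIELDS = {"latency_ms"}  # hypothetical name for the excluded field


def determinism_digest(scenarios: list) -> str:
    """SHA-256 over the latency-free metrics, serialized in a canonical form."""
    stable = [
        {k: v for k, v in scenario.items() if k not in WALL_CLOCK_FIELDS}
        for scenario in scenarios
    ]
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two hypothetical runs: identical quality metrics, different timings.
run_laptop = [
    {"name": "small-repo", "savings_pct": 61.5, "recall_at_5": 0.9, "mrr": 0.8, "latency_ms": 12.4},
]
run_ci = [
    {"name": "small-repo", "savings_pct": 61.5, "recall_at_5": 0.9, "mrr": 0.8, "latency_ms": 48.9},
]

print(determinism_digest(run_laptop) == determinism_digest(run_ci))  # True
```

Sorting keys and fixing the separators matters as much as dropping latency: without a canonical form, two semantically identical reports could serialize to different bytes and therefore different digests.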

When to reach for which proof

  • Justify the tool to a lead or finance — sign the ledger (§1); a signed dollar figure beats an unverifiable claim.
  • Convince an engineering lead quality won’t drop — run the A/B eval (§2) and put --gate in CI.
  • Answer procurement or research diligence — hand over the scorecard (§3) and let them reproduce the digest.
  • Personal record — snapshot a signed receipt each quarter and keep a verifiable savings history.