Feature exploration memo · 2026-04-21

What does it actually take to ship an LLM review layer?

TL;DR. I tested a small AI reviewer the way you might test a new employee: first with no handbook, then with the right handbook pages, then with a rule checker beside it, then after a short training round. The best version could say, in effect: “Here is the policy I used, here is the risk I see, and here is why this ticket should or should not be approved.”

Winner model: Llama 3.1 8B-Instruct trained with LoRA · 95.3% decision accuracy · no unsafe approvals · all 40 safe approvals kept · no broken JSON · no mystery citations.

  • Eval size: 150 tickets
  • Model families: 3
  • R3 LoRA training rows: 200
  • Cost / 1k reviews: ~$0.60 (R3 inference; see §5)

1. Question

For this exploration, the practical question was: if I put an LLM reviewer in front of support replies, what does it need, at minimum, to behave like a useful teammate instead of a very confident autocomplete box?

Why now

Two easy answers usually appear when discussing solutions with colleagues: “give it documents” or “fine-tune it.” But those are names, not plans. So I built a small environment where I could watch each idea work on a product. NimbusMail has realistic support rules: refunds, billing disputes, deliverability problems, account deletion, and countries where service is restricted.

The review-screen demo is the visible artifact, but the real product is the measurement trail: a record of which approaches worked and why, so the same decisions can be made again for a different workflow.

2. Experiment setup

I chose Tinker Labs for model inference and training and settled on three models based on parameter count and performance. In LLM work, an eval is just a test set, and “locked” means I do not keep changing the test until the model looks good. To measure the models' work, I generated a 150-ticket exam set from the policy/KB and locked it before looking at the final answers. The exam was balanced across four support decisions: approve, edit, escalate, or reject. I then defined four test environments: R0 gives the model only the ticket, R1 adds relevant policy snippets, R2 adds a hard rule checker, and R3 adds a small trained adapter called a LoRA. Think of a LoRA as a clip-on memory layer: it nudges the model toward behavior it practiced earlier without rebuilding the whole model.
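To make the rung ladder concrete, here is a minimal sketch of how the prompt changes per rung. The toy word-overlap retriever and all helper names are illustrative stand-ins, not the actual harness (R2 filters R1 outputs after the fact, and R3 reuses the R1 prompt with the LoRA adapter loaded):

```python
import random

def lexical_retrieve(ticket: str, chunks: list[str], k: int) -> list[str]:
    """Toy stand-in for the real retriever: rank chunks by word overlap."""
    words = set(ticket.lower().split())
    return sorted(chunks, key=lambda c: -len(words & set(c.lower().split())))[:k]

def build_prompt(ticket: str, rung: str, chunks: list[str], k: int = 4) -> str:
    if rung == "R0":
        evidence: list[str] = []             # ticket only: model is on its prior
    elif rung == "R1_retrieved":
        evidence = lexical_retrieve(ticket, chunks, k)
    elif rung == "R1_random":
        evidence = random.sample(chunks, k)  # grounding control: wrong pages
    else:
        raise ValueError(f"unknown rung: {rung}")
    body = "\n".join(f"[{i}] {c}" for i, c in enumerate(evidence)) or "(none)"
    return ("Decide: approve, edit, escalate, or reject. Cite chunk ids.\n\n"
            f"EVIDENCE:\n{body}\n\nTICKET:\n{ticket}")
```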

| Rung | Evidence | Setup | Llama result |
|---|---|---|---|
| R0 | No evidence | Ticket only. What does the model do with its prior? | 41.6% acc · 4 approvals |
| R1 retrieved | Retrieved evidence | Ticket plus top-K NimbusMail policy and KB chunks. | 66.0% acc · 42 approvals · 3.3% unsafe |
| R1 random | Random evidence control | Same prompt, wrong chunks. Grounding sanity check. | 55.3% acc · 4.0% unsafe |
| R2 gate | Deterministic gate | Policy/citation/risk verifier on top of R1 outputs. | 44.7% acc · 0 approvals · 0% unsafe |
| R3 LoRA | Process-supervised LoRA | All-linear LoRA trained on decisions + claim spans + citations + risk flags. | 95.3% acc · 40 approvals · 0% unsafe |
Data construction notes

Retrieval means fetching the likely-relevant policy pages before the model answers, the way a support rep might open the handbook. On this eval, retrieval usually found the right page: 0.89 mean gold-chunk coverage and 0.97 any coverage. The labels have an internal audit today; the external human audit is accepted for project use, with raw annotator files still to import.
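A sketch of how those two coverage numbers can be computed, assuming each eval row carries a set of gold chunk ids and a set of retrieved chunk ids (field names are mine):

```python
def coverage_metrics(rows: list[dict]) -> tuple[float, float]:
    """rows: [{"gold": set of gold chunk ids, "retrieved": set of retrieved ids}, ...]"""
    per_row = [len(r["gold"] & r["retrieved"]) / len(r["gold"]) for r in rows]
    mean_gold = sum(per_row) / len(per_row)            # reported here: 0.89
    any_cov = sum(p > 0 for p in per_row) / len(rows)  # reported here: 0.97
    return mean_gold, any_cov
```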

The first test had a hidden flaw: it contained zero approve-labelled rows. That made “never approve anything” look smart. Replacing that test was the biggest quality lift in the project.

3. Findings

R0 — no evidence

R0 is the model walking into the exam with no handbook. It ended up looking safe because it barely says yes: Llama made only 4 approvals out of the 40 tickets that were safe to approve; Qwen and Nemotron made zero. This is conservative collapse: the model tries to avoid danger by refusing to help.

Conservative collapse, illustrated

The balanced eval set contains 40 safe-to-approve tickets. The goal isn't just a low unsafe-approval rate; it is to keep the safe approvals while blocking the unsafe ones.

R0, no evidence: 4 of 40 safe approvals preserved. It approves almost nothing, so unsafe = 0%, but usefulness is near zero too.
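Scoring both sides together is cheap; here is a small sketch under assumed field names:

```python
def safety_and_usefulness(rows: list[dict]) -> dict:
    """Each row: {"pred": model decision, "safe_to_approve": bool gold label}."""
    unsafe = sum(r["pred"] == "approve" and not r["safe_to_approve"] for r in rows)
    kept = sum(r["pred"] == "approve" and r["safe_to_approve"] for r in rows)
    total_safe = sum(r["safe_to_approve"] for r in rows)
    return {"unsafe_approval_rate": unsafe / len(rows),     # want 0
            "safe_approvals_kept": f"{kept}/{total_safe}"}  # want 40/40 here
```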

R1 — retrieved evidence (with a random-evidence control)

R1 gives the model retrieved evidence: the ticket plus the handbook pages (as embedded chunks) that seem relevant. For Llama, accuracy rose to 66%, and approvals returned to reasonable territory. I also tried feeding random policy snippets, which dropped performance by more than 10 points. This tells us the content of the evidence mattered, not just the structure of a citation block.

R2 — deterministic gate

R2 adds a deterministic gate, meaning it follows fixed rules: if a reply lacks a citation, trips a risk flag, or conflicts with policy, it is blocked. This is useful, but here it became too strict. It removed unsafe approvals by blocking every approval, including the good ones. A gate can catch clear rule violations, but it cannot turn a confused answer into a good one.
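For illustration, a minimal sketch of such a gate, assuming a parsed review object; field names are mine, and demoting a blocked approval to escalate (rather than discarding it) is my assumption:

```python
def gate(review: dict, evidence_ids: set[int]) -> dict:
    """Block approvals that fail any hard rule; never rewrite the reply."""
    reasons = []
    if review["decision"] == "approve":
        cites = review.get("citations", [])
        if not cites:
            reasons.append("no citation")
        if any(c not in evidence_ids for c in cites):
            reasons.append("out-of-bounds citation")
        if review.get("risk_flags"):
            reasons.append("risk flag raised")
    if reasons:
        return {**review, "decision": "escalate", "gate_reasons": reasons}
    return review
```

Note what this cannot do: if the underlying answer was confused, the gate only converts it into a refusal, which is exactly the over-blocking described above.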

Method × Model matrix

All rows on the 150-row balanced locked eval. The interactive chart lets you switch metrics to see how each method ladder differs by family; a static version appears below.

R3 — process-supervised LoRA

R3 is where the system started to look like a trained reviewer. I trained it not only on final answers but on process: which decision to make, which claim was supported by policy, which citation belonged to that claim, and which risks should block approval. Llama in R3 reached 95.3% accuracy, kept all 40 safe approvals, made no unsafe approvals, and returned clean JSON every time. Qwen showed a strong research signal but broke JSON on 7 rows, and a broken output is still a broken product.
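For a feel of what process supervision means at the data level, here is roughly what one training row would look like given that description; the ticket and exact schema are hypothetical, and the real dataset's field names may differ:

```python
train_row = {
    "ticket_id": "nm-0042",                  # hypothetical example row
    "evidence": ["[0] Refunds are available within 30 days ...",
                 "[1] Service is restricted in the following countries ..."],
    "target": {
        "decision": "reject",                # final decision label
        "reply": "We can't process this refund because ...",
        "claims": [                          # claim-level supervision
            {"span": "refunds are limited to 30 days",
             "citation": 0,                  # which evidence chunk backs the claim
             "supported": True},
        ],
        "risk_flags": ["restricted_region"], # anything that should block approval
    },
}
```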

Train → dev → eval: one direction only

Checkpoints selected on dev. The locked 150-row eval is touched once per winner, so tuning pressure never leaks into the headline numbers.

Train (200 rows, Tinker LoRA SFT) → Dev (50 rows, product metrics) → Locked eval (150 rows, final matrix)

  • Train: process-supervised targets (decision + reply + claim spans + citations + risk flags).
  • Dev: checkpoints are selected on dev product metrics, not loss; epochs 1–8 scored, best kept.
  • Locked eval: balanced decision mix (40/40/45/25); touched once per selected winner.

Safety vs usefulness

Top-left is the product goal: high decision accuracy, low unsafe approval. R3 LoRA points cluster in the top-left corner; R1 without a gate wanders right; R2 drags down usefulness.

(Legend: Llama · Qwen · Nemotron; green zone = product target.)

Paired bootstrap, 5000 resamples

Decision-accuracy improvement with 95% CI. Positive = candidate wins. Dots right of 0 = reliably better.

  • Llama R1 retrieved vs R0: +24.2pp
  • Llama R1 retrieved vs R1 random: +10.7pp
  • Llama R2 gate vs R1 retrieved: -21.3pp
  • Llama R3 LoRA vs R1 retrieved: +29.3pp
  • Qwen R3 LoRA vs R1 retrieved: +57.3pp
  • Nemotron R3 LoRA vs R1 retrieved: +41.2pp
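For reference, a minimal sketch of the paired-bootstrap procedure behind these intervals (a generic implementation, not the project's exact script):

```python
import random

def paired_bootstrap(cand: list[int], base: list[int], n: int = 5000, seed: int = 0):
    """cand/base: 0/1 per-row correctness for the two methods, same row order."""
    rng = random.Random(seed)
    rows = range(len(cand))
    deltas = []
    for _ in range(n):
        sample = [rng.choice(rows) for _ in rows]   # resample rows with replacement
        deltas.append(sum(cand[i] - base[i] for i in sample) / len(sample))
    deltas.sort()
    # mean Δ in accuracy plus a 95% percentile CI; CI above 0 = reliable win
    return sum(deltas) / n, (deltas[int(0.025 * n)], deltas[int(0.975 * n)])
```

Pairing matters: resampling the same rows for both methods cancels ticket-level difficulty, so the interval reflects the method difference rather than the eval mix.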

Final method matrix (static)

Same results as the interactive view, frozen as a paper-style reference.

Full matrix table (15 rows):

| Model | Method | Parsed | Accuracy | Unsafe | Approvals | OOB cites | Errors |
|---|---|---|---|---|---|---|---|
| Llama | R0 | 149/150 | 41.6% | 0.0% | 4 | 0 | 1 |
| Llama | R1 retrieved | 150/150 | 66.0% | 3.3% | 42 | 2 | 0 |
| Llama | R1 random | 150/150 | 55.3% | 4.0% | 32 | 3 | 0 |
| Llama | R2 gate | 150/150 | 44.7% | 0.0% | 0 | 2 | 0 |
| Llama | R3 LoRA | 150/150 | 95.3% | 0.0% | 40 | 0 | 0 |
| Qwen | R0 | 149/150 | 26.8% | 0.0% | 0 | 1 | 1 |
| Qwen | R1 retrieved | 150/150 | 40.7% | 30.0% | 82 | 1 | 0 |
| Qwen | R1 random | 150/150 | 38.7% | 14.7% | 48 | 37 | 0 |
| Qwen | R2 gate | 150/150 | 32.0% | 22.0% | 52 | 1 | 0 |
| Qwen | R3 LoRA | 143/150 | 100.0% (parsed) | 0.0% | 40 | 0 | 7 |
| Nemotron | R0 | 132/150 | 31.8% | 0.0% | 0 | 155 | 18 |
| Nemotron | R1 retrieved | 149/150 | 48.3% | 0.7% | 33 | 20 | 1 |
| Nemotron | R1 random | 150/150 | 38.0% | 6.0% | 31 | 41 | 0 |
| Nemotron | R2 gate | 149/150 | 34.9% | 0.0% | 12 | 20 | 1 |
| Nemotron | R3 LoRA | 149/150 | 89.9% (parsed) | 0.0% | 40 | 0 | 1 |

4. Surprises

Three things changed the plan. 1) The original eval rewarded doing nothing, so I had to rebuild it. 2) More training was not automatically better. A checkpoint is a saved version of the model after some training; Llama’s second checkpoint ended up the best, while later checkpoints kept studying the examples but drifted toward unsafe approvals. 3) Output format matters. If the app expects JSON and the model returns something else, even occasionally, that is a broken handoff.
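Concretely, the handoff check can be as strict as this sketch (the required keys are my assumption about the app's schema):

```python
import json

REQUIRED_KEYS = {"decision", "reply", "citations", "risk_flags"}

def parse_review(raw: str) -> dict | None:
    """Return the parsed review, or None for a broken handoff (counted as an error)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS <= obj.keys():
        return None
    return obj
```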

Checkpoint drift across 8 epochs

Dev accuracy plateaus early. Unsafe approval creeps up in later epochs. We selected on product metrics — not the loss curve or final epoch.

R3 training loss, all three families

8 epochs × 13 batches = 104 optimizer steps. Loss drops cleanly but that is not what selected the winner — dev-set product metrics did.

Training loss is the model’s practice-score while it studies. It tells us whether training is happening; it does not tell us whether customers are safer. I picked checkpoints on a separate 50-ticket dev set using product metrics: accuracy, unsafe approval, useful approvals, parse errors, and citations that stayed inside the provided evidence.
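As a sketch, selection by product metrics can be as simple as the following; the hard disqualifiers and the tie-break weight are illustrative, not the exact rule used:

```python
def pick_checkpoint(checkpoints: list[dict]) -> dict:
    """checkpoints: one dict of dev-set product metrics per saved epoch."""
    def score(c: dict) -> float:
        if c["unsafe_approval_rate"] > 0 or c["parse_errors"] > 0:
            return -1.0                      # unsafe or unparseable: never ship
        return c["accuracy"] + 0.1 * c["safe_approvals_kept"]
    return max(checkpoints, key=score)       # best product behavior, not lowest loss
```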

5. Production implications

On Tinker, Llama R3 cost about 0.06¢ per review (see the cost model below), and retraining the LoRA adapter is roughly $0.60 per refresh. Even at 1M reviews/month, the model bill would sit in the hundreds of dollars. The real cost is maintenance: updating policy chunks, refreshing the eval when support policies change, re-auditing labels, and deciding who owns an unsafe approval.

Cost at scale, per method

Tinker prefill/sample/train rates from docs. Other platforms from public pricing. Token shape: R0 ticket only (~1.2k in / 0.5k out); R1/R2/R3 + retrieved evidence (~2.8k in / 0.6k out). Numbers are list-price orientation, not quotes.

(Chart: monthly cost across volumes of 1k, 10k, 100k, and 1,000k reviews. R1 ≈ 0.06¢/review · R3 ≈ 0.06¢/review.)
Assumptions and what changes the answer
  • Token shape from V2 eval runs: R1/R2/R3 all ship ~2.8k evidence-loaded input tokens; R0 is ~1.2k.
  • R3 training: 104 optimizer steps × ~15k train tokens/step ≈ 1.56M tokens per retrain (worked through in the sketch after this list).
  • MoE pricing: Qwen/Nemotron MoE cheaper on Tinker because priced by active parameters, not total.
  • Other platforms: list-price blends for 8B/30B-class models, not LoRA-adapter-specific SKUs. Most managed platforms charge extra to serve a fine-tuned adapter.
  • Retrieval cost: not included — NimbusMail has ~2k chunks; a managed index runs $0–50/mo at this scale.
  • Human review: out of scope. This priced only the LLM layer.
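Working through those assumptions with the Tinker Llama rates reproduces the figures quoted above; a back-of-envelope sketch, not the billing model:

```python
IN_TOK, OUT_TOK = 2800, 600      # R1/R2/R3 evidence-loaded token shape
PREFILL, SAMPLE = 0.13, 0.40     # Tinker Llama $ per 1M tokens

per_review = (IN_TOK * PREFILL + OUT_TOK * SAMPLE) / 1e6
retrain = 1.56 * 0.40            # ~1.56M train tokens × $0.40/1M
print(f"{per_review * 100:.3f}¢/review, ~${retrain:.2f}/retrain")
# → 0.060¢/review and ~$0.62/retrain, matching the ≈0.06¢ and ~$0.60 above
```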

Platform options and list prices (orientation only)

| Platform | Llama-8B in/out | Qwen-30B in/out | Training / adapter note |
|---|---|---|---|
| Tinker (ours) | $0.13 / $0.40 | $0.12 / $0.30 | LoRA train $0.36–0.40/1M tokens |
| Fireworks | $0.20 / $0.20 | $0.15 / $0.60 | LoRA SFT $0.50/1M; H100 $6/hr |
| Together | $0.18 / $0.18 | | LoRA SFT 17–69B $1.50/1M |
| DeepInfra | $0.02 / $0.05 | $0.08 / $0.28 | No managed FT |
| Modal + vLLM | | | Raw GPU: H100 $3.95/hr |

All prices are per 1M tokens, using list prices from 2026-04. They are orientation numbers, not procurement quotes. Source links are in the appendix.

Maintenance cost at three volumes

| Rung | One-time work | Recurring | What triggers re-work |
|---|---|---|---|
| R0 | Prompt + schema | Near-zero | Model swap |
| R1 | Policy/KB chunking + retrieval eval | Chunk refresh on policy change | Policy updates, KB drift |
| R2 | Deterministic rules (citation, risk, policy) | Rule audit quarterly | New risk categories, rule false positives |
| R3 | Process-supervised training data (~200 rows) | LoRA retrain per policy shift; dev eval per retrain | Policy changes, label drift, eval rotation |

6. Recommendation

In this short set of experiments, Llama R3 stood out, with R1 as the simpler fallback. But I would not treat this demo as proof on its own; before making external claims, the eval still needs a full human audit. Still, I believe it makes sense to add complexity only when the simpler layer has failed in a measurable way, which is exactly what this ladder tested.

What I’d take into the next LLM system

  1. Build the test before the method. Balance it across every decision the system must learn.
  2. Measure safety and usefulness together. “Never says yes” can look safe while being useless.
  3. Use retrieval when the model needs facts. Test random evidence too, so you know whether the facts actually helped.
  4. Use gates to catch clear rule breaks, not to rescue bad reasoning.
  5. Choose checkpoints by product behavior, not by the prettiest training-loss curve.
  6. Treat citations, parse reliability, and maintenance as product work, not backend chores.

Appendix

R3 training config
  • 200 training rows / 50 dev rows / 150 locked-eval rows
  • 8 epochs, 104 optimizer steps, batch size 16
  • All-linear LoRA target modules, rank 16, blocker weight
  • Training on Tinker; checkpoints stored per epoch; selected on dev product metrics
  • Process-supervised targets include decision, reply, claim-level support labels, citations, risk flags
Tinker models in use
| Family | Tinker ID | Prefill | Sample | Train |
|---|---|---|---|---|
| Llama | Llama-3.1-8B-Instruct | $0.13 | $0.40 | $0.40 |
| Qwen | Qwen3-30B-A3B-Instruct-2507 | $0.12 | $0.30 | $0.36 |
| Nemotron | Nemotron-3-Nano-30B-A3B | $0.13 | $0.33 | $0.40 |

Per 1M tokens. Source: tinker-docs/models.

Confusion matrices (Llama, all 5 rungs)
(Five confusion-matrix panels, one per rung: r0, r1-retrieved, r1-random, r2-gate, r3-lora.)
Limits and what would change the headline
  • Label audit. Provisional internal audit; external-human adjudication is accepted for project use but not imported into repo.
  • Eval size. 150 rows. Enough for an engineering signal, not an academic benchmark.
  • Parse reliability. Qwen R3 recovers to 150/150 with a parser-repair pass. Framed as “Qwen R3 + repair,” not raw R3.
  • R2 gate is intentionally simple and over-blocks. A claim-level verifier (R2.2) does better but still trails R3.
  • Synthetic data is source-grounded and leakage-audited, but synthetic nonetheless.
  • Cost model ignores retrieval infra and human review labor — both are small at this scale but real.
Source docs and artifacts
  • Final report: docs/final-ml-product-report-polished-2026-04-20.md
  • Rubric: docs/grounded-reply-eval-rubric.md
  • Decision log: docs/decision-log.md
  • Teaching phases: docs/teaching/phase-00 through phase-08
  • Final matrix CSV: evals/grounded-evalstats/v2-balanced-final-matrix.csv
  • Bootstrap summary: evals/grounded-evalstats/v2-balanced-bootstrap-summary.md
  • W&B run: nad707-self/relay/runs/wqtldjb5