Feature exploration memo · 2026-04-21

What does it actually take to ship an LLM review layer?

TL;DR. I tested a small AI reviewer the way you might test a new employee: first with no handbook, then with the right handbook pages, then with a rule checker beside it, then after a short training round. The best version could say, in effect: “Here is the policy I used, here is the risk I see, and here is why this ticket should or should not be approved.”

Winner model: Llama 3.1 8B-Instruct trained with LoRA · 95.3% decision accuracy · no unsafe approvals · all 40 safe approvals kept · no broken JSON · no mystery citations.

  • Eval size: 150 tickets
  • Model families: 3
  • R3 LoRA training rows: 200
  • Cost / 1k reviews: ~$0.60 (R3 inference; see §5)

1. Question

For this exploration, the practical question was: if I put an LLM reviewer in front of support replies, what does it need, at minimum, to behave like a useful teammate instead of a very confident autocomplete box?

Why now

Two easy answers usually appear when discussing solutions with colleagues: “give it documents” or “fine-tune it.” But those are names, not plans. So I built a small environment where I could watch each idea work on a product. NimbusMail has realistic support rules: refunds, billing disputes, deliverability problems, account deletion, and countries where service is restricted.

The review-screen demo is the visible artifact, but the real product is the measurement trail: a record of which approaches worked and why, so the same decisions can be made again for a different workflow.

2. Experiment setup

I chose Tinker Labs for model inference and training and settled on three models based on parameter count and performance. In LLM work, an eval is just a test set, and “locked” means I do not keep changing the test until the model looks good. To measure the models' work, I generated a 150-ticket exam set from the policy/KB and locked it before looking at the final answers. The exam was balanced across four support decisions: approve, edit, escalate, or reject. I then defined four test environments: R0 gives the model only the ticket, R1 adds relevant policy snippets, R2 adds a hard rule checker, and R3 adds a small trained adapter called a LoRA. Think of a LoRA as a clip-on memory layer: it nudges the model toward behavior it practiced earlier without rebuilding the whole model.
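To make the rung ladder concrete, here is a minimal sketch of how the prompt changes per rung. The toy word-overlap retriever and all helper names are illustrative stand-ins, not the actual harness (R2 filters R1 outputs after the fact, and R3 reuses the R1 prompt with the LoRA adapter loaded):

```python
import random

def lexical_retrieve(ticket: str, chunks: list[str], k: int) -> list[str]:
    """Toy stand-in for the real retriever: rank chunks by word overlap."""
    words = set(ticket.lower().split())
    return sorted(chunks, key=lambda c: -len(words & set(c.lower().split())))[:k]

def build_prompt(ticket: str, rung: str, chunks: list[str], k: int = 4) -> str:
    if rung == "R0":
        evidence: list[str] = []             # ticket only: model is on its prior
    elif rung == "R1_retrieved":
        evidence = lexical_retrieve(ticket, chunks, k)
    elif rung == "R1_random":
        evidence = random.sample(chunks, k)  # grounding control: wrong pages
    else:
        raise ValueError(f"unknown rung: {rung}")
    body = "\n".join(f"[{i}] {c}" for i, c in enumerate(evidence)) or "(none)"
    return ("Decide: approve, edit, escalate, or reject. Cite chunk ids.\n\n"
            f"EVIDENCE:\n{body}\n\nTICKET:\n{ticket}")
```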

| Rung | Evidence | Setup | Llama result |
|---|---|---|---|
| R0 | No evidence | Ticket only. What does the model do with its prior? | 41.6% acc · 4 approvals |
| R1 retrieved | Retrieved evidence | Ticket plus top-K NimbusMail policy and KB chunks. | 66.0% acc · 42 approvals · 3.3% unsafe |
| R1 random | Random evidence control | Same prompt, wrong chunks. Grounding sanity check. | 55.3% acc · 4.0% unsafe |
| R2 gate | Deterministic gate | Policy/citation/risk verifier on top of R1 outputs. | 44.7% acc · 0 approvals · 0% unsafe |
| R3 LoRA | Process-supervised LoRA | All-linear LoRA trained on decisions + claim spans + citations + risk flags. | 95.3% acc · 40 approvals · 0% unsafe |
Data construction notes

Retrieval means fetching the likely-relevant policy pages before the model answers, the way a support rep might open the handbook. On this eval, retrieval usually found the right page: 0.89 mean gold-chunk coverage and 0.97 any coverage. The labels have an internal audit today; the external human audit is accepted for project use, with raw annotator files still to import.
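A sketch of how those two coverage numbers can be computed, assuming each eval row carries a set of gold chunk ids and a set of retrieved chunk ids (field names are mine):

```python
def coverage_metrics(rows: list[dict]) -> tuple[float, float]:
    """rows: [{"gold": set of gold chunk ids, "retrieved": set of retrieved ids}, ...]"""
    per_row = [len(r["gold"] & r["retrieved"]) / len(r["gold"]) for r in rows]
    mean_gold = sum(per_row) / len(per_row)            # reported here: 0.89
    any_cov = sum(p > 0 for p in per_row) / len(rows)  # reported here: 0.97
    return mean_gold, any_cov
```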

The first test had a hidden flaw: it contained zero approve-labelled rows. That made “never approve anything” look smart. Replacing that test was the biggest quality lift in the project.

3. Findings

R0 — no evidence

R0 is the model walking into the exam with no handbook. It ended up looking safe because it barely says yes: Llama made only 4 approvals out of the 40 tickets that were safe to approve; Qwen and Nemotron made zero. This is conservative collapse: the model tries to avoid danger by refusing to help.

Conservative collapse, illustrated

The balanced eval set contains 40 safe-to-approve tickets. The goal isn't just a low unsafe-approval rate; it is to keep the safe approvals while blocking the unsafe ones.

R0, no evidence: 4 of 40 safe approvals preserved. It approves almost nothing, so unsafe = 0%, but usefulness is near zero too.
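Scoring both sides together is cheap; here is a small sketch under assumed field names:

```python
def safety_and_usefulness(rows: list[dict]) -> dict:
    """Each row: {"pred": model decision, "safe_to_approve": bool gold label}."""
    unsafe = sum(r["pred"] == "approve" and not r["safe_to_approve"] for r in rows)
    kept = sum(r["pred"] == "approve" and r["safe_to_approve"] for r in rows)
    total_safe = sum(r["safe_to_approve"] for r in rows)
    return {"unsafe_approval_rate": unsafe / len(rows),     # want 0
            "safe_approvals_kept": f"{kept}/{total_safe}"}  # want 40/40 here
```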

R1 — retrieved evidence (with a random-evidence control)

R1 gives the model retrieved evidence: the ticket plus the handbook pages (as embedded chunks) that seem relevant. For Llama, accuracy rose to 66%, and approvals returned to reasonable territory. I also tried feeding random policy snippets, which dropped performance by more than 10 points. This tells us the content of the evidence mattered, not just the structure of a citation block.

R2 — deterministic gate

R2 adds a deterministic gate, meaning it follows fixed rules: if a reply lacks a citation, trips a risk flag, or conflicts with policy, it is blocked. This is useful, but here it became too strict. It removed unsafe approvals by blocking every approval, including the good ones. A gate can catch clear rule violations, but it cannot turn a confused answer into a good one.
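For illustration, a minimal sketch of such a gate, assuming a parsed review object; field names are mine, and demoting a blocked approval to escalate (rather than discarding it) is my assumption:

```python
def gate(review: dict, evidence_ids: set[int]) -> dict:
    """Block approvals that fail any hard rule; never rewrite the reply."""
    reasons = []
    if review["decision"] == "approve":
        cites = review.get("citations", [])
        if not cites:
            reasons.append("no citation")
        if any(c not in evidence_ids for c in cites):
            reasons.append("out-of-bounds citation")
        if review.get("risk_flags"):
            reasons.append("risk flag raised")
    if reasons:
        return {**review, "decision": "escalate", "gate_reasons": reasons}
    return review
```

Note what this cannot do: if the underlying answer was confused, the gate only converts it into a refusal, which is exactly the over-blocking described above.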

Method × Model matrix

All rows on the 150-row balanced locked eval. The interactive chart lets you switch metrics to see how each method ladder differs by family; a static version appears below.

R3 — process-supervised LoRA

R3 is where the system started to look like a trained reviewer. I trained it not only on final answers but on process: which decision to make, which claim was supported by policy, which citation belonged to that claim, and which risks should block approval. Llama in R3 reached 95.3% accuracy, kept all 40 safe approvals, made no unsafe approvals, and returned clean JSON every time. Qwen showed a strong research signal but broke JSON on 7 rows, and a broken output is still a broken product.
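For a feel of what process supervision means at the data level, here is roughly what one training row would look like given that description; the ticket and exact schema are hypothetical, and the real dataset's field names may differ:

```python
train_row = {
    "ticket_id": "nm-0042",                  # hypothetical example row
    "evidence": ["[0] Refunds are available within 30 days ...",
                 "[1] Service is restricted in the following countries ..."],
    "target": {
        "decision": "reject",                # final decision label
        "reply": "We can't process this refund because ...",
        "claims": [                          # claim-level supervision
            {"span": "refunds are limited to 30 days",
             "citation": 0,                  # which evidence chunk backs the claim
             "supported": True},
        ],
        "risk_flags": ["restricted_region"], # anything that should block approval
    },
}
```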

Train → dev → eval: one direction only

Checkpoints selected on dev. The locked 150-row eval is touched once per winner, so tuning pressure never leaks into the headline numbers.

Train (200 rows, Tinker LoRA SFT) → Dev (50 rows, product metrics) → Locked eval (150 rows, final matrix)

  • Train: process-supervised targets (decision + reply + claim spans + citations + risk flags).
  • Dev: checkpoints are selected on dev product metrics, not loss; epochs 1–8 scored, best kept.
  • Locked eval: balanced decision mix (40/40/45/25); touched once per selected winner.

Safety vs usefulness

Top-left is the product goal: high decision accuracy, low unsafe approval. R3 LoRA points cluster in the top-left corner; R1 without a gate wanders right; R2 drags down usefulness.

(Legend: Llama · Qwen · Nemotron; green zone = product target.)

Paired bootstrap, 5000 resamples

Decision-accuracy improvement with 95% CI. Positive = candidate wins. Dots right of 0 = reliably better.

  • Llama R1 retrieved vs R0: +24.2pp
  • Llama R1 retrieved vs R1 random: +10.7pp
  • Llama R2 gate vs R1 retrieved: -21.3pp
  • Llama R3 LoRA vs R1 retrieved: +29.3pp
  • Qwen R3 LoRA vs R1 retrieved: +57.3pp
  • Nemotron R3 LoRA vs R1 retrieved: +41.2pp
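For reference, a minimal sketch of the paired-bootstrap procedure behind these intervals (a generic implementation, not the project's exact script):

```python
import random

def paired_bootstrap(cand: list[int], base: list[int], n: int = 5000, seed: int = 0):
    """cand/base: 0/1 per-row correctness for the two methods, same row order."""
    rng = random.Random(seed)
    rows = range(len(cand))
    deltas = []
    for _ in range(n):
        sample = [rng.choice(rows) for _ in rows]   # resample rows with replacement
        deltas.append(sum(cand[i] - base[i] for i in sample) / len(sample))
    deltas.sort()
    # mean Δ in accuracy plus a 95% percentile CI; CI above 0 = reliable win
    return sum(deltas) / n, (deltas[int(0.025 * n)], deltas[int(0.975 * n)])
```

Pairing matters: resampling the same rows for both methods cancels ticket-level difficulty, so the interval reflects the method difference rather than the eval mix.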

Final method matrix (static)

Same results as the interactive view, frozen as a paper-style reference.

Full matrix table (15 rows):

| Model | Method | Parsed | Accuracy | Unsafe | Approvals | OOB cites | Errors |
|---|---|---|---|---|---|---|---|
| Llama | R0 | 149/150 | 41.6% | 0.0% | 4 | 0 | 1 |
| Llama | R1 retrieved | 150/150 | 66.0% | 3.3% | 42 | 2 | 0 |
| Llama | R1 random | 150/150 | 55.3% | 4.0% | 32 | 3 | 0 |
| Llama | R2 gate | 150/150 | 44.7% | 0.0% | 0 | 2 | 0 |
| Llama | R3 LoRA | 150/150 | 95.3% | 0.0% | 40 | 0 | 0 |
| Qwen | R0 | 149/150 | 26.8% | 0.0% | 0 | 1 | 1 |
| Qwen | R1 retrieved | 150/150 | 40.7% | 30.0% | 82 | 1 | 0 |
| Qwen | R1 random | 150/150 | 38.7% | 14.7% | 48 | 37 | 0 |
| Qwen | R2 gate | 150/150 | 32.0% | 22.0% | 52 | 1 | 0 |
| Qwen | R3 LoRA | 143/150 | 100.0% (parsed) | 0.0% | 40 | 0 | 7 |
| Nemotron | R0 | 132/150 | 31.8% | 0.0% | 0 | 155 | 18 |
| Nemotron | R1 retrieved | 149/150 | 48.3% | 0.7% | 33 | 20 | 1 |
| Nemotron | R1 random | 150/150 | 38.0% | 6.0% | 31 | 41 | 0 |
| Nemotron | R2 gate | 149/150 | 34.9% | 0.0% | 12 | 20 | 1 |
| Nemotron | R3 LoRA | 149/150 | 89.9% (parsed) | 0.0% | 40 | 0 | 1 |

4. Surprises

Three things changed the plan. 1) The original eval rewarded doing nothing, so I had to rebuild it. 2) More training was not automatically better. A checkpoint is a saved version of the model after some training; Llama’s second checkpoint ended up the best, while later checkpoints kept studying the examples but drifted toward unsafe approvals. 3) Output format matters. If the app expects JSON and the model returns something else, even occasionally, that is a broken handoff.
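Concretely, the handoff check can be as strict as this sketch (the required keys are my assumption about the app's schema):

```python
import json

REQUIRED_KEYS = {"decision", "reply", "citations", "risk_flags"}

def parse_review(raw: str) -> dict | None:
    """Return the parsed review, or None for a broken handoff (counted as an error)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS <= obj.keys():
        return None
    return obj
```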

Checkpoint drift across 8 epochs

Dev accuracy plateaus early. Unsafe approval creeps up in later epochs. We selected on product metrics — not the loss curve or final epoch.

R3 training loss, all three families

8 epochs × 13 batches = 104 optimizer steps. Loss drops cleanly but that is not what selected the winner — dev-set product metrics did.

Training loss is the model’s practice-score while it studies. It tells us whether training is happening; it does not tell us whether customers are safer. I picked checkpoints on a separate 50-ticket dev set using product metrics: accuracy, unsafe approval, useful approvals, parse errors, and citations that stayed inside the provided evidence.
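As a sketch, selection by product metrics can be as simple as the following; the hard disqualifiers and the tie-break weight are illustrative, not the exact rule used:

```python
def pick_checkpoint(checkpoints: list[dict]) -> dict:
    """checkpoints: one dict of dev-set product metrics per saved epoch."""
    def score(c: dict) -> float:
        if c["unsafe_approval_rate"] > 0 or c["parse_errors"] > 0:
            return -1.0                      # unsafe or unparseable: never ship
        return c["accuracy"] + 0.1 * c["safe_approvals_kept"]
    return max(checkpoints, key=score)       # best product behavior, not lowest loss
```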

5. Production implications

On Tinker, Llama R3 cost about 0.06¢ per review (see the cost model below), and retraining the LoRA adapter is roughly $0.60 per refresh. Even at 1M reviews/month, the model bill would sit in the hundreds of dollars. The real cost is maintenance: updating policy chunks, refreshing the eval when support policies change, re-auditing labels, and deciding who owns an unsafe approval.

Cost at scale, per method

Tinker prefill/sample/train rates from docs. Other platforms from public pricing. Token shape: R0 ticket only (~1.2k in / 0.5k out); R1/R2/R3 + retrieved evidence (~2.8k in / 0.6k out). Numbers are list-price orientation, not quotes.

(Chart: monthly cost across volumes of 1k, 10k, 100k, and 1,000k reviews. R1 ≈ 0.06¢/review · R3 ≈ 0.06¢/review.)
Assumptions and what changes the answer
  • Token shape from V2 eval runs: R1/R2/R3 all ship ~2.8k evidence-loaded input tokens; R0 is ~1.2k.
  • R3 training: 104 optimizer steps × ~15k train tokens/step ≈ 1.56M tokens per retrain (worked through in the sketch after this list).
  • MoE pricing: Qwen/Nemotron MoE cheaper on Tinker because priced by active parameters, not total.
  • Other platforms: list-price blends for 8B/30B-class models, not LoRA-adapter-specific SKUs. Most managed platforms charge extra to serve a fine-tuned adapter.
  • Retrieval cost: not included — NimbusMail has ~2k chunks; a managed index runs $0–50/mo at this scale.
  • Human review: out of scope. This priced only the LLM layer.
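Working through those assumptions with the Tinker Llama rates reproduces the figures quoted above; a back-of-envelope sketch, not the billing model:

```python
IN_TOK, OUT_TOK = 2800, 600      # R1/R2/R3 evidence-loaded token shape
PREFILL, SAMPLE = 0.13, 0.40     # Tinker Llama $ per 1M tokens

per_review = (IN_TOK * PREFILL + OUT_TOK * SAMPLE) / 1e6
retrain = 1.56 * 0.40            # ~1.56M train tokens × $0.40/1M
print(f"{per_review * 100:.3f}¢/review, ~${retrain:.2f}/retrain")
# → 0.060¢/review and ~$0.62/retrain, matching the ≈0.06¢ and ~$0.60 above
```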

Platform options and list prices (orientation only)

| Platform | Llama-8B in/out | Qwen-30B in/out | Training / adapter note |
|---|---|---|---|
| Tinker (ours) | $0.13 / $0.40 | $0.12 / $0.30 | LoRA train $0.36–0.40/1M tokens |
| Fireworks | $0.20 / $0.20 | $0.15 / $0.60 | LoRA SFT $0.50/1M; H100 $6/hr |
| Together | $0.18 / $0.18 | | LoRA SFT 17–69B $1.50/1M |
| DeepInfra | $0.02 / $0.05 | $0.08 / $0.28 | No managed FT |
| Modal + vLLM | | | Raw GPU: H100 $3.95/hr |

All prices are per 1M tokens, using list prices from 2026-04. They are orientation numbers, not procurement quotes. Source links are in the appendix.

Maintenance cost at three volumes

| Rung | One-time work | Recurring | What triggers re-work |
|---|---|---|---|
| R0 | Prompt + schema | Near-zero | Model swap |
| R1 | Policy/KB chunking + retrieval eval | Chunk refresh on policy change | Policy updates, KB drift |
| R2 | Deterministic rules (citation, risk, policy) | Rule audit quarterly | New risk categories, rule false positives |
| R3 | Process-supervised training data (~200 rows) | LoRA retrain per policy shift; dev eval per retrain | Policy changes, label drift, eval rotation |

6. Recommendation

In this short set of experiments, Llama R3 stood out, with R1 as the simpler fallback. But I would not treat this demo as proof on its own; before making external claims, the eval still needs a full human audit. Still, I believe it makes sense to add complexity only when the simpler layer has failed in a measurable way, which is exactly what this ladder tested.

What I’d take into the next LLM system

  1. Build the test before the method. Balance it across every decision the system must learn.
  2. Measure safety and usefulness together. “Never says yes” can look safe while being useless.
  3. Use retrieval when the model needs facts. Test random evidence too, so you know whether the facts actually helped.
  4. Use gates to catch clear rule breaks, not to rescue bad reasoning.
  5. Choose checkpoints by product behavior, not by the prettiest training-loss curve.
  6. Treat citations, parse reliability, and maintenance as product work, not backend chores.

Appendix

R3 training config
  • 200 training rows / 50 dev rows / 150 locked-eval rows
  • 8 epochs, 104 optimizer steps, batch size 16
  • All-linear LoRA target modules, rank 16, blocker weight
  • Training on Tinker; checkpoints stored per epoch; selected on dev product metrics
  • Process-supervised targets include decision, reply, claim-level support labels, citations, risk flags
Tinker models in use
| Family | Tinker ID | Prefill | Sample | Train |
|---|---|---|---|---|
| Llama | Llama-3.1-8B-Instruct | $0.13 | $0.40 | $0.40 |
| Qwen | Qwen3-30B-A3B-Instruct-2507 | $0.12 | $0.30 | $0.36 |
| Nemotron | Nemotron-3-Nano-30B-A3B | $0.13 | $0.33 | $0.40 |

Per 1M tokens. Source: tinker-docs/models.

Confusion matrices (Llama, all 5 rungs)
(Five confusion-matrix panels, one per rung: r0, r1-retrieved, r1-random, r2-gate, r3-lora.)
Limits and what would change the headline
  • Label audit. Provisional internal audit; external-human adjudication is accepted for project use but not imported into repo.
  • Eval size. 150 rows. Enough for an engineering signal, not an academic benchmark.
  • Parse reliability. Qwen R3 recovers to 150/150 with a parser-repair pass. Framed as “Qwen R3 + repair,” not raw R3.
  • R2 gate is intentionally simple and over-blocks. A claim-level verifier (R2.2) does better but still trails R3.
  • Synthetic data is source-grounded and leakage-audited, but synthetic nonetheless.
  • Cost model ignores retrieval infra and human review labor — both are small at this scale but real.
Source docs and artifacts
  • Final report: docs/final-ml-product-report-polished-2026-04-20.md
  • Rubric: docs/grounded-reply-eval-rubric.md
  • Decision log: docs/decision-log.md
  • Teaching phases: docs/teaching/phase-00 through phase-08
  • Final matrix CSV: evals/grounded-evalstats/v2-balanced-final-matrix.csv
  • Bootstrap summary: evals/grounded-evalstats/v2-balanced-bootstrap-summary.md
  • W&B run: nad707-self/relay/runs/wqtldjb5