TL;DR. I tested a small AI reviewer the way you might test a new employee: first with no handbook, then with the right handbook pages, then with a rule checker beside it, then after a short training round. The best version could say, in effect: “Here is the policy I used, here is the risk I see, and here is why this ticket should or should not be approved.”
Winner model: Llama 3.1 8B-Instruct trained with LoRA · 95.3% decision accuracy · no unsafe approvals · all 40 safe approvals kept · no broken JSON · no mystery citations.
For this exploration, the practical question was: if I put an LLM reviewer in front of support replies, what is the minimum support it needs so that it behaves like a useful teammate instead of a very confident autocomplete box?
Two easy answers usually appear when discussing solutions with colleagues: “give it documents” or “fine-tune it.” But those are names, not plans. So I built a small environment where I could watch each idea work on a concrete product. NimbusMail has realistic support rules: refunds, billing disputes, deliverability problems, account deletion, and countries where service is restricted.
The review-screen demo is the visible artifact, but the real product here is the measurement trail: a record of the approaches tried that should make it possible to reach the same decisions again for a different workflow.
I chose Tinker Labs to provide model inference and training and settled on three models based on their parameter counts and reported performance. In LLM work, an eval is just a test set, and “locked evals” means I do not keep changing the test until the model looks good. To measure the models' work, I generated a 150-ticket exam set based on the policy/KB and locked it before looking at the final answers. The exam was balanced across four support decisions: approve, edit, escalate, or reject. I then defined four test environments: R0 gives the model only the ticket, R1 adds relevant policy snippets, R2 adds a hard rule checker, and R3 adds a small trained adapter called a LoRA. Think of LoRA as a clip-on memory layer: it nudges the model toward the behavior it practiced earlier without rebuilding the whole model.
- R0: Ticket only. What does the model do with its prior?
- R1 retrieved: Ticket plus top-K NimbusMail policy and KB chunks.
- R1 random: Same prompt, wrong chunks. Grounding sanity check.
- R2 gate: Policy/citation/risk verifier on top of R1 outputs.
- R3 LoRA: All-linear LoRA trained on decisions + claim spans + citations + risk flags.
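Every rung is graded against the same structured output contract. The field names below are my reconstruction from the report's own descriptions (decision, reply, claim spans, citations, risk flags), not the project's exact schema; a minimal parsing sketch in Python:

```python
from dataclasses import dataclass, field
import json

VALID_DECISIONS = {"approve", "edit", "escalate", "reject"}

@dataclass
class Review:
    decision: str                                     # approve / edit / escalate / reject
    reply: str                                        # suggested customer-facing reply
    claims: list = field(default_factory=list)        # [{"span": ..., "citation": ...}]
    risk_flags: list = field(default_factory=list)    # anything that should block approval

def parse_review(raw: str):
    """Parse one model response; return None on a broken handoff."""
    try:
        review = Review(**json.loads(raw))
    except (json.JSONDecodeError, TypeError):
        return None
    if review.decision not in VALID_DECISIONS:
        return None
    return review
```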
Retrieval means fetching the likely-relevant policy pages before the model answers, the way a support rep might open the handbook. On this eval, retrieval usually found the right page: 0.89 mean gold-chunk coverage and 0.97 any-chunk coverage. The labels have had an internal audit; the external human audit is accepted for project use, with the raw annotator files still to be imported.
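For readers unfamiliar with the mechanics, here is a minimal top-K retrieval sketch. The embedding model and chunk format are assumptions on my part; the project's actual retrieval stack may differ.

```python
from sentence_transformers import SentenceTransformer  # assumed embedding library
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

def top_k_chunks(ticket: str, policy_chunks: list, k: int = 4) -> list:
    """Return the k policy/KB chunks most similar to the ticket text."""
    ticket_vec = embedder.encode([ticket], normalize_embeddings=True)
    chunk_vecs = embedder.encode(policy_chunks, normalize_embeddings=True)
    scores = (chunk_vecs @ ticket_vec.T).ravel()   # cosine similarity (unit-norm vectors)
    top = np.argsort(scores)[::-1][:k]
    return [policy_chunks[i] for i in top]
```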
The first test had a hidden flaw: it contained zero approve-labelled rows. That made “never approve anything” look smart. Replacing that test was the biggest quality lift in the project.
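A cheap guard against this class of flaw is to pin the locked eval's checksum and check its decision mix before every run. A sketch, with the file path and field name assumed rather than taken from the repo:

```python
import hashlib
import json
from collections import Counter

def check_locked_eval(path: str, expected_sha256: str) -> Counter:
    """Fail loudly if the frozen eval file changed, then report its decision mix."""
    raw = open(path, "rb").read()
    if hashlib.sha256(raw).hexdigest() != expected_sha256:
        raise RuntimeError("locked eval has been modified since it was frozen")
    decisions = Counter(
        json.loads(line)["decision"] for line in raw.splitlines() if line.strip()
    )
    return decisions  # a balanced set should show all four decisions, approve included
```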
R0 is the model walking into the exam with no handbook. It ended up looking safe because it barely said yes. Llama made only 4 approvals out of the 40 tickets that were safe to approve; Qwen and Nemotron made zero. This is conservative collapse: the model tries to avoid danger by refusing to help.
The balanced eval set contains 40 safe-to-approve tickets. The goal isn't just “low unsafe approval”; it is to keep the safe approvals while blocking the unsafe ones.
R0 no evidence: 4 of 40 safe approvals preserved. Approves almost nothing → unsafe = 0% but useful = 0%.
R1 gives the model retrieved evidence, meaning the ticket plus the handbook pages (as embedded chunks) that seem relevant. For Llama, accuracy rose to 66%, and approvals returned to manageable territory. I also tried giving random policy snippets, which dropped performance by more than 10 points. This tells us the content of the evidence mattered, not just the structure of a citation block.
R2 adds a deterministic gate (meaning it follows fixed rules). If a reply lacks a citation, trips a risk flag, or conflicts with policy, it is blocked. This is useful, but here it became too strict: it removed unsafe approvals by blocking every approval, including the good ones. A gate can catch clear violations, but it usually cannot turn a confused answer into a good one.
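A gate like this is just a handful of fixed checks run over the parsed output. The sketch below is my own simplification; the real verifier also checks policy conflicts, which needs the policy text itself.

```python
def gate(review: dict, evidence_ids: set) -> tuple:
    """Deterministic R2-style check: block the reply rather than repair it.
    `review` is a parsed model output; `evidence_ids` are the citation IDs
    of the chunks that were actually shown to the model."""
    if review.get("decision") == "approve":
        claims = review.get("claims", [])
        if not claims:
            return False, "approval without any cited claims"
        for claim in claims:
            if claim.get("citation") not in evidence_ids:
                return False, "citation points outside the provided evidence"
        if review.get("risk_flags"):
            return False, "risk flag raised: " + review["risk_flags"][0]
    return True, "ok"
```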
All rows on the 150-row balanced locked eval. Switch metric to see how each method ladder differs by family.
R3 is where the system started to look like a trained reviewer. I trained it not only on final answers but on process as well: which decision to make, which claim was supported by policy, which citation belonged to that claim, and which risks should block approval. Llama in R3 reached 95.3% accuracy, kept all 40 safe approvals, made no unsafe approvals, and returned clean JSON every time. Qwen showed a strong research signal but broke JSON on 7 rows, and a broken output is still a broken product.
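Concretely, “training on process” means each training row carries the intermediate structure, not just the final label. One hypothetical row (the field names and policy IDs are illustrative, not the real NimbusMail ones):

```python
training_row = {
    "ticket": "Customer asks for a refund 9 days after renewal.",
    "evidence": ["POL-REFUND-14D", "KB-BILLING-03"],      # retrieved chunk IDs
    "target": {
        "decision": "approve",
        "reply": "Refund approved under the 14-day renewal window.",
        "claims": [
            {"span": "14-day renewal window", "citation": "POL-REFUND-14D"}
        ],
        "risk_flags": [],                                  # nothing blocks this approval
    },
}
```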
Checkpoints selected on dev. The locked 150-row eval is touched once per winner, so tuning pressure never leaks into the headline numbers.
Process-supervised targets: decision + reply + claim spans + citations + risk flags.
Checkpoints are selected on dev product metrics, not loss. Epochs 1–8 scored, best kept.
Balanced decision mix (40/40/45/25). Touched once per selected winner.
Top-left is the product goal: high decision accuracy, low unsafe approval. R3 LoRA points cluster in the top-left corner; R1 without a gate wanders right; R2 drags down usefulness.
Decision-accuracy improvement with 95% CI. Positive = candidate wins. Dots right of 0 = reliably better.
Same results as the interactive view, frozen as a paper-style reference.

| Model | Method | Parsed rows | Decision accuracy | Unsafe approvals | Approvals issued | Out-of-bounds citations | Parse errors |
|---|---|---|---|---|---|---|---|
| Llama | R0 | 149/150 | 41.6% | 0.0% | 4 | 0 | 1 |
| Llama | R1 retrieved | 150/150 | 66.0% | 3.3% | 42 | 2 | 0 |
| Llama | R1 random | 150/150 | 55.3% | 4.0% | 32 | 3 | 0 |
| Llama | R2 gate | 150/150 | 44.7% | 0.0% | 0 | 2 | 0 |
| Llama | R3 LoRA | 150/150 | 95.3% | 0.0% | 40 | 0 | 0 |
| Qwen | R0 | 149/150 | 26.8% | 0.0% | 0 | 1 | 1 |
| Qwen | R1 retrieved | 150/150 | 40.7% | 30.0% | 82 | 1 | 0 |
| Qwen | R1 random | 150/150 | 38.7% | 14.7% | 48 | 37 | 0 |
| Qwen | R2 gate | 150/150 | 32.0% | 22.0% | 52 | 1 | 0 |
| Qwen | R3 LoRA | 143/150 | 100.0% parsed | 0.0% | 40 | 0 | 7 |
| Nemotron | R0 | 132/150 | 31.8% | 0.0% | 0 | 155 | 18 |
| Nemotron | R1 retrieved | 149/150 | 48.3% | 0.7% | 33 | 20 | 1 |
| Nemotron | R1 random | 150/150 | 38.0% | 6.0% | 31 | 41 | 0 |
| Nemotron | R2 gate | 149/150 | 34.9% | 0.0% | 12 | 20 | 1 |
| Nemotron | R3 LoRA | 149/150 | 89.9% parsed | 0.0% | 40 | 0 | 1 |
Three things changed the plan. 1) The original eval rewarded doing nothing, so I had to rebuild it. 2) More training was not automatically better. A checkpoint is a saved version of the model after some training; Llama's second checkpoint ended up the best, and later checkpoints kept studying the examples but drifted toward unsafe approvals. 3) Output format matters. If the app expects JSON and the model sometimes returns something else, even after several rounds of evaluation, that is a broken handoff.
Dev accuracy plateaus early. Unsafe approval creeps up in later epochs. We selected on product metrics — not the loss curve or final epoch.
8 epochs × 13 batches = 104 optimizer steps. Loss drops cleanly but that is not what selected the winner — dev-set product metrics did.
Training loss is the model’s practice score while it studies. It tells us whether training is happening; it does not tell us whether customers are safer. I picked checkpoints on a separate 50-ticket dev set using product metrics: accuracy, unsafe approval, useful approvals, parse errors, and citations that stayed inside the provided evidence.
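In code terms, selection looks roughly like the sketch below: every saved checkpoint is scored on the dev set and the winner is chosen on product metrics, never on loss. The metric names and the ordering of tie-breaks are my own simplification.

```python
def select_checkpoint(dev_metrics: dict) -> str:
    """Pick the checkpoint whose dev-set product metrics look best.
    `dev_metrics` maps checkpoint name -> dict with accuracy, unsafe_rate,
    safe_approvals, and parse_errors measured on the 50-ticket dev set."""
    def score(ckpt: str):
        m = dev_metrics[ckpt]
        # Hard requirements first: no unsafe approvals, no parse errors.
        if m["unsafe_rate"] > 0 or m["parse_errors"] > 0:
            return (0, 0.0, 0)
        # Then prefer accuracy, then preserved safe approvals.
        return (1, m["accuracy"], m["safe_approvals"])
    return max(dev_metrics, key=score)
```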
On Tinker, Llama R3 cost about 0.13¢ per review, and retraining the LoRA adapter is roughly $0.60 per refresh. Even at 1M reviews/month, the model bill would sit in the low hundreds of dollars. The real cost is maintenance: updating policy chunks, refreshing the eval when support policies change, re-auditing labels, and deciding who owns an unsafe approval.
Tinker prefill/sample/train rates from docs. Other platforms from public pricing. Token shape: R0 ticket only (~1.2k in / 0.5k out); R1/R2/R3 + retrieved evidence (~2.8k in / 0.6k out). Numbers are list-price orientation, not quotes.
| Platform | Llama-8B in/out | Qwen-30B in/out | Training / adapter note |
|---|---|---|---|
| Tinker (ours) | $0.13 / $0.40 | $0.12 / $0.30 | LoRA train $0.36–0.40/1M tokens |
| Fireworks | $0.20 / $0.20 | $0.15 / $0.60 | LoRA SFT $0.50/1M; H100 $6/hr |
| Together | $0.18 / $0.18 | — | LoRA SFT 17–69B $1.50/1M |
| DeepInfra | $0.02 / $0.05 | $0.08 / $0.28 | No managed FT |
| Modal + vLLM | — | — | Raw GPU: H100 $3.95/hr |
All prices are per 1M tokens, using list prices from 2026-04. They are orientation numbers, not procurement quotes. Source links are in the appendix.
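If you want to redo the per-review arithmetic with your own rates, the calculation is just token counts times per-million prices; plug in the token shape and any pricing row from the table above.

```python
def cost_per_review(in_tokens: int, out_tokens: int,
                    in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one review, given per-1M-token prices for input and output."""
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

# Example shape: R1/R2/R3 reviews run roughly 2.8k input / 0.6k output tokens,
# so call cost_per_review(2_800, 600, <in price>, <out price>) with a table row.
```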
| Rung | One-time work | Recurring | What triggers re-work |
|---|---|---|---|
| R0 | Prompt + schema | Near-zero | Model swap |
| R1 | Policy/KB chunking + retrieval eval | Chunk refresh on policy change | Policy updates, KB drift |
| R2 | Deterministic rules (citation, risk, policy) | Rule audit quarterly | New risk categories, rule false positives |
| R3 | Process-supervised training data (~200 rows) | LoRA retrain per policy shift; dev eval per retrain | Policy changes, label drift, eval rotation |
In this short set of experiments, Llama R3 stood out, with R1 kept as the simpler fallback. But I would not treat this demo as proof on its own: before making external claims, the eval still needs a full human audit. Still, I believe it makes sense to add complexity only when the simpler layer has failed in a measurable way, which is what this ladder of tests was built to show.
200 training rows / 50 dev rows / 150 locked-eval rows. 8 epochs, 104 optimizer steps, batch size 16, blocker weight 6×.

| Family | Tinker ID | Prefill | Sample | Train |
|---|---|---|---|---|
| Llama | Llama-3.1-8B-Instruct | $0.13 | $0.40 | $0.40 |
| Qwen | Qwen3-30B-A3B-Instruct-2507 | $0.12 | $0.30 | $0.36 |
| Nemotron | Nemotron-3-Nano-30B-A3B | $0.13 | $0.33 | $0.40 |
Per 1M tokens. Source: tinker-docs/models.
- docs/final-ml-product-report-polished-2026-04-20.md
- docs/grounded-reply-eval-rubric.md
- docs/decision-log.md
- docs/teaching/phase-00 through phase-08
- evals/grounded-evalstats/v2-balanced-final-matrix.csv
- evals/grounded-evalstats/v2-balanced-bootstrap-summary.md