# llm-evaluation
LLM-as-Judge patterns for evaluating AI agent outputs — direct scoring with rubrics, pairwise comparison with bias mitigation, and domain-specific rubric generation. Use when assessing quality of AI-generated content, comparing model outputs, or building evaluation pipelines.
| Model | Source |
|---|---|
| sonnet | pack: context-engineering |
## Full Reference
## LLM Evaluation

LLM-as-Judge uses a language model to evaluate outputs from another model (or the same one). It works well for tasks where human judgment is expensive, inconsistent, or doesn't scale: code quality, writing tone, factual accuracy, instruction following.
### When to Use

| Situation | Pattern |
|---|---|
| Score a single output against criteria | Direct scoring |
| Choose best output from two candidates | Pairwise comparison |
| Consistent evaluation across many outputs | Rubric-based scoring |
| High-stakes evaluation with bias concern | Position-swap + averaging |
### Core Principles

- **CoT before score** — always make the judge reason before outputting a score. Raw scores without justification are unreliable and unauditable.
- **Bias is the enemy** — position, length, and self-enhancement biases systematically skew LLM judges. Mitigate by design, not by hoping.
- **Rubrics reduce variance** — open-ended "rate this 1-10" prompts produce noise; criterion-anchored rubrics with per-score descriptions produce signal.
- **Calibrate strictness** — lenient rubrics overrate everything and strict rubrics underrate everything. Choose based on your deployment context.
### Reference Index

| I want to… | File |
|---|---|
| Score outputs with weighted criteria and CoT justification | reference/scoring-patterns.md |
| Compare two outputs with bias mitigation and tie detection | reference/pairwise-comparison.md |
| Understand and eliminate systematic judge biases | reference/bias-mitigation.md |
| Generate domain-specific rubrics at different strictness levels | reference/rubric-generation.md |
---
## Quick Start

### Direct Scoring
```python
JUDGE_PROMPT = """You are evaluating an AI assistant's response.

<task>{task}</task>
<response>{response}</response>

Evaluate on these criteria:
- Accuracy (40%): Is the information correct and complete?
- Clarity (30%): Is the response easy to understand?
- Conciseness (30%): Does it avoid unnecessary verbosity?

Think through each criterion step by step, then output a JSON score object.

<evaluation>[Your reasoning here]</evaluation>
<score>{{"accuracy": <1-5>, "clarity": <1-5>, "conciseness": <1-5>, "weighted_total": <1-5>}}</score>"""
```
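Once the judge replies, the `<score>` block can be parsed and the weighted total recomputed client-side rather than trusted from the model. A minimal sketch, assuming the reply follows the template above (the `WEIGHTS` mapping and the sample reply are illustrative):

```python
import json
import re

# Weights mirror the criteria percentages in the prompt above.
WEIGHTS = {"accuracy": 0.4, "clarity": 0.3, "conciseness": 0.3}

def parse_judge_reply(reply: str) -> dict:
    """Extract the <score> JSON and recompute the weighted total."""
    match = re.search(r"<score>(.*?)</score>", reply, re.DOTALL)
    if match is None:
        raise ValueError("judge reply missing <score> block")
    scores = json.loads(match.group(1))
    scores["weighted_total"] = round(
        sum(scores[c] * w for c, w in WEIGHTS.items()), 2
    )
    return scores

# Hypothetical judge reply, for illustration only
reply = '<evaluation>...</evaluation><score>{"accuracy": 4, "clarity": 5, "conciseness": 3}</score>'
print(parse_judge_reply(reply))  # weighted_total = 4*0.4 + 5*0.3 + 3*0.3 = 4.0
```

Recomputing the total guards against the judge emitting an arithmetic error in `weighted_total`.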
### Pairwise Comparison

```python
def evaluate_pair(task, response_a, response_b, judge_model):
    # Forward pass
    result_forward = judge(judge_model, task, response_a, response_b, "A", "B")
    # Swapped pass — mitigates position bias
    result_swapped = judge(judge_model, task, response_b, response_a, "B", "A")

    if result_forward == result_swapped:
        return result_forward  # High confidence
    else:
        return "tie"  # Inconsistent — call it a tie
```
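Run over a dataset, the pairwise judge yields a win rate; ties from inconsistent verdicts are commonly counted as half a win. A sketch with a stubbed-out judge (the stub, which just prefers the shorter response, and the dataset are placeholders, not a real model call):

```python
def win_rate(pairs, judge_fn):
    """Fraction of pairs won by model A, counting ties as half a win."""
    wins = 0.0
    for task, resp_a, resp_b in pairs:
        verdict = judge_fn(task, resp_a, resp_b)
        if verdict == "A":
            wins += 1.0
        elif verdict == "tie":
            wins += 0.5
    return wins / len(pairs)

# Stub judge for illustration: prefers the shorter response, ties on equal length
def stub_judge(task, a, b):
    if len(a) < len(b):
        return "A"
    if len(a) > len(b):
        return "B"
    return "tie"

pairs = [
    ("q1", "short", "a longer answer"),
    ("q2", "equal!", "equal?"),
    ("q3", "verbose response", "terse"),
]
print(win_rate(pairs, stub_judge))  # (1 + 0.5 + 0) / 3 = 0.5
```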
## Evaluation Pipeline Architecture

```
Input: (task, response[s])
        │
        ▼
Rubric Selection ──── domain + strictness
        │
        ▼
Judge Prompt Build ── CoT template + criteria
        │
        ▼
Judge Model Call ──── parse structured output
        │
        ▼
Bias Check ────────── position-swap if pairwise
        │
        ▼
Score Aggregation ─── weighted average across criteria
        │
        ▼
Output: score + justification + confidence
```

## Calibration

Before deploying an evaluation pipeline:
1. **Collect gold set** — 50–200 human-labeled examples
2. **Run judge** — evaluate the same examples with your judge prompt
3. **Measure correlation** — Cohen's kappa or Spearman's rho vs. human labels
4. **Iterate** — adjust rubric anchors until kappa > 0.7
5. **Monitor drift** — re-calibrate every 2–4 weeks in production
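The correlation step doesn't need a stats library: for two raters over categorical labels, Cohen's kappa is (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement from the label marginals. A self-contained sketch (the sample labels are made up for illustration):

```python
from collections import Counter

def cohens_kappa(human: list, judge: list) -> float:
    """Agreement between human and judge labels, corrected for chance."""
    assert len(human) == len(judge) and human
    n = len(human)
    # Observed agreement: fraction of items with identical labels
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each label's marginal frequencies
    h_counts, j_counts = Counter(human), Counter(judge)
    p_e = sum(h_counts[lbl] * j_counts[lbl] for lbl in h_counts) / n**2
    return (p_o - p_e) / (1 - p_e)

human = [5, 4, 3, 5, 2, 4, 3, 5]
judge = [5, 4, 3, 4, 2, 4, 3, 5]
print(round(cohens_kappa(human, judge), 3))  # one disagreement out of eight
```

For graded (1–5) scores, a rank correlation like Spearman's rho is often a better fit than kappa, since it credits near-misses.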
| Kappa | Interpretation |
|---|---|
| < 0.4 | Poor — rubric needs rework |
| 0.4–0.6 | Moderate — acceptable for low-stakes |
| 0.6–0.8 | Substantial — production-ready |
| > 0.8 | Near-perfect — ship it |
## Meta-Evaluation

Evaluate your judge, not just your outputs:

- **Consistency** — same input → same score across runs (test with temperature 0)
- **Discriminability** — scores spread across the full range, not clustered at 3–4
- **Alignment** — judge rankings match human rankings on the gold set
- **Explainability** — justifications are legible and actionable
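The first two checks are easy to automate; a sketch, assuming you can re-run the judge and collect per-item scores (the threshold and sample runs are illustrative):

```python
from statistics import pstdev

def consistency_rate(runs: list[list[int]]) -> float:
    """Fraction of items that got an identical score in every run."""
    items = list(zip(*runs))  # transpose: one tuple of scores per item
    return sum(len(set(scores)) == 1 for scores in items) / len(items)

def is_discriminative(scores: list[int], min_spread: float = 0.75) -> bool:
    """Flag judges whose scores cluster instead of using the full range."""
    return pstdev(scores) >= min_spread

run_1 = [5, 3, 4, 2, 5, 1]
run_2 = [5, 3, 4, 3, 5, 1]  # one item flipped between runs
print(consistency_rate([run_1, run_2]))  # 5 of 6 items stable
print(is_discriminative(run_1))
```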
**Usage:** Read the reference file for your current pattern. Each file is self-contained with prompt templates, code, and failure modes.