
llm-evaluation

LLM-as-Judge patterns for evaluating AI agent outputs — direct scoring with rubrics, pairwise comparison with bias mitigation, and domain-specific rubric generation. Use when assessing quality of AI-generated content, comparing model outputs, or building evaluation pipelines.

| Model | Source |
|---|---|
| sonnet | pack: context-engineering |
| Situation | Pattern |
|---|---|
| Score a single output against criteria | Direct scoring |
| Choose best output from two candidates | Pairwise comparison |
| Consistent evaluation across many outputs | Rubric-based scoring |
| High-stakes evaluation with bias concern | Position-swap + averaging |
Full Reference

┏━ 🔍 llm-evaluation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ LLM-as-Judge patterns for evaluating AI outputs ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

LLM-as-Judge uses a language model to evaluate outputs from another model (or the same one). Works for tasks where human judgment is expensive, inconsistent, or doesn’t scale — code quality, writing tone, factual accuracy, instruction following.


CoT before score — always make the judge reason before outputting a score. Raw scores without justification are unreliable and unauditable.

Bias is the enemy — position, length, and self-enhancement biases systematically skew LLM judges. Mitigate by design, not by hoping.

Rubrics reduce variance — open-ended “rate this 1-10” produces noise. Criterion-anchored rubrics with per-score descriptions produce signal.

Calibrate strictness — lenient rubrics overrate everything; strict rubrics underrate everything. Choose based on your deployment context.
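The rubric principle above can be sketched as a per-score anchor map. The criterion name, anchor wording, and helper below are illustrative, not from the reference files:

```python
# A criterion-anchored rubric: every score gets a concrete behavioral
# description, so the judge anchors to observable properties, not vibes.
ACCURACY_RUBRIC = {
    5: "All claims correct and complete; nothing a domain expert would flag.",
    4: "Correct on all major points; minor omissions that don't affect the answer.",
    3: "Mostly correct, but one claim is wrong or an important detail is missing.",
    2: "Multiple errors, or a central claim is wrong.",
    1: "Fundamentally incorrect or unrelated to the task.",
}


def render_rubric(name: str, anchors: dict) -> str:
    """Render a rubric as prompt text, highest score first."""
    lines = [f"{name} (score 1-5):"]
    for score in sorted(anchors, reverse=True):
        lines.append(f"  {score}: {anchors[score]}")
    return "\n".join(lines)
```

The rendered text drops straight into a judge prompt in place of an open-ended "rate this 1-10" instruction.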

| I want to… | File |
|---|---|
| Score outputs with weighted criteria and CoT justification | reference/scoring-patterns.md |
| Compare two outputs with bias mitigation and tie detection | reference/pairwise-comparison.md |
| Understand and eliminate systematic judge biases | reference/bias-mitigation.md |
| Generate domain-specific rubrics at different strictness levels | reference/rubric-generation.md |

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

```python
JUDGE_PROMPT = """
You are evaluating an AI assistant's response.

<task>{task}</task>
<response>{response}</response>

Evaluate on these criteria:
- Accuracy (40%): Is the information correct and complete?
- Clarity (30%): Is the response easy to understand?
- Conciseness (30%): Does it avoid unnecessary verbosity?

Think through each criterion step by step, then output a JSON score object.

<evaluation>
[Your reasoning here]
</evaluation>
<score>
{{"accuracy": <1-5>, "clarity": <1-5>, "conciseness": <1-5>, "weighted_total": <1-5>}}
</score>
"""


def evaluate_pair(task, response_a, response_b, judge_model):
    # Forward pass
    result_forward = judge(task, response_a, response_b, "A", "B")
    # Swapped pass — mitigates position bias
    result_swapped = judge(task, response_b, response_a, "B", "A")
    if result_forward == result_swapped:
        return result_forward  # High confidence
    else:
        return "tie"  # Inconsistent — call it a tie
```

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Input: (task, response[s])
Rubric Selection ──── domain + strictness
Judge Prompt Build ── CoT template + criteria
Judge Model Call ──── parse structured output
Bias Check ────────── position-swap if pairwise
Score Aggregation ─── weighted average across criteria
Output: score + justification + confidence
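The score-aggregation step can be sketched as a weighted average. The weights mirror the judge prompt above; the function name is illustrative:

```python
def aggregate(scores: dict, weights: dict) -> float:
    """Weighted average of per-criterion scores. Assumes weights sum to 1."""
    return sum(scores[c] * weights[c] for c in weights)


weights = {"accuracy": 0.4, "clarity": 0.3, "conciseness": 0.3}
scores = {"accuracy": 4, "clarity": 5, "conciseness": 3}
aggregate(scores, weights)  # 4*0.4 + 5*0.3 + 3*0.3 ≈ 4.0
```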

Before deploying an evaluation pipeline:

  1. Collect gold set — 50–200 human-labeled examples
  2. Run judge — evaluate same examples with your judge prompt
  3. Measure correlation — Cohen’s kappa or Spearman’s rho vs human labels
  4. Iterate — adjust rubric anchors until kappa > 0.7
  5. Monitor drift — re-calibrate every 2–4 weeks in production
| Kappa | Interpretation |
|---|---|
| < 0.4 | Poor — rubric needs rework |
| 0.4–0.6 | Moderate — acceptable for low-stakes |
| 0.6–0.8 | Substantial — production-ready |
| > 0.8 | Near-perfect — ship it |
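Cohen's kappa from step 3 needs no extra dependencies for categorical labels; a minimal sketch:

```python
from collections import Counter


def cohens_kappa(human: list, judge: list) -> float:
    """Cohen's kappa: chance-corrected agreement between two label lists."""
    assert len(human) == len(judge) and human
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Expected agreement if both raters labeled independently at these rates
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# 9/10 raw agreement on binary pass/fail labels
human = ["pass"] * 5 + ["fail"] * 5
judge_labels = ["pass"] * 4 + ["fail"] * 6
cohens_kappa(human, judge_labels)  # 0.8 — above the 0.7 target
```

For ordinal 1–5 scores, Spearman's rho (or a weighted kappa) is the better fit, since plain kappa treats a 4-vs-5 disagreement the same as 1-vs-5.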

Evaluate your judge, not just your outputs:

  • Consistency — same input → same score across runs (test with temperature 0)
  • Discriminability — scores spread across full range, not clustered at 3–4
  • Alignment — judge rankings match human rankings on gold set
  • Explainability — justifications are legible and actionable
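The first two checks can be automated over repeated judge runs on the same gold set; a rough sketch (the function and metric names are mine, not standard terminology):

```python
from statistics import mean, pstdev


def judge_health(score_runs: list) -> dict:
    """score_runs: one list of per-example scores for each repeated run
    over the same gold set (run the judge at temperature 0).

    - mean_run_spread: per-example std across runs; want ~0 (consistency)
    - score_spread: std across examples within one run; near 0 means the
      judge clusters everything at one score (poor discriminability)
    """
    per_example = [pstdev(scores) for scores in zip(*score_runs)]
    return {
        "mean_run_spread": mean(per_example),
        "score_spread": pstdev(score_runs[0]),
    }


# Two identical runs: perfectly consistent, scores well spread
judge_health([[5, 3, 1, 4, 2], [5, 3, 1, 4, 2]])
# mean_run_spread 0.0, score_spread ≈ 1.41
```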

Usage: Read the reference file for your current pattern. Each file is self-contained with prompt templates, code, and failure modes.