
llm-evaluation

LLM-as-Judge patterns for evaluating AI agent outputs — direct scoring with rubrics, pairwise comparison with bias mitigation, and domain-specific rubric generation. Use when assessing quality of AI-generated content, comparing model outputs, or building evaluation pipelines.

| Model | Source |
|---|---|
| sonnet | pack: context-engineering |
| Situation | Pattern |
|---|---|
| Score a single output against criteria | Direct scoring |
| Choose best output from two candidates | Pairwise comparison |
| Consistent evaluation across many outputs | Rubric-based scoring |
| High-stakes evaluation with bias concern | Position-swap + averaging |
Full Reference

┏━ 🔍 llm-evaluation ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ LLM-as-Judge patterns for evaluating AI outputs ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

LLM-as-Judge uses a language model to evaluate outputs from another model (or the same one). Works for tasks where human judgment is expensive, inconsistent, or doesn’t scale — code quality, writing tone, factual accuracy, instruction following.


CoT before score — always make the judge reason before outputting a score. Raw scores without justification are unreliable and unauditable.

Bias is the enemy — position, length, and self-enhancement biases systematically skew LLM judges. Mitigate by design, not by hoping.

Rubrics reduce variance — open-ended “rate this 1-10” produces noise. Criterion-anchored rubrics with per-score descriptions produce signal.

Calibrate strictness — lenient rubrics overrate everything; strict rubrics underrate everything. Choose based on your deployment context.
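The rubric principle above can be sketched as a per-score anchor map. The criterion name, anchor wording, and helper below are illustrative, not from the reference files:

```python
# A criterion-anchored rubric: every score gets a concrete behavioral
# description, so the judge anchors to observable properties, not vibes.
ACCURACY_RUBRIC = {
    5: "All claims correct and complete; nothing a domain expert would flag.",
    4: "Correct on all major points; minor omissions that don't affect the answer.",
    3: "Mostly correct, but one claim is wrong or an important detail is missing.",
    2: "Multiple errors, or a central claim is wrong.",
    1: "Fundamentally incorrect or unrelated to the task.",
}


def render_rubric(name: str, anchors: dict) -> str:
    """Render a rubric as prompt text, highest score first."""
    lines = [f"{name} (score 1-5):"]
    for score in sorted(anchors, reverse=True):
        lines.append(f"  {score}: {anchors[score]}")
    return "\n".join(lines)
```

The rendered text drops straight into a judge prompt in place of an open-ended "rate this 1-10" instruction.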

| I want to… | File |
|---|---|
| Score outputs with weighted criteria and CoT justification | reference/scoring-patterns.md |
| Compare two outputs with bias mitigation and tie detection | reference/pairwise-comparison.md |
| Understand and eliminate systematic judge biases | reference/bias-mitigation.md |
| Generate domain-specific rubrics at different strictness levels | reference/rubric-generation.md |

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

```python
JUDGE_PROMPT = """
You are evaluating an AI assistant's response.

<task>{task}</task>
<response>{response}</response>

Evaluate on these criteria:
- Accuracy (40%): Is the information correct and complete?
- Clarity (30%): Is the response easy to understand?
- Conciseness (30%): Does it avoid unnecessary verbosity?

Think through each criterion step by step, then output a JSON score object.

<evaluation>
[Your reasoning here]
</evaluation>
<score>
{{"accuracy": <1-5>, "clarity": <1-5>, "conciseness": <1-5>, "weighted_total": <1-5>}}
</score>
"""


def evaluate_pair(task, response_a, response_b, judge_model):
    # Forward pass
    result_forward = judge(task, response_a, response_b, "A", "B")
    # Swapped pass — mitigates position bias
    result_swapped = judge(task, response_b, response_a, "B", "A")
    if result_forward == result_swapped:
        return result_forward  # High confidence
    else:
        return "tie"  # Inconsistent — call it a tie
```

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Input: (task, response[s])
Rubric Selection ──── domain + strictness
Judge Prompt Build ── CoT template + criteria
Judge Model Call ──── parse structured output
Bias Check ────────── position-swap if pairwise
Score Aggregation ─── weighted average across criteria
Output: score + justification + confidence
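The score-aggregation step can be sketched as a weighted average. The weights mirror the judge prompt above; the function name is illustrative:

```python
def aggregate(scores: dict, weights: dict) -> float:
    """Weighted average of per-criterion scores. Assumes weights sum to 1."""
    return sum(scores[c] * weights[c] for c in weights)


weights = {"accuracy": 0.4, "clarity": 0.3, "conciseness": 0.3}
scores = {"accuracy": 4, "clarity": 5, "conciseness": 3}
aggregate(scores, weights)  # 4*0.4 + 5*0.3 + 3*0.3 ≈ 4.0
```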

Before deploying an evaluation pipeline:

  1. Collect gold set — 50–200 human-labeled examples
  2. Run judge — evaluate same examples with your judge prompt
  3. Measure correlation — Cohen’s kappa or Spearman’s rho vs human labels
  4. Iterate — adjust rubric anchors until kappa > 0.7
  5. Monitor drift — re-calibrate every 2–4 weeks in production
| Kappa | Interpretation |
|---|---|
| < 0.4 | Poor — rubric needs rework |
| 0.4–0.6 | Moderate — acceptable for low-stakes |
| 0.6–0.8 | Substantial — production-ready |
| > 0.8 | Near-perfect — ship it |
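Cohen's kappa from step 3 needs no extra dependencies for categorical labels; a minimal sketch:

```python
from collections import Counter


def cohens_kappa(human: list, judge: list) -> float:
    """Cohen's kappa: chance-corrected agreement between two label lists."""
    assert len(human) == len(judge) and human
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Expected agreement if both raters labeled independently at these rates
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# 9/10 raw agreement on binary pass/fail labels
human = ["pass"] * 5 + ["fail"] * 5
judge_labels = ["pass"] * 4 + ["fail"] * 6
cohens_kappa(human, judge_labels)  # 0.8 — above the 0.7 target
```

For ordinal 1–5 scores, Spearman's rho (or a weighted kappa) is the better fit, since plain kappa treats a 4-vs-5 disagreement the same as 1-vs-5.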

Evaluate your judge, not just your outputs:

  • Consistency — same input → same score across runs (test with temperature 0)
  • Discriminability — scores spread across full range, not clustered at 3–4
  • Alignment — judge rankings match human rankings on gold set
  • Explainability — justifications are legible and actionable
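The first two checks can be automated over repeated judge runs on the same gold set; a rough sketch (the function and metric names are mine, not standard terminology):

```python
from statistics import mean, pstdev


def judge_health(score_runs: list) -> dict:
    """score_runs: one list of per-example scores for each repeated run
    over the same gold set (run the judge at temperature 0).

    - mean_run_spread: per-example std across runs; want ~0 (consistency)
    - score_spread: std across examples within one run; near 0 means the
      judge clusters everything at one score (poor discriminability)
    """
    per_example = [pstdev(scores) for scores in zip(*score_runs)]
    return {
        "mean_run_spread": mean(per_example),
        "score_spread": pstdev(score_runs[0]),
    }


# Two identical runs: perfectly consistent, scores well spread
judge_health([[5, 3, 1, 4, 2], [5, 3, 1, 4, 2]])
# mean_run_spread 0.0, score_spread ≈ 1.41
```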

Usage: Read the reference file for your current pattern. Each file is self-contained with prompt templates, code, and failure modes.