lubu labs

LLM as Judge in LangSmith: Automated Evaluation That Actually Scales

LLM-as-Judge in LangSmith scales quality evaluation without human review—here's how to configure it, calibrate it, and avoid the biases that break it.

Simon Budziak
CTO

We shipped a RAG pipeline for a client's internal knowledge base. It cleared every test we ran — schema validation, retrieval precision checks, a handful of manually reviewed responses. Production looked clean for the first two weeks. Then a user flagged an answer that confidently cited a policy that had been updated two months prior. We pulled the traces. The hallucination pattern had been there since week one — just at a frequency low enough to survive our spot-check cadence.

That's the structural problem with manual evaluation at scale: you can't sample your way to confidence when a five-percent failure rate means hundreds of bad outputs a week. Spot-checking fifty conversations is not the same as having a signal.
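A rough way to see why: at a sample size of 50, the confidence interval around an observed failure rate is enormous. A back-of-the-envelope sketch using the normal approximation (illustrative numbers, not production statistics):

```python
import math

def rate_confidence_interval(true_rate: float, sample_size: int, z: float = 1.96):
    """Approximate 95% confidence interval for a failure rate
    estimated from a spot-check of the given size."""
    se = math.sqrt(true_rate * (1 - true_rate) / sample_size)
    return max(0.0, true_rate - z * se), true_rate + z * se

# Spot-checking 50 conversations when the true failure rate is 5%:
low, high = rate_confidence_interval(0.05, 50)
print(f"Plausible failure rate: {low:.1%} to {high:.1%}")
```

The interval spans from roughly zero to over 10% — a sample that small can't distinguish "healthy pipeline" from "one in ten answers is bad."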

LLM-as-Judge is the pattern that closes that gap. Instead of a human reviewing each output, a capable model scores it against explicit criteria — returning a structured verdict and a chain-of-thought explanation. LangSmith makes this operationally straightforward to wire into both your offline experiment pipeline and live production monitoring.

The Evaluation Gap

The instinct after shipping an AI pipeline is to add more tests. The problem is that the failure modes that matter most — hallucinations, irrelevant responses, tone drift — don't have clean programmatic definitions.

ROUGE and BERTScore measure surface-level token overlap or embedding similarity. They'll tell you a response is "close" to a reference answer but miss factual inversion entirely. A response that says the opposite of the reference with similar vocabulary scores well.

Schema and regex checks catch structural failures. They don't tell you whether the content inside the structure is correct. A well-formatted JSON response with three hallucinated bullet points passes every schema validator you write.
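To make that concrete, here's a toy sketch — the response payload and the hand-rolled schema check are both hypothetical, but the outcome is the same with any structural validator:

```python
# A structurally perfect response whose content is fabricated
# still clears a schema check.
response = {
    "answer": "Our refund window is 90 days per section 12.7.",  # hallucinated
    "sources": ["policy.pdf"],
    "confidence": 0.97,
}

def validate_schema(resp: dict) -> bool:
    """Checks structure only — says nothing about factual correctness."""
    return (
        isinstance(resp.get("answer"), str)
        and isinstance(resp.get("sources"), list)
        and isinstance(resp.get("confidence"), float)
    )

print(validate_schema(response))  # True — and completely wrong
```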

The gap is semantic quality — the thing that determines whether a user can actually trust the output. Rule-based systems can't measure it. Human review doesn't scale. LLM-as-Judge sits in the middle: automated, semantic, and cheap enough to run continuously.

LLM as Judge: The Core Concept

The mechanism is straightforward. You give a capable model an evaluation prompt that defines your criteria, then feed it the application's input and output (and optionally a reference answer). The judge returns a structured score and its reasoning.

What makes this practical:

  • Reference-free evaluation — you don't need a ground-truth dataset for most quality dimensions. A judge can score faithfulness, helpfulness, or tone without a labeled answer.
  • Critique is easier than generation — evaluating an existing response against criteria is a simpler task than generating that response in the first place. This is why LLM judges achieve strong alignment with human reviewers despite being fully automated.
  • Scales instantly — once configured, the same judge runs across thousands of outputs without additional overhead.

Research across 250,000+ annotated evaluation cases shows LLM judges achieve approximately 85% alignment with human judgment — exceeding the ~81% inter-human agreement baseline. The judge isn't perfect, but it's more consistent than two humans reviewing the same outputs independently.

Setting It Up in LangSmith

LangSmith supports two integration paths: the openevals SDK for programmatic use in experiments and CI, and the LangSmith UI for deploying judges against live traces.

SDK path: openevals

The openevals package is LangChain's official evaluator library. It ships with prebuilt prompts for common evaluation dimensions — correctness, conciseness, helpfulness — and a factory function for building custom judges.

Install it alongside the LangSmith SDK:

```bash
pip install openevals langsmith
```

Basic usage with a prebuilt prompt:

```python
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="openai:gpt-4o",
)

result = evaluator(
    inputs={"question": "What is our refund policy for enterprise contracts?"},
    outputs={"answer": "Enterprise contracts have a 30-day refund window per section 4.2."},
    reference_outputs={"answer": "Enterprise contracts: 30-day refund window, section 4.2."},
)
# result: {"score": True, "reasoning": "The answer correctly identifies..."}
```

For domain-specific evaluation, write a custom prompt using {{input}}, {{output}}, and {{reference}} as variable placeholders. Here's a faithfulness judge for a RAG pipeline — the dimension where hallucinations actually surface:

```python
from openevals.llm import create_llm_as_judge

FAITHFULNESS_PROMPT = """
You are evaluating whether an AI response is grounded in the provided context.

CRITERIA:
- PASS: Every factual claim in the response is supported by the context
- FAIL: The response contains claims not found in or contradicted by the context

Context: {{reference}}
User Query: {{input}}
AI Response: {{output}}

Reasoning: Analyze each factual claim against the context before scoring.
Score: [PASS/FAIL]
"""

faithfulness_judge = create_llm_as_judge(
    prompt=FAITHFULNESS_PROMPT,
    model="anthropic:claude-sonnet-4-6",
)
```

Using PASS/FAIL instead of a 1–10 scale is deliberate — more on why in the next section.

Running evaluations against a dataset

Once you have a judge, wire it into evaluate() to run it against a LangSmith dataset. This is the right pattern for regression testing — run the judge against a golden dataset whenever your prompt or retrieval logic changes:

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

def run_rag_pipeline(inputs: dict) -> dict:
    # retrieve/generate are your own pipeline helpers:
    # fetch context, call the model, return the answer
    context = retrieve(inputs["question"])
    answer = generate(inputs["question"], context)
    return {"answer": answer, "context": context}

results = evaluate(
    run_rag_pipeline,
    data="rag-golden-dataset",       # dataset name in LangSmith
    evaluators=[faithfulness_judge],
    experiment_prefix="faithfulness-v2",
    client=client,
)
```

LangSmith records every run, score, and reasoning trace in the experiment view. You can diff pass rates between experiment versions to catch regressions before they reach production.
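What a CI gate on those pass rates might look like — a minimal sketch with illustrative verdict lists (in practice you'd pull the scores from the experiment results or the LangSmith API):

```python
def pass_rate(scores: list[bool]) -> float:
    """Fraction of judge verdicts that passed."""
    return sum(scores) / len(scores)

# Illustrative judge verdicts from two experiment versions
baseline_scores = [True, True, False, True, True, True, True, True, False, True]
candidate_scores = [True, False, False, True, True, False, True, True, False, True]

regression = pass_rate(baseline_scores) - pass_rate(candidate_scores)
print(f"baseline {pass_rate(baseline_scores):.0%} -> candidate {pass_rate(candidate_scores):.0%}")

# Fail the CI run when the pass rate drops more than 5 points
ci_passed = regression <= 0.05
```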

UI path: live trace monitoring

For continuous evaluation against production traffic, configure the judge directly in the LangSmith UI. Navigate to a project, open the Evaluators panel, click + Evaluator, choose LLM-as-Judge, define your prompt using the same {{input}}/{{output}}/{{reference}} syntax, and map the variables to the fields in your trace schema. The judge runs automatically on new observations.

Supported score types: categorical (PASS/FAIL, or custom labels) and numeric (continuous 0.0–1.0). Use categorical for binary quality gates; use numeric only when you need to track gradual drift over time.
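In the SDK, the same categorical/numeric choice shows up as evaluator configuration. A sketch, assuming the `continuous` and `feedback_key` parameters of openevals' `create_llm_as_judge` and a hypothetical tone prompt — verify the parameter names against your installed version:

```python
from openevals.llm import create_llm_as_judge

TONE_PROMPT = """
Evaluate how well the response matches a professional support tone.
User Query: {{input}}
AI Response: {{output}}
"""

# Categorical: binary quality gate (the default boolean verdict)
tone_gate = create_llm_as_judge(
    prompt=TONE_PROMPT,
    model="openai:gpt-4o",
    feedback_key="tone",
)

# Numeric: continuous 0.0-1.0 score, for tracking gradual drift over time
tone_drift = create_llm_as_judge(
    prompt=TONE_PROMPT,
    model="openai:gpt-4o",
    feedback_key="tone_drift",
    continuous=True,
)
```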

The Biases You Need to Know About

Deploying a judge without understanding its failure modes will give you misleadingly high scores. These are the biases that show up consistently in production:

| Bias | What Happens | Fix |
| --- | --- | --- |
| Verbosity bias | Longer responses score higher regardless of quality | Add "penalize unnecessary detail" to your criteria |
| Position bias | In pairwise comparisons, the judge favors the first-listed response | Randomize presentation order across runs |
| Self-enhancement bias | Model over-scores outputs from models in the same family | Use a different model family for judging than for generation |
| Non-determinism | Same input produces different scores across runs | Set temperature=0, use binary scales |
| Scale arbitrariness | 1–10 scores are inconsistent; a 7 means different things across runs | Use PASS/FAIL or 3-category scales |
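Of these, position bias is the most mechanical to neutralize in code. A minimal sketch of order randomization for pairwise comparisons — `judge_pair` is a hypothetical stand-in for your actual judge call, stubbed here to be maximally position-biased so the effect is visible:

```python
import random

def judge_pair(first: str, second: str) -> str:
    """Hypothetical judge call: returns 'first' or 'second'.
    Stubbed to always prefer whichever response is listed first,
    simulating a maximally position-biased judge."""
    return "first"

def debiased_compare(resp_a: str, resp_b: str, rounds: int = 10) -> dict:
    """Randomize presentation order each round and map verdicts back."""
    wins = {"a": 0, "b": 0}
    for _ in range(rounds):
        if random.random() < 0.5:
            winner = "a" if judge_pair(resp_a, resp_b) == "first" else "b"
        else:
            winner = "b" if judge_pair(resp_b, resp_a) == "first" else "a"
        wins[winner] += 1
    return wins

random.seed(0)
print(debiased_compare("response A", "response B"))
```

With randomization, even a fully position-biased judge spreads its wins roughly 50/50 instead of letting one response sweep every comparison — which is exactly the tell you want visible in your dashboards.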

The scale problem is the most common mistake I see teams make. A 1–10 numeric scale feels more expressive, but research shows it drops evaluator consistency from ~77% to ~65%. Binary scoring — PASS/FAIL with explicit definitions for each label — is the right default. Add granularity only when you have a concrete reason and have verified the judge's consistency on that dimension first.

The Rule: Use a different model family to judge than the one generating your outputs. An OpenAI judge evaluating OpenAI outputs will systematically score them higher than it would score equivalent outputs from Anthropic or Gemini. This isn't theoretical — it shows up in production dashboards as inflated baseline scores that mask real degradation.
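One way to enforce that rule mechanically is a family-to-judge lookup that refuses same-family pairings. A sketch with illustrative model identifiers — substitute the versions you actually run:

```python
# Map each generator model family to a judge from a different family.
JUDGE_FOR_FAMILY = {
    "openai": "anthropic:claude-sonnet-4-6",
    "anthropic": "openai:gpt-4o",
    "google": "openai:gpt-4o",
}

def pick_judge(generator_model: str) -> str:
    """Return a cross-family judge; refuse same-family or unknown pairings."""
    family = generator_model.split(":", 1)[0]
    judge = JUDGE_FOR_FAMILY.get(family)
    if judge is None or judge.split(":", 1)[0] == family:
        raise ValueError(f"No cross-family judge configured for {generator_model}")
    return judge

print(pick_judge("openai:gpt-4o"))  # anthropic:claude-sonnet-4-6
```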

Calibrating with Align Evals

A judge that isn't calibrated to your team's standards is just automation noise. LangSmith's Align Evals feature addresses this directly.

The workflow:

  1. Build a golden dataset — 20 to 50 examples covering strong outputs, weak outputs, and edge cases representative of your domain
  2. Manually score each example — your team labels the ground truth
  3. Run the judge and compare scores side-by-side with human labels in the LangSmith UI
  4. Identify systematic disagreements — for example, the judge over-scoring verbose responses or under-scoring terse-but-correct ones
  5. Refine the prompt — add clarifying criteria, negative examples, or explicit label definitions
  6. Measure the alignment score — LangSmith quantifies judge-to-human agreement; iterate until it's acceptable for your use case
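The alignment score in step 6 is, at its core, plain agreement arithmetic. A sketch of what's being measured, computed by hand over hypothetical golden-dataset labels:

```python
def alignment_score(human_labels: list[str], judge_labels: list[str]) -> float:
    """Fraction of examples where the judge agrees with the human label."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("Label lists must be the same length")
    agreements = sum(h == j for h, j in zip(human_labels, judge_labels))
    return agreements / len(human_labels)

# Hypothetical labels from steps 2 and 3
human = ["PASS", "PASS", "FAIL", "PASS", "FAIL", "PASS", "PASS", "FAIL"]
judge = ["PASS", "FAIL", "FAIL", "PASS", "FAIL", "PASS", "PASS", "PASS"]

print(f"Alignment: {alignment_score(human, judge):.0%}")  # Alignment: 75%
```

Inspecting the two disagreements — not just the aggregate number — is what tells you whether the judge has a systematic blind spot worth a prompt revision.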

LangSmith also ships a self-improving evaluator mechanism: when you correct a judge's verdict in the UI, that correction is stored as a few-shot example and automatically incorporated into future evaluation prompts. Over time the judge learns from your team's feedback without manual prompt engineering.

Perspective: Don't wait until you have a perfect golden dataset to start. Deploy a basic judge against live traffic first — imperfect calibration still catches obvious failures and builds the log of human corrections that Align Evals needs to improve. The first calibration pass with 20–30 labeled examples typically moves alignment by 10–15 percentage points.

Takeaway

LLM-as-Judge is the layer that makes AI pipelines observable in the dimension that matters most: whether the outputs are actually correct and useful, not just structurally valid.

Here's what carries forward:

  • Start with PASS/FAIL — don't over-engineer scales before you know the judge is calibrated. Binary scoring is more consistent and easier to track in dashboards.
  • Cross-family judging — if your application runs on Claude, judge with GPT-4o. If it runs on GPT-4o, judge with Claude. Self-enhancement bias is real and it inflates your baseline.
  • Calibrate early, not late — 20–50 manually labeled examples run through Align Evals before you trust the judge in CI will save you from acting on systematically wrong scores for weeks.
  • Judges catch regressions; humans catch edge cases — LLM-as-Judge is a regression signal, not a replacement for periodic human review. Run the judge continuously; audit outliers manually.
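That division of labor can be wired up directly: judge failures go straight to a human review queue. A minimal sketch with an illustrative record shape — adapt the fields to your trace schema:

```python
# Judge outputs, with the reasoning trace attached for the reviewer
runs = [
    {"id": "run-1", "verdict": "PASS", "reasoning": "Grounded in context."},
    {"id": "run-2", "verdict": "FAIL", "reasoning": "Cites a policy not in context."},
    {"id": "run-3", "verdict": "PASS", "reasoning": "Mostly grounded, one vague claim."},
]

def review_queue(records: list[dict]) -> list[dict]:
    """Everything the judge failed goes to a human reviewer."""
    return [r for r in records if r["verdict"] == "FAIL"]

for r in review_queue(runs):
    print(f"{r['id']}: {r['reasoning']}")
```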

Building an eval pipeline for your AI application? Let's talk.

