Categorical LLM-as-a-Judge Scores — Langfuse Changelog

LLM-as-a-Judge evaluators in Langfuse can now return categorical scores in addition to numeric ones. You can define a fixed set of allowed categories in the evaluator template, have the judge choose from them, and store the result as a native categorical score in Langfuse.

This is especially useful when the right answer is a discrete label rather than a point on a numeric scale:

Classify answers as correct, partially_correct, or incorrect
Mark support replies as resolved, needs_follow_up, or escalate
Label safety outcomes as safe, needs_review, or blocked

What's New:

Choose Numeric or Categorical when creating a custom LLM-as-a-Judge evaluator
Define the allowed category values directly in the evaluator template
Optionally allow multiple matches when more than one label applies; Langfuse creates one score per selected category
View categorical results in evaluator logs and reuse them across Langfuse's existing score tooling
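
The category and multi-match behavior described above can be sketched in a few lines of Python. This is an illustration only, not Langfuse's implementation or SDK API: `CategoricalScore`, `to_scores`, and the `allow_multiple` parameter are hypothetical names invented for this example. It assumes the judge's output has already been parsed into a list of label strings, and shows how a fixed set of allowed categories might be enforced and how a multi-match result could expand into one score per selected category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoricalScore:
    """Hypothetical stand-in for one stored categorical score."""
    name: str
    value: str


def to_scores(name, selected, allowed, allow_multiple=False):
    """Validate the judge's selected labels and emit one score per label."""
    # Deduplicate while preserving the order the judge returned.
    labels = list(dict.fromkeys(selected))
    if not labels:
        raise ValueError("judge returned no category")
    # Reject anything outside the fixed set defined for the evaluator.
    unknown = [label for label in labels if label not in allowed]
    if unknown:
        raise ValueError(f"labels outside the allowed set: {unknown}")
    if len(labels) > 1 and not allow_multiple:
        raise ValueError("multiple labels returned but multi-match is disabled")
    return [CategoricalScore(name=name, value=label) for label in labels]


SAFETY_LABELS = ("safe", "needs_review", "blocked")
SUPPORT_LABELS = ("resolved", "needs_follow_up", "escalate")

# Single-label case: exactly one score is produced.
single = to_scores("safety", ["needs_review"], SAFETY_LABELS)

# Multi-match case: two labels apply, so two scores are produced.
multi = to_scores(
    "support_outcome",
    ["needs_follow_up", "escalate"],
    SUPPORT_LABELS,
    allow_multiple=True,
)

print([s.value for s in single])
print([s.value for s in multi])
```

In this sketch, a label outside the allowed set raises an error instead of being stored, which keeps downstream filters and aggregations limited to known values. How Langfuse itself handles an out-of-set answer from the judge is not stated here.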