Observation-level evaluations bring operation-specific scoring to production monitoring: LLM-as-a-Judge evaluations can now run on individual observations (LLM calls, retrievals, tool executions, or any other operation within a trace). Previously, evaluations could only target entire traces.
Key Features:
- Evaluate only specific operations (e.g., final generation for helpfulness, retrieval step for relevance) rather than entire workflows
- Each evaluated operation receives its own score, attached to the corresponding observation in the trace tree
- Filter by observation type, name, metadata, or trace-level attributes to target exactly what matters
- Run different evaluators on different operations simultaneously
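The core idea can be sketched in plain Python: a trace is a tree of observations, and a score is attached to one specific node rather than to the trace as a whole. The data model and names below are illustrative assumptions, not the actual Langfuse API.

```python
from dataclasses import dataclass, field

# Illustrative data model only; field names are assumptions,
# not the actual Langfuse SDK types.
@dataclass
class Observation:
    id: str
    type: str                                   # e.g. "generation", "retrieval", "tool"
    name: str
    scores: dict = field(default_factory=dict)  # observation-level scores
    children: list = field(default_factory=list)

def walk(obs):
    """Yield every observation in the trace tree."""
    yield obs
    for child in obs.children:
        yield from walk(child)

def score_observations(root, predicate, score_name, judge):
    """Attach a score only to observations matching the predicate."""
    for obs in walk(root):
        if predicate(obs):
            obs.scores[score_name] = judge(obs)

# A trace with a retrieval step and a final generation.
trace = Observation("t1", "span", "rag-pipeline", children=[
    Observation("o1", "retrieval", "vector-search"),
    Observation("o2", "generation", "final-answer"),
])

# Evaluate only the final generation, not the whole workflow.
score_observations(
    trace,
    predicate=lambda o: o.type == "generation" and o.name == "final-answer",
    score_name="helpfulness",
    judge=lambda o: 0.9,  # placeholder for an LLM-as-a-Judge call
)
```

The score lands on the final-answer observation only; the retrieval step and the root span are untouched, which is what keeps evaluation volume and cost down.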
Benefits:
- Operation-level precision: Target final LLM responses, retrieval steps, or specific tool calls, reducing evaluation volume and cost
- Compositional evaluation: Stack observation filters (type, name, metadata) with trace-attribute filters (userId, sessionId, tags)
- Scalable architecture: Built for high-volume production workloads, so evaluation stays fast and reliable as trace volume grows
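The compositional filtering described above amounts to AND-ing small predicates over both the observation and its trace. The snippet below is a conceptual sketch in plain Python, not Langfuse's actual filter syntax.

```python
# Conceptual sketch of compositional filters (not Langfuse's actual
# filter syntax): each filter is a small predicate, and a target is
# the conjunction of observation-level and trace-level predicates.

def by_type(t):
    return lambda obs, trace: obs["type"] == t

def by_name(n):
    return lambda obs, trace: obs["name"] == n

def by_trace_attr(key, value):
    return lambda obs, trace: trace.get(key) == value

def matches(filters, obs, trace):
    return all(f(obs, trace) for f in filters)

# Target: generations named "final-answer" from one user's traces.
filters = [
    by_type("generation"),
    by_name("final-answer"),
    by_trace_attr("userId", "u42"),  # trace-level attribute
]

trace_attrs = {"userId": "u42", "sessionId": "s1"}
obs = {"type": "generation", "name": "final-answer"}
print(matches(filters, obs, trace_attrs))  # True
```

Because each filter is independent, different evaluators can be configured with different filter stacks and run over the same traces simultaneously.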
Requirements:
- SDK version: Python v3+ (OTel-based) or JS/TS v4+ (OTel-based)
- Use propagate_attributes() in your instrumentation so that trace-level attributes (e.g. userId, sessionId) are available when filtering observations
Existing trace-level evaluators continue to work unchanged; an upgrade guide is available for migration.