Observation-level evaluations bring operation-specific scoring to production monitoring: LLM-as-a-Judge evaluations can now run on individual observations (LLM calls, retrievals, tool executions, or any other operation within a trace). Previously, evaluations could only target entire traces.
Key Features:
- Evaluate only specific operations (e.g., final generation for helpfulness, retrieval step for relevance) rather than entire workflows
- Each evaluated operation receives its own score, attached to the corresponding observation in the trace tree
- Filter by observation type, name, metadata, or trace-level attributes to target exactly what matters
- Run different evaluators on different operations simultaneously
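The core idea can be sketched in plain Python: a trace is a tree of observations, and a score is attached to one specific node rather than to the trace as a whole. The data model and names below are illustrative assumptions, not the actual Langfuse API.

```python
from dataclasses import dataclass, field

# Illustrative data model only; field names are assumptions,
# not the actual Langfuse SDK types.
@dataclass
class Observation:
    id: str
    type: str                                   # e.g. "generation", "retrieval", "tool"
    name: str
    scores: dict = field(default_factory=dict)  # observation-level scores
    children: list = field(default_factory=list)

def walk(obs):
    """Yield every observation in the trace tree."""
    yield obs
    for child in obs.children:
        yield from walk(child)

def score_observations(root, predicate, score_name, judge):
    """Attach a score only to observations matching the predicate."""
    for obs in walk(root):
        if predicate(obs):
            obs.scores[score_name] = judge(obs)

# A trace with a retrieval step and a final generation.
trace = Observation("t1", "span", "rag-pipeline", children=[
    Observation("o1", "retrieval", "vector-search"),
    Observation("o2", "generation", "final-answer"),
])

# Evaluate only the final generation, not the whole workflow.
score_observations(
    trace,
    predicate=lambda o: o.type == "generation" and o.name == "final-answer",
    score_name="helpfulness",
    judge=lambda o: 0.9,  # placeholder for an LLM-as-a-Judge call
)
```

The score lands on the final-answer observation only; the retrieval step and the root span are untouched, which is what keeps evaluation volume and cost down.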
Benefits:
- Operation-level precision: Target final LLM responses, retrieval steps, or specific tool calls, reducing evaluation volume and cost
- Compositional evaluation: Stack observation filters (type, name, metadata) with trace-attribute filters (userId, sessionId, tags)
- Scalable architecture: Built for high-volume production workloads, so evaluation stays fast and reliable as trace volume grows
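The compositional filtering described above amounts to AND-ing small predicates over both the observation and its trace. The snippet below is a conceptual sketch in plain Python, not Langfuse's actual filter syntax.

```python
# Conceptual sketch of compositional filters (not Langfuse's actual
# filter syntax): each filter is a small predicate, and a target is
# the conjunction of observation-level and trace-level predicates.

def by_type(t):
    return lambda obs, trace: obs["type"] == t

def by_name(n):
    return lambda obs, trace: obs["name"] == n

def by_trace_attr(key, value):
    return lambda obs, trace: trace.get(key) == value

def matches(filters, obs, trace):
    return all(f(obs, trace) for f in filters)

# Target: generations named "final-answer" from one user's traces.
filters = [
    by_type("generation"),
    by_name("final-answer"),
    by_trace_attr("userId", "u42"),  # trace-level attribute
]

trace_attrs = {"userId": "u42", "sessionId": "s1"}
obs = {"type": "generation", "name": "final-answer"}
print(matches(filters, obs, trace_attrs))  # True
```

Because each filter is independent, different evaluators can be configured with different filter stacks and run over the same traces simultaneously.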
Requirements:
- SDK version: Python v3+ (OTel-based) or JS/TS v4+ (OTel-based)
- Use propagate_attributes() in your instrumentation so that trace-level attributes (e.g. userId, sessionId) are available when filtering observations
Existing trace-level evaluators continue to work unchanged; an upgrade guide is available for migration.