Score Analytics with Multi-Score Comparison — Langfuse Changelog

Validate evaluation reliability and uncover insights with comprehensive score analysis. Score Analytics now provides tools for analyzing and comparing evaluation scores across your LLM application.

Key Features:

Multi-Score Comparison: Compare any two scores of the same data type to validate evaluation reliability with correlation metrics, confusion matrices, and alignment patterns
Statistical Validation: Measure agreement with Pearson correlation, Cohen's Kappa, F1 scores, and other metrics, each shown with a badge indicator for quick interpretation
Multi-Data Type Support: Analyze numeric (continuous), categorical (discrete), or boolean (binary) scores with type-appropriate visualizations
Matched vs All Analysis: Toggle between matched data, to measure alignment, and all data, to view coverage and individual distributions
Temporal Insights: Track score evolution over time with configurable intervals to identify quality regressions or improvements
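
To make the agreement metrics above concrete, here is a minimal sketch of how Pearson correlation and Cohen's Kappa can be computed over matched scores. It is not Langfuse's implementation: the function names (`match_scores`, `pearson`, `cohens_kappa`) are hypothetical, and it assumes that "matched" means two scores attached to the same key, such as a trace ID. The formulas are the standard textbook definitions.

```python
from collections import Counter
from math import sqrt


def match_scores(a, b):
    """Pair values from two score sets by shared key (assumed: a trace ID).

    Keys present in only one set are dropped; they matter for coverage,
    not for agreement.
    """
    shared = sorted(a.keys() & b.keys())
    return [(a[k], b[k]) for k in shared]


def pearson(pairs):
    """Pearson correlation for numeric pairs; None when it is undefined."""
    n = len(pairs)
    if n == 0:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # one score is constant, so correlation is undefined
    return cov / sqrt(var_x * var_y)


def cohens_kappa(pairs):
    """Cohen's Kappa for categorical pairs: agreement corrected for chance."""
    n = len(pairs)
    if n == 0:
        return None
    observed = sum(1 for x, y in pairs if x == y) / n
    counts_x = Counter(x for x, _ in pairs)
    counts_y = Counter(y for _, y in pairs)
    labels = counts_x.keys() | counts_y.keys()
    expected = sum(counts_x[c] * counts_y[c] for c in labels) / (n * n)
    if expected == 1:
        return None  # both raters always use the same single label
    return (observed - expected) / (1 - expected)


# Hypothetical numeric scores keyed by trace ID; "t4" has no counterpart.
judge = {"t1": 0.2, "t2": 0.4, "t3": 0.6, "t4": 0.9}
human = {"t1": 0.4, "t2": 0.8, "t3": 1.2}
matched = match_scores(judge, human)  # 3 matched pairs out of 4 judge scores
r = pearson(matched)                  # approximately 1.0 (perfectly linear)

# Hypothetical categorical labels from two annotators.
labels = [("good", "good"), ("good", "bad"), ("bad", "bad"), ("bad", "bad")]
kappa = cohens_kappa(labels)          # 0.5
```

Comparing `len(matched)` with the size of each input is the coverage question that the "all data" view answers, while the correlation and Kappa values only make sense on the matched pairs.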

Use Cases: Validate LLM judge reliability, measure human-AI annotation agreement, identify coverage gaps, spot quality regressions, and discover feature relationships through score comparison.
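
As an illustration of the LLM judge validation use case, the sketch below builds a confusion matrix and an F1 score for boolean scores, treating human annotations as the reference. This is a hedged example under stated assumptions, not the product's code: `confusion_counts` and `f1_score` are hypothetical names, and the sample data is made up for demonstration.

```python
def confusion_counts(pairs):
    """Count outcomes for (human, judge) boolean pairs.

    The human label is treated as the reference, which is an assumption
    of this sketch.
    """
    return {
        "tp": sum(1 for h, j in pairs if h and j),
        "fp": sum(1 for h, j in pairs if not h and j),
        "fn": sum(1 for h, j in pairs if h and not j),
        "tn": sum(1 for h, j in pairs if not h and not j),
    }


def f1_score(counts):
    """F1 = 2*TP / (2*TP + FP + FN); None when there are no positives at all."""
    denom = 2 * counts["tp"] + counts["fp"] + counts["fn"]
    if denom == 0:
        return None
    return 2 * counts["tp"] / denom


# Made-up matched pairs: (human_label, judge_label)
pairs = [(True, True), (True, True), (True, False), (False, True), (False, False)]
counts = confusion_counts(pairs)  # {"tp": 2, "fp": 1, "fn": 1, "tn": 1}
f1 = f1_score(counts)             # 4 / 6, roughly 0.667
```

A low F1 with many false positives would suggest the judge is more lenient than the human annotators. Many false negatives would suggest the opposite.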
