Validate evaluation reliability and uncover insights with in-depth score analysis. Score Analytics now provides tools for analyzing and comparing evaluation scores across your LLM application.
Key Features:
- Multi-Score Comparison: Compare any two scores of the same data type to validate evaluation reliability with correlation metrics, confusion matrices, and alignment patterns
- Statistical Validation: Measure agreement with Pearson correlation, Cohen's Kappa, F1 score, and other metrics, each shown with a badge indicator for quick interpretation
- Multi-Data Type Support: Analyze numeric (continuous), categorical (discrete), or boolean (binary) scores with type-appropriate visualizations
- Matched vs All Analysis: Toggle between matched data, to measure alignment between two scores, and all data, to inspect coverage and each score's individual distribution
- Temporal Insights: Track score evolution over time with configurable intervals to identify quality regressions or improvements
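To make the agreement metrics named above concrete, here is a minimal Python sketch of Pearson correlation (for numeric scores) and Cohen's Kappa (for categorical or boolean scores). It uses the textbook definitions only; the function names `pearson` and `cohen_kappa` are hypothetical and this is not necessarily how Score Analytics computes them internally.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numeric scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a score is constant")
    return cov / sqrt(var_x * var_y)


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa: agreement between two raters, corrected for chance."""
    n = len(labels_a)
    if n != len(labels_b) or n == 0:
        raise ValueError("need two non-empty lists of equal length")
    # Fraction of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)
    if expected == 1:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)


# Illustrative data only: an LLM judge vs. a human annotator.
judge = ["good", "bad", "good", "bad"]
human = ["good", "bad", "bad", "bad"]
kappa = cohen_kappa(judge, human)
```

A Kappa near 1 indicates strong agreement beyond chance, a value near 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.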
Use Cases: Validate LLM judge reliability, measure human-AI annotation agreement, identify coverage gaps, spot quality regressions, and discover feature relationships through score comparison.
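The difference between matched and all data, and how it exposes coverage gaps, can be pictured with a short Python sketch. It assumes each score is keyed by a shared identifier such as a trace ID; that keying, the function name `split_matched`, and the sample values are illustrative assumptions, not a description of how Score Analytics matches scores.

```python
def split_matched(scores_a, scores_b):
    """Pair up scores that share an ID and report coverage for both scores.

    scores_a, scores_b: dicts mapping a (hypothetical) trace ID to a score value.
    Returns (matched_pairs, coverage_counts).
    """
    matched_ids = sorted(scores_a.keys() & scores_b.keys())
    # Matched view: only items scored by both sources, usable for agreement metrics.
    matched_pairs = [(scores_a[i], scores_b[i]) for i in matched_ids]
    # Coverage: how many items each source scored that the other did not.
    coverage = {
        "matched": len(matched_ids),
        "only_a": len(scores_a.keys() - scores_b.keys()),
        "only_b": len(scores_b.keys() - scores_a.keys()),
    }
    return matched_pairs, coverage


# Illustrative data only.
llm_judge = {"t1": 1.0, "t2": 0.5, "t3": 0.0}
human = {"t2": 0.4, "t3": 0.1, "t4": 0.9}
pairs, coverage = split_matched(llm_judge, human)
```

In this sketch, agreement metrics would be computed on `pairs` alone, while the `only_a` and `only_b` counts show where one score has no counterpart.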