Langfuse shipped a major architectural shift toward observation-centric data modeling while expanding its evaluation capabilities. The v4 release rebuilt the core table structure to treat individual operations (LLM calls, tool executions, agent steps) as first-class query units instead of nesting them under traces, unlocking significantly faster filtering and aggregation at scale. This foundation enabled observation-level evaluations that score specific operations rather than entire workflows, plus boolean and categorical LLM-as-a-Judge scores that move beyond purely numeric scales. Experiments got a speed overhaul with faster UI loading and the ability to run against versioned datasets for reproducibility, while new tools such as the CLI and corrected outputs for human-in-the-loop workflows expanded how teams interact with production data.
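To make the shift concrete, here is a minimal sketch of what observation-first scoring looks like in practice. All names and the data model below are illustrative assumptions, not the actual Langfuse schema or SDK: the point is only that each operation is its own record, so a boolean or categorical score can target one step instead of the whole trace.

```python
from dataclasses import dataclass, field

# Hypothetical data model: each operation (LLM call, tool execution,
# agent step) is a first-class record rather than a node nested under
# a trace. These class and field names are illustrative only.

@dataclass
class Observation:
    id: str
    trace_id: str
    type: str          # e.g. "llm_call" | "tool_execution" | "agent_step"
    name: str
    scores: dict = field(default_factory=dict)

def score_observation(obs: Observation, name: str, value) -> None:
    """Attach a score to one operation instead of the whole workflow."""
    obs.scores[name] = value

# One agent run yields several observations; we grade only the
# retrieval step (boolean) and the final answer (categorical).
retrieval = Observation("obs-1", "trace-1", "tool_execution", "vector_search")
answer = Observation("obs-2", "trace-1", "llm_call", "final_answer")

score_observation(retrieval, "context_relevant", True)          # boolean
score_observation(answer, "correctness", "partially_correct")   # categorical

print(retrieval.scores)  # {'context_relevant': True}
print(answer.scores)     # {'correctness': 'partially_correct'}
```

Because scores hang off the operation rather than the trace, filtering "all tool executions with context_relevant == False" becomes a flat query instead of a walk through nested trace trees.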
Langfuse shipped its v4 architecture redesign centered on observations, rather than traces, as the primary data unit, enabling significantly faster performance at scale. The unified observations table replaced the previous trace-centric model, with Observations API v2 and Metrics API v2 designed for quicker filtering and aggregation across dashboards, evaluations, and large datasets. LLM-as-a-Judge evaluators graduated to support categorical scores alongside numeric ones, letting you classify outcomes such as correct/partially_correct/incorrect directly in templates. Dashboard computation shifted to observation-level metrics, with score histograms now computed server-side and high-cardinality dimension queries requiring explicit top-N limits.
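The top-N requirement for high-cardinality dimensions can be sketched in a few lines. This is a plain-Python illustration of the behavior described, under the assumption that groups beyond the limit are collapsed into an "other" bucket; it is not the actual Langfuse query API, and the function and bucket names are invented for the example.

```python
from collections import Counter

def top_n_group_counts(rows: list[dict], dimension: str, n: int) -> dict:
    """Group rows by a dimension, keep the N largest groups,
    and collapse the remainder into an '<other>' bucket.

    Illustrates why high-cardinality dimensions (model, user_id, ...)
    need an explicit top-N limit: without it, a dashboard query could
    return thousands of tiny groups.
    """
    counts = Counter(row[dimension] for row in rows)
    top = counts.most_common(n)
    other = sum(counts.values()) - sum(c for _, c in top)
    result = dict(top)
    if other:
        result["<other>"] = other
    return result

# Toy observation rows grouped by a hypothetical "model" dimension.
rows = (
    [{"model": "gpt-4o"}] * 5
    + [{"model": "claude-sonnet"}] * 3
    + [{"model": "llama-3"}] * 2
    + [{"model": "mistral"}] * 1
)
print(top_n_group_counts(rows, "model", n=2))
# {'gpt-4o': 5, 'claude-sonnet': 3, '<other>': 3}
```

Pushing this aggregation server-side, alongside the server-computed score histograms, keeps the payload bounded no matter how many distinct values the dimension holds.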