Langfuse
Observations are now reachable via MCP (Model Context Protocol), letting agents and external tools query individual LLM calls, tool executions, and agent steps directly.1
Production monitoring is getting its own surface — a monitors schema and service landed alongside pre-built evaluator templates for production regression triage. The new observation type filter in dashboard widgets makes it easier to slice metrics by span type without writing custom queries.2
Experiments graduated to a first-class feature — they now live alongside Datasets rather than nested beneath them. Rebuilt screens support standalone runs without requiring a linked dataset, expose SDK-triggered experiments in the UI, and display visual deltas on scores, cost, and latency against a configurable baseline.3 Both the Python SDK and JS SDK gained experiment runner context to feed results back into these screens.
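To make the baseline comparison concrete, here is a minimal sketch of how deltas on score, cost, and latency could be computed for one run against a baseline. It is plain Python, not Langfuse's implementation: `RunSummary`, `delta`, `compare`, and the field names are hypothetical, and the numbers are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical per-run aggregates; field names are assumptions."""
    avg_score: float
    total_cost_usd: float
    avg_latency_s: float


def delta(current: float, baseline: float) -> tuple[float, Optional[float]]:
    """Return (absolute change, percent change) relative to the baseline.

    Percent change is None when the baseline is zero, since it is undefined.
    """
    absolute = current - baseline
    percent = None if baseline == 0 else absolute / baseline * 100
    return absolute, percent


def compare(run: RunSummary, baseline: RunSummary) -> dict:
    """Compute the delta for every metric of a run against the baseline run."""
    fields = ("avg_score", "total_cost_usd", "avg_latency_s")
    return {f: delta(getattr(run, f), getattr(baseline, f)) for f in fields}


# Illustrative values only.
baseline = RunSummary(avg_score=0.80, total_cost_usd=2.0, avg_latency_s=4.0)
candidate = RunSummary(avg_score=0.90, total_cost_usd=1.5, avg_latency_s=5.0)
report = compare(candidate, baseline)
```

In this made-up example the candidate scores higher and costs less than the baseline but is slower, which is the kind of trade-off a per-metric delta view is meant to surface.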
LLM-as-a-Judge evaluators now cover three verdict types — numeric (existing), boolean for true/false decisions (detecting out-of-scope requests, user disagreement, or safety flags), and categorical for fixed label sets such as correct/partially_correct/incorrect.4 All three flow into existing dashboards, filters, and score analytics.5
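The three verdict types can be pictured as three shapes of score value. The sketch below is a plain-Python illustration rather than the Langfuse SDK: `classify_verdict` and `ALLOWED_LABELS` are hypothetical names, and the label set is simply the example given above.

```python
from typing import Union

# Example label set from the text above; a real evaluator would define its own.
ALLOWED_LABELS = ("correct", "partially_correct", "incorrect")

Verdict = Union[bool, int, float, str]


def classify_verdict(value: Verdict) -> str:
    """Map a judge's raw output to one of the three verdict types."""
    # bool must be checked first: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, str):
        if value not in ALLOWED_LABELS:
            raise ValueError(f"label {value!r} is not in the fixed label set")
        return "categorical"
    raise TypeError(f"unsupported verdict type: {type(value).__name__}")


kinds = [classify_verdict(v) for v in (0.85, True, "partially_correct")]
```

Rejecting labels outside the fixed set is the design choice that keeps categorical scores filterable: a free-text verdict would fragment into many near-duplicate categories.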
Self-service SSO and a Tokyo region removed two common blockers — organizations can configure SSO through a self-service flow that verifies ownership via DNS.6 A dedicated Japan cloud region keeps trace, prompt, and evaluation data inside Japan without routing through other regions.7 The same release hardened outbound URL validation against SSRF bypasses.
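For readers unfamiliar with SSRF hardening, the general technique is to resolve the target host and reject any URL that points at a non-public address. The following is a generic sketch using only the Python standard library, not Langfuse's actual validation code; `is_safe_outbound_url` and its injectable `resolve` parameter are hypothetical.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_safe_outbound_url(url: str, resolve=socket.getaddrinfo) -> bool:
    """Return True only if every address the host resolves to is public.

    `resolve` is injectable so the check can be tested without network access.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        default_port = 443 if parsed.scheme == "https" else 80
        infos = resolve(parsed.hostname, parsed.port or default_port)
    except (OSError, ValueError):
        # DNS failure or an invalid port: treat as unsafe.
        return False
    if not infos:
        return False
    for info in infos:
        try:
            addr = ipaddress.ip_address(info[4][0])
        except ValueError:
            return False
        # is_global is False for loopback, private, link-local, and
        # other reserved ranges.
        if not addr.is_global:
            return False
    return True
```

A check like this is only one layer. It does not by itself stop DNS rebinding, where the host resolves to a different address at request time, so a real implementation would also need to pin the connection to the validated address and re-validate on redirects.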
The v4 observation-centric model is now the default for new accounts — every LLM call, tool execution, and agent step lands as a row in the observations table. Blob export field groups are now configurable per export job, and email verification on signup shipped to Cloud.8