Rebuilt experiment screens with faster loading, standalone access, and enhanced filtering for efficient analysis.
Run A/B tests between model versions, compare evaluation scores across prompt variants, or triage regressions with quicker feedback loops. Currently in open beta on Langfuse Cloud only.
LLM-as-a-Judge evaluators can now return boolean scores for true / false decisions. This makes it easier to model simple binary decisions directly as native boolean scores and analyze them across existing score tooling.
Key features:
- Select Boolean when creating a custom LLM-as-a-Judge evaluator
- Judge decisions are stored as native boolean scores (true / false)
- Use cases: detect User Disagreement, Out-of-Scope Requests, or Insufficient Answers as true/false decisions

Boolean scores complement the existing numeric and categorical score types.
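As a rough sketch of what a boolean score looks like as data, the helper below builds a score record with `dataType: "BOOLEAN"`. The field names (`traceId`, `name`, `value`, `dataType`) are assumptions modeled on Langfuse's public scores API — verify them against the API reference before using them in anger.

```python
# Hypothetical sketch of a boolean score record; field names are assumed
# from the public scores API, not verified against it.
def boolean_score(trace_id: str, name: str, value: bool) -> dict:
    return {
        "traceId": trace_id,
        "name": name,
        "value": value,          # the judge's true / false decision
        "dataType": "BOOLEAN",   # stored as a native boolean score
    }

score = boolean_score("trace-123", "user_disagreement", False)
print(score["dataType"])  # BOOLEAN
```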
Detailed reference for how dashboards behave differently in Langfuse v4 with the new observation-centric data model. Key changes include:
- Trace counts are computed as `uniq(trace_id)` from the wide observations table instead of counting rows in the traces table

Read the full reference page for all changes and expected numerical differences when switching to the new observation-centric data model.
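The counting change can be illustrated with a small local sketch (illustrative data, not Langfuse code): in the observation-centric model, a trace with many observations still counts once, because trace totals come from distinct trace IDs over observation rows.

```python
# Illustrative sketch: v4 dashboards count traces as uniq(trace_id)
# over the wide observations table, not rows in a separate traces table.
observations = [
    {"id": "obs-1", "trace_id": "t-1"},
    {"id": "obs-2", "trace_id": "t-1"},  # same trace, second observation
    {"id": "obs-3", "trace_id": "t-2"},
]

# v4-style trace count: distinct trace ids across observation rows
trace_count = len({o["trace_id"] for o in observations})
print(trace_count)  # 2 — t-1 counts once despite two observations
```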
LLM-as-a-Judge evaluators in Langfuse can now return categorical scores in addition to numeric ones. You can define a fixed set of allowed categories in the evaluator template, have the judge choose from them, and store the result as a native categorical score in Langfuse.
This is especially useful when the right answer is a label instead of a gradient:
- correct, partially_correct, or incorrect
- resolved, needs_follow_up, or escalate
- safe, needs_review, or blocked

What's New:
- Select Numeric or Categorical when creating a custom LLM-as-a-Judge evaluator

Langfuse rolled out a simplified architecture with significantly faster product performance at scale. The main table is now observations — every LLM call, tool execution, and agent step is a row you can query directly, replacing the previous trace-centric view.
See Fast Preview Docs for rollout details, access, and migration steps.
Fully use Langfuse from the CLI. Built for AI agents and power users.
A new CLI that wraps the entire Langfuse API, allowing agents and humans to interact with Langfuse directly from the terminal. Covers all API functionality including traces, observations, prompts, datasets, scores, sessions, metrics, and more.
Key capabilities:
- All commands follow the pattern `npx langfuse-cli api <resource> <action>`
Observation-level evaluations enable precise operation-specific scoring for production monitoring. LLM-as-a-Judge evaluations can now be run on individual observations—LLM calls, retrievals, tool executions, or any operation within your traces. Previously, evaluations could only be run on entire traces.
Existing trace-level evaluators continue to work, with an upgrade guide available for migration.
Fetch datasets at specific version timestamps and run experiments directly on versioned datasets across UI, API, and SDKs for full reproducibility.
Retrieve datasets as they existed at any timestamp via Python SDK, JS/TS SDK, or Langfuse UI. By default, APIs return the latest version. Navigate to Datasets → Select dataset → Items Tab → Toggle Version view to browse all historical versions.
Execute experiments against specific dataset versions using the experiment runner or via UI. When running experiments in the UI: (1) Navigate to Run Prompt Experiment, (2) Select your dataset, (3) Choose a version from the Dataset Version dropdown, (4) The experiment runs against that specific dataset state, (5) If no version is selected, runs against latest version.
This completes the dataset versioning feature released in December.
Added the ability to capture improved versions of LLM outputs directly in trace views. Domain experts can now add corrected outputs to traces and observations, with diff views comparing original and corrected outputs. Corrections are accessible via the API as scores with dataType: "CORRECTION".
Anchor comments to specific text selections within trace and observation input, output, and metadata fields. Similar to Google Docs, you can now select text in input, output, or metadata fields, click the hovering comment button, and add comments anchored to that selection. Text-specific comments make it easy to discuss parts of LLM responses and assign them to people via mentions. This feature is available in the "JSON Beta" view.
Add filtering, table columns, and dashboard widgets for analyzing tool usage in your LLM applications. Tool calls are now a distinct data structure in Langfuse, allowing you to filter observations by available and called tools and build dashboards for tool calls over time.
Tool Call Filters on the Observations Table:
Dashboard Widgets for Tool Calls:
- Aggregate `toolDefinitions` and `toolCalls` with user-selectable aggregation (SUM, AVG, MAX, MIN, percentiles)
- Use `toolNames` to break down metrics by tool name

Supported Frameworks: OpenAI, Langchain/LangGraph, Vercel AI SDK, Google ADK/Vertex AI, Microsoft Agent Framework. Pydantic AI and additional frameworks coming in future releases.
Note: This feature only works with traces ingested after this release and does not apply to historical data.
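The per-tool breakdown these widgets provide can be sketched locally. The records below are simplified stand-ins, not the exact Langfuse schema; the sketch just shows the kind of "tool calls by tool name" aggregation a dashboard widget computes.

```python
from collections import Counter

# Illustrative observation records with tool calls; shapes are simplified,
# not the exact Langfuse data structure.
observations = [
    {"id": "obs-1", "toolCalls": [{"name": "web_search"}, {"name": "calculator"}]},
    {"id": "obs-2", "toolCalls": [{"name": "web_search"}]},
    {"id": "obs-3", "toolCalls": []},  # observation without tool calls
]

# Break down tool calls by tool name, like a widget summing over toolNames
calls_by_tool = Counter(
    call["name"] for obs in observations for call in obs["toolCalls"]
)
print(calls_by_tool)  # Counter({'web_search': 2, 'calculator': 1})
```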
New high-performance v2 APIs for metrics and observations with cursor-based pagination, selective field retrieval, and optimized data architecture.
Why v2? The v1 /public/traces and /public/observations endpoints have been among the most resource-intensive APIs to serve. v2 addresses the underlying issues: there was no way to request partial data, offset pagination doesn't scale, and parsing large JSON payloads adds overhead. v2 is built on an optimized immutable data model that requires fewer joins and eliminates deduplication at query time.
Metrics API v2 (GET /api/public/v2/metrics) - Built on optimized data model with significantly faster query performance. The traces view is no longer available in v2; use the observations view instead. Default limit of 100 rows per query. High cardinality dimensions like id, traceId, userId, and sessionId can no longer be used for grouping in v2, though remain available for filtering.
Observations API v2 (GET /api/public/v2/observations) - Redesigned endpoint with selective field retrieval (specify field groups: core, basic, io, usage), cursor-based pagination, optimized I/O handling (returns strings by default; set parseIoAsJson: true when needed), and stricter limits (default 50, max 1,000).
Migration Notes: v2 APIs are additive - v1 endpoints remain available. Update API calls to use /api/public/v2/ prefix, use fields parameter to specify field groups, replace page-based pagination with cursor-based pagination, and note that parseIoAsJson defaults to false in v2 (v1 always parsed as JSON).
Status: currently in beta on Langfuse Cloud only; a self-hosted migration path is in development. With current SDK versions, data may take ~5 minutes to appear on v2 endpoints; updated SDK versions are coming soon.
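The cursor-pagination pattern v2 uses can be sketched generically. The loop below takes a `fetch_page` callable standing in for an HTTP call to the v2 observations endpoint; the response field carrying the next cursor (`nextCursor` here) is an assumption — check the API reference for the real field name.

```python
from typing import Callable, Iterator, Optional

def iterate_pages(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Generic cursor-pagination loop: pass the returned cursor back to the
    server until it reports no further page."""
    cursor = None
    while True:
        page = fetch_page(cursor)        # e.g. GET .../v2/observations?cursor=...
        yield from page["items"]
        cursor = page.get("nextCursor")  # field name assumed, not verified
        if cursor is None:
            break

# Fake in-memory "API" for demonstration: three items served in pages of two
def fake_fetch(cursor):
    data = {None: {"items": [1, 2], "nextCursor": "c2"},
            "c2": {"items": [3], "nextCursor": None}}
    return data[cursor]

print(list(iterate_pages(fake_fetch)))  # [1, 2, 3]
```

Unlike offset pagination, each request only needs the opaque cursor from the previous response, which is why it scales on large tables.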
Track dataset changes over time with automatic versioning on every addition, update, or deletion of dataset items.
Complete audit trail: View full history of changes at item level to understand what changed and when. Identify unintended edits and revert problematic changes.
Experiment reproducibility: Experiments are automatically tied to the exact dataset state at run time. When dataset items are modified after running an experiment, previous results remain tied to the dataset version they actually ran against.
Dataset evolution: Track how gold-label datasets improve as domain experts refine expected outputs. See exactly what changed and how it affects benchmark results.
Every addition, update, or deletion of dataset items creates a new dataset version identified by timestamp. Includes item-level versioning with full history and diffs, and dataset-level metadata tracking high-level changes.
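The "dataset as it existed at a timestamp" semantics can be sketched as picking the latest version created at or before the requested time. This is illustrative logic, not Langfuse internals:

```python
from datetime import datetime

# Illustrative version history: each entry is (version timestamp, dataset items).
versions = [
    (datetime(2025, 1, 1), ["item-a"]),
    (datetime(2025, 2, 1), ["item-a", "item-b"]),        # item added
    (datetime(2025, 3, 1), ["item-a", "item-b-fixed"]),  # item updated
]

def dataset_as_of(versions, when: datetime):
    """Return the latest version created at or before `when`."""
    eligible = [items for ts, items in versions if ts <= when]
    if not eligible:
        raise ValueError("no version exists at that timestamp")
    return eligible[-1]

print(dataset_as_of(versions, datetime(2025, 2, 15)))  # ['item-a', 'item-b']
```

This is also why experiment results stay tied to the version they ran against: resolving the run's timestamp always yields the same snapshot.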
Coming soon: API support for fetching datasets at specific version timestamps and SDK support for running experiments on specific dataset versions.
Langfuse now supports OpenAI GPT-5.2 with day 1 support across all major features. OpenAI has released GPT-5.2, bringing enhanced capabilities and performance improvements to their model lineup.
Start experimenting with GPT-5.2 in the Langfuse LLM playground, leverage it to power LLM-as-a-judge evaluations, and monitor usage with comprehensive cost tracking capabilities.
Select multiple observations from the observations table and add them to a new or existing dataset with flexible field mapping.
Langfuse now supports pricing tiers for models with context-dependent pricing, enabling accurate cost calculation. Some model providers charge different rates depending on the number of input tokens used. For example, Anthropic's Claude Sonnet 4.5 with 1M context window, Google's Gemini 2.5 Pro and Gemini 3 Pro Preview all apply higher pricing when more than 200K input tokens are used.
How it works: Pricing tiers allow multiple price points for a single model, each with conditions that determine when that tier applies. Tiers are evaluated in priority order, and the first matching tier is used for cost calculation.
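The tier-selection rule described above (tiers checked in priority order, first match wins) can be sketched as follows. The 200K-token threshold mirrors the examples in this entry, but the tier shape, field names, and prices are illustrative:

```python
# Illustrative pricing tiers: evaluated in priority order, first match wins.
# Prices and structure are made up for the sketch, not real model pricing.
TIERS = [
    {"name": "long-context", "priority": 1,
     "applies": lambda usage: usage["input_tokens"] > 200_000,
     "input_price": 6.00 / 1_000_000},  # higher rate beyond 200K input tokens
    {"name": "standard", "priority": 2,
     "applies": lambda usage: True,     # fallback tier always matches
     "input_price": 3.00 / 1_000_000},
]

def input_cost(usage: dict) -> float:
    for tier in sorted(TIERS, key=lambda t: t["priority"]):
        if tier["applies"](usage):
            return usage["input_tokens"] * tier["input_price"]
    raise ValueError("no tier matched")

print(round(input_cost({"input_tokens": 100_000}), 4))  # 0.3 (standard tier)
print(round(input_cost({"input_tokens": 300_000}), 4))  # 1.8 (long-context tier)
```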
Pre-configured models: pricing tiers are now pre-configured for several models, including those with context-dependent pricing mentioned above.
Custom pricing tiers: You can also define pricing tiers for your own custom models via the Langfuse UI or API. Each tier includes a name, priority, conditions that determine when the tier applies, and prices for cost per usage type.
Langfuse now includes a native Model Context Protocol (MCP) server with write capabilities, enabling AI agents to fetch and update prompts directly.
The native MCP server is available at /api/public/mcp and provides five tools for comprehensive prompt management:
Read Operations:
- `getPrompt` – Fetch specific prompts by name, label, or version
- `listPrompts` – Browse all prompts with filtering and pagination

Write Operations:

- `createTextPrompt` – Create new text prompt versions
- `createChatPrompt` – Create new chat prompt versions (OpenAI-style messages)
- `updatePromptLabels` – Manage labels across prompt versions

The native server uses a stateless architecture built directly into the platform with BasicAuth credentials, replacing the previous node package version. A complete setup guide and prompt-specific workflows documentation are included.
Langfuse now supports OpenAI GPT-5.1 with day 1 support for the LLM playground, LLM-as-a-judge evaluations, and cost tracking. Start experimenting with GPT-5.1 in the playground, use it to power LLM-as-a-judge evaluations, and monitor usage and spend out of the box.
Use slashes in dataset names to create folders for better organization. Put a slash (/) in the dataset name to create a folder. For example, folder1/dataset1 will create a folder named folder1 and a dataset named dataset1 inside it. The folders are virtual so the full dataset name remains folder1/dataset1 if you want to access it via the SDK or API.
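Since folders are virtual, the folder view is just a parse of the full name. A one-line sketch of the split (illustrative, not Langfuse code):

```python
def split_dataset_name(full_name: str) -> tuple:
    """Virtual folders: everything before the last '/' is the folder path,
    the rest is the display name. The full name remains the identifier
    used by the SDK and API."""
    folder, _, name = full_name.rpartition("/")
    return folder, name

print(split_dataset_name("folder1/dataset1"))  # ('folder1', 'dataset1')
print(split_dataset_name("dataset2"))          # ('', 'dataset2')
```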
Define JSON schemas for your dataset inputs and expected outputs to ensure data quality and consistency across your test datasets.
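A minimal sketch of the idea: validating a dataset item against a JSON-Schema-style spec. The real feature validates against full JSON Schema definitions; the hand-rolled check below (required keys plus expected types) only shows the intent.

```python
# Minimal JSON-Schema-like check for dataset items (illustrative; the actual
# feature uses full JSON Schema, not this hand-rolled structure).
INPUT_SCHEMA = {
    "required": {"question": str, "context": str},
}

def validate_item(item: dict, schema: dict) -> list:
    """Return a list of validation errors (empty means the item is valid)."""
    errors = []
    for key, expected_type in schema["required"].items():
        if key not in item:
            errors.append(f"missing required field: {key}")
        elif not isinstance(item[key], expected_type):
            errors.append(f"{key} must be {expected_type.__name__}")
    return errors

print(validate_item({"question": "What is Langfuse?", "context": "docs"}, INPUT_SCHEMA))  # []
print(validate_item({"question": 42}, INPUT_SCHEMA))
# ['question must be str', 'missing required field: context']
```

Running validation at ingestion time is what keeps malformed items out of gold-label test sets before they skew benchmark results.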