Langfuse Changelog

Apr 13, 2026
Compare Experiments Faster

Rebuilt experiment screens with faster loading, standalone access, and enhanced filtering for efficient analysis.

Key improvements:

  • Faster loading and filtering, leveraging the observation-centric data model
  • Standalone experiments that don't require linked datasets; SDK-based experiments now visible in UI
  • Polished UI with visual deltas on scores, cost, and latency, baseline comparison, and score threshold filtering

Run A/B tests between model versions, compare evaluation scores across prompt variants, or triage regressions with quicker feedback loops. Currently in open beta on Langfuse Cloud only.

Apr 8, 2026
Boolean Scores for LLM-as-a-Judge Evaluators

LLM-as-a-Judge evaluators can now return boolean scores for true / false decisions. This makes it easier to model simple binary decisions directly as native boolean scores and analyze them across existing score tooling.

Key features:

  • Choose Boolean when creating a custom LLM-as-a-Judge evaluator
  • Store true / false outcomes as native boolean scores
  • Analyze boolean evaluator outputs in dashboards, filters, and score analytics

Use cases: Detect User Disagreement, Out-of-Scope Requests, or Insufficient Answers as true/false decisions. Boolean scores complement existing numeric and categorical score types.

Mar 23, 2026
Updates to Dashboards

Detailed reference for how dashboards behave differently in Langfuse v4 with the new observation-centric data model. Key changes include:

  • Trace counts are now computed as uniq(trace_id) from the wide observations table instead of counting rows in the traces table
  • "Traces by time" is replaced by "Observations by time", reflecting the shift to observations as the primary unit
  • Score histograms are now computed server-side, including all scores regardless of dataset size
  • Trace names fall back to the root observation name for OTEL-native ingestion, so previously unnamed traces now display correctly
  • High-cardinality dimensions (like userId) now require top-N queries with explicit limits
  • NULL and empty strings are treated as equivalent in filters

Read the full reference page for all changes and expected numerical differences when switching to the new observation-centric data model.
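The trace-count change can be sketched with a few lines of Python (the rows below are made up for illustration; the actual computation happens in Langfuse's query layer):

```python
# Sketch: in v4, trace counts are computed as the number of distinct
# trace_id values in the wide observations table, rather than by
# counting rows in a separate traces table. Data is illustrative only.
observations = [
    {"id": "obs-1", "trace_id": "t-1", "type": "GENERATION"},
    {"id": "obs-2", "trace_id": "t-1", "type": "SPAN"},
    {"id": "obs-3", "trace_id": "t-2", "type": "GENERATION"},
]

# Equivalent of uniq(trace_id) over the observations table.
trace_count = len({o["trace_id"] for o in observations})
print(trace_count)  # 2 traces across 3 observations
```

Because deduplication happens at query time over observations, numbers can differ slightly from the old traces-table counts, which is exactly what the reference page documents.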

Mar 20, 2026
Categorical Scores for LLM-as-a-Judge Evaluators

LLM-as-a-Judge evaluators in Langfuse can now return categorical scores in addition to numeric ones. You can define a fixed set of allowed categories in the evaluator template, have the judge choose from them, and store the result as a native categorical score in Langfuse.

This is especially useful when the right answer is a label instead of a gradient:

  • Classify answers as correct, partially_correct, or incorrect
  • Mark support replies as resolved, needs_follow_up, or escalate
  • Label safety outcomes as safe, needs_review, or blocked

What's New:

  • Choose Numeric or Categorical when creating a custom LLM-as-a-Judge evaluator
  • Define the allowed category values directly in the evaluator template
  • Optionally allow multiple matches when more than one label applies; Langfuse creates one score per selected category
  • View categorical results in evaluator logs and reuse them across Langfuse's existing score tooling

Mar 10, 2026
Simplify Langfuse for Scale (Fast Preview v4)

Langfuse rolled out a simplified architecture with significantly faster product performance at scale. The main table is now observations — every LLM call, tool execution, and agent step is a row you can query directly, replacing the previous trace-centric view.

Key Changes:

  • Unified observations table: all inputs, outputs, and context attributes live directly on observations for faster filtering and aggregation
  • Observations API v2 and Metrics API v2 designed for faster querying at scale
  • Observation-level evaluations now execute in seconds
  • Charts, filters, and APIs are much faster across Langfuse Cloud
  • trace_id works like any other filter column (session_id, user_id, score) to group related observations

What's Faster:

  • Chart loading times are significantly shorter, with reliable loading over large time ranges
  • Faster browsing of traces, users, and sessions
  • Quicker filter responses in large projects
  • Faster evaluation workflows

Migration Required:

  • Upgrade to Python SDK v4 and JS/TS SDK v5 to avoid delays and see data in real time
  • Use saved views and filters to focus on key operations across your observations table

See Fast Preview Docs for rollout details, access, and migration steps.

Feb 17, 2026
Langfuse CLI

Fully use Langfuse from the CLI. Built for AI agents and power users.

A new CLI that wraps the entire Langfuse API, allowing agents and humans to interact with Langfuse directly from the terminal. Covers all API functionality including traces, observations, prompts, datasets, scores, sessions, metrics, and more.

Key capabilities:

  • Get started with npx langfuse-cli api <resource> <action>
  • Supports 26 resources with dynamic commands wrapping the full OpenAPI spec
  • Progressive disclosure for discovery and usage
  • Authentication via environment variables or .env file
  • Designed to work with Agent Skills for seamless agent integration

Use cases:

  • Let coding agents manage Langfuse directly from the editor
  • Script workflows and automate tasks in CI/CD and bash scripts
  • Faster than the UI for quick lookups and queries

Feb 13, 2026
Observation-Level Evaluations

Observation-level evaluations enable precise operation-specific scoring for production monitoring. LLM-as-a-Judge evaluations can now be run on individual observations—LLM calls, retrievals, tool executions, or any operation within your traces. Previously, evaluations could only be run on entire traces.

Key Features:

  • Evaluate only specific operations (e.g., final generation for helpfulness, retrieval step for relevance) rather than entire workflows
  • Each operation gets its own score attached to specific observations in the trace tree
  • Filter by observation type, name, metadata, or trace-level attributes to target exactly what matters
  • Run different evaluators on different operations simultaneously

Benefits:

  • Operation-level precision: Target final LLM responses, retrieval steps, or specific tool calls, reducing evaluation volume and cost
  • Compositional evaluation: Stack observation filters (type, name, metadata) with trace attributes filters (userId, sessionId, tags)
  • Scalable architecture: Built for high-volume workloads with fast, reliable evaluation at scale

Requirements:

  • SDK version: Python v3+ (OTel-based) or JS/TS v4+ (OTel-based)
  • Use propagate_attributes() in instrumentation to filter observations by trace-level attributes

Existing trace-level evaluators continue to work, with an upgrade guide available for migration.
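The targeting model above can be sketched in plain Python. The filter fields (type, name) come from the entry; the judge function and record shapes are stand-ins, not Langfuse's actual evaluator API:

```python
# Sketch of observation-level evaluation targeting: filter a trace's
# observations by type and name, then score only the matches. The
# judge is a toy stand-in for an LLM-as-a-Judge evaluator.
def evaluate_observations(observations, obs_type, name, judge):
    scores = {}
    for obs in observations:
        if obs["type"] == obs_type and obs["name"] == name:
            # Each matching operation gets its own score, attached to
            # that specific observation rather than the whole trace.
            scores[obs["id"]] = judge(obs["output"])
    return scores

trace = [
    {"id": "o1", "type": "GENERATION", "name": "final-answer", "output": "Paris"},
    {"id": "o2", "type": "SPAN", "name": "retrieval", "output": "[doc1, doc2]"},
]

# Toy "judge": helpfulness 1.0 if the output is non-empty.
scores = evaluate_observations(trace, "GENERATION", "final-answer",
                               lambda out: 1.0 if out else 0.0)
print(scores)  # {'o1': 1.0}
```

Only the final generation is scored; the retrieval span is skipped, which is how targeting reduces evaluation volume and cost.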

Feb 11, 2026
Versioned Datasets and Experiments

Fetch datasets at specific version timestamps and run experiments directly on versioned datasets across UI, API, and SDKs for full reproducibility.

Why versioned experiments matter

  • Full reproducibility: Re-run experiments on the exact dataset state from any point in time, even after items are updated or deleted. Reproduce results from weeks or months ago with complete confidence.
  • A/B testing with confidence: Compare model performance before and after dataset refinements. Test new prompts against the same baseline dataset version that your production model was evaluated on.
  • Regression testing: Run experiments on a specific dataset version while your team continues improving the dataset. Ensure new model versions don't regress on established benchmarks.

Fetch datasets at specific versions

Retrieve datasets as they existed at any timestamp via Python SDK, JS/TS SDK, or Langfuse UI. By default, APIs return the latest version. Navigate to Datasets → Select dataset → Items Tab → Toggle Version view to browse all historical versions.

Run experiments on versioned datasets

Execute experiments against specific dataset versions using the experiment runner or the UI. When running experiments in the UI, navigate to Run Prompt Experiment, select your dataset, and choose a version from the Dataset Version dropdown; the experiment then runs against that specific dataset state (or against the latest version if none is selected).

This completes the dataset versioning feature released in December.
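The semantics of "fetch a dataset at a version timestamp" can be sketched as replaying each item's version history up to a cutoff. This is illustrative only; the real SDK and API do this server-side, and the record shapes here are assumptions:

```python
# Sketch: reconstruct dataset state as of time t from a history of
# timestamped item events (value = None marks a deletion).
def dataset_as_of(history, t):
    state = {}
    for ts, item_id, value in sorted(history, key=lambda e: e[0]):
        if ts > t:
            break
        if value is None:          # deletion
            state.pop(item_id, None)
        else:                      # creation or update
            state[item_id] = value
    return state

history = [
    (1, "a", {"input": "q1", "expected": "v1"}),
    (2, "b", {"input": "q2", "expected": "v1"}),
    (3, "a", {"input": "q1", "expected": "v2"}),  # item a updated
    (4, "b", None),                               # item b deleted
]

print(dataset_as_of(history, 2))  # both items, original values
print(dataset_as_of(history, 4))  # only item a, with its updated value
```

Pinning an experiment to a timestamp therefore pins it to exactly this reconstructed state, which is what makes re-runs reproducible even after items change.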

Jan 14, 2026
Output Corrections

Added the ability to capture improved versions of LLM outputs directly in trace views. Domain experts can now add corrected outputs to traces and observations, with diff views comparing original and corrected outputs. Corrections are accessible via the API as scores with dataType: "CORRECTION". Key capabilities include:

  • Human-in-the-loop improvement: Domain experts review production outputs and provide corrections
  • Fine-tuning data at scale: Export corrected outputs alongside original inputs to create high-quality training datasets from real production data
  • Quality benchmarking: Compare actual vs expected outputs across production traces
  • JSON validation and plain text modes: Toggle between modes to match your data format
  • Use cases: Customer support agent responses, content generation preferences, code generation fixes, structured extraction examples
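
Since corrections surface via the API as scores with dataType: "CORRECTION", building fine-tuning pairs is mostly a join. A minimal sketch follows; the record shapes (traceId, stringValue) are assumptions for illustration, not the exact API payloads:

```python
# Sketch: pair original trace inputs with corrected outputs to build
# fine-tuning data. Only dataType == "CORRECTION" is documented; the
# other field names here are hypothetical.
def build_training_pairs(traces, scores):
    by_trace = {t["id"]: t for t in traces}
    pairs = []
    for score in scores:
        if score["dataType"] != "CORRECTION":
            continue
        trace = by_trace[score["traceId"]]
        pairs.append({"input": trace["input"],
                      "output": score["stringValue"]})  # corrected output
    return pairs

traces = [{"id": "t1", "input": "Summarize X", "output": "bad summary"}]
scores = [{"traceId": "t1", "dataType": "CORRECTION",
           "stringValue": "good summary"}]
print(build_training_pairs(traces, scores))
```
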

Jan 7, 2026
Comments Anchored to Text Selections

Anchor comments to specific text selections within trace and observation input, output, and metadata fields. Similar to Google Docs, you can now select text in input, output, or metadata fields, click the hovering comment button, and add comments anchored to that selection. Text-specific comments make it easy to discuss parts of LLM responses and assign them to people via mentions. This feature is available in the "JSON Beta" view.

Dec 22, 2025
Tool Usage Analytics

Add filtering, table columns, and dashboard widgets for analyzing tool usage in your LLM applications. Tool calls are now a distinct data structure in Langfuse, allowing you to filter observations by available and called tools and build dashboards for tool calls over time.

Tool Call Filters on the Observations Table:

  • Filter by Available Tools (count of tools available to an LLM)
  • Filter by Tool Calls (count of tools invoked by an LLM)
  • Filter by Available Tools Names (list of specific tool names available)
  • Filter by Tool Calls Names (list of specific tool names invoked)

Dashboard Widgets for Tool Calls:

  • New Metrics: toolDefinitions and toolCalls with user-selectable aggregation (SUM, AVG, MAX, MIN, percentiles)
  • New Filters: toolNames to break down metrics by tool name

Supported Frameworks: OpenAI, Langchain/LangGraph, Vercel AI SDK, Google ADK/Vertex AI, Microsoft Agent Framework. Pydantic AI and additional frameworks coming in future releases.

Note: This feature only works with traces ingested after this release and does not apply to historical data.

Dec 17, 2025
v2 Metrics and Observations API (Beta)

New high-performance v2 APIs for metrics and observations with cursor-based pagination, selective field retrieval, and optimized data architecture.

Why v2? The v1 /public/traces and /public/observations endpoints have been among the most resource-intensive APIs to serve. v2 addresses their main performance issues: there was no way to request partial data, offset pagination doesn't scale, and parsing large JSON payloads adds overhead. v2 is built on an optimized immutable data model that requires fewer joins and eliminates deduplication at query time.

Metrics API v2 (GET /api/public/v2/metrics) - Built on the optimized data model with significantly faster query performance. The traces view is no longer available in v2; use the observations view instead. Default limit of 100 rows per query. High-cardinality dimensions like id, traceId, userId, and sessionId can no longer be used for grouping in v2, though they remain available for filtering.

Observations API v2 (GET /api/public/v2/observations) - Redesigned endpoint with selective field retrieval (specify field groups: core, basic, io, usage), cursor-based pagination, optimized I/O handling (returns strings by default; set parseIoAsJson: true when needed), and stricter limits (default 50, max 1,000).

Migration Notes: v2 APIs are additive - v1 endpoints remain available. Update API calls to use /api/public/v2/ prefix, use fields parameter to specify field groups, replace page-based pagination with cursor-based pagination, and note that parseIoAsJson defaults to false in v2 (v1 always parsed as JSON).

Status: Beta, currently available on Langfuse Cloud only; a self-hosted migration path is in development. With current SDK versions, data may take ~5 minutes to appear on v2 endpoints; updated SDK versions are coming soon.
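The cursor-based pagination pattern can be sketched as a simple loop. `fetch_page` stands in for an HTTP GET of /api/public/v2/observations; the response shape `{"data": [...], "meta": {"nextCursor": ...}}` is an assumption for illustration, not the documented payload:

```python
# Sketch of cursor-based pagination: follow nextCursor until the
# server signals there are no more pages.
def fetch_all(fetch_page, limit=50):
    items, cursor = [], None
    while True:
        page = fetch_page(cursor=cursor, limit=limit)
        items.extend(page["data"])
        cursor = page["meta"].get("nextCursor")
        if cursor is None:       # no more pages
            return items

# Fake backend with three observations, to exercise the loop locally.
def fake_fetch(cursor, limit):
    data = [{"id": f"obs-{i}"} for i in range(3)]
    start = int(cursor or 0)
    chunk = data[start:start + limit]
    nxt = str(start + limit) if start + limit < len(data) else None
    return {"data": chunk, "meta": {"nextCursor": nxt}}

print(len(fetch_all(fake_fetch, limit=2)))  # 3
```

Unlike offset pagination, the cursor is opaque to the client, which is what lets the server paginate efficiently at scale.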

Dec 15, 2025
Dataset Item Versioning

Track dataset changes over time with automatic versioning on every addition, update, or deletion of dataset items.

Complete audit trail: View full history of changes at item level to understand what changed and when. Identify unintended edits and revert problematic changes.

Experiment reproducibility: Experiments are automatically tied to the exact dataset state at run time. When dataset items are modified after running an experiment, previous results remain tied to the dataset version they actually ran against.

Dataset evolution: Track how gold-label datasets improve as domain experts refine expected outputs. See exactly what changed and how it affects benchmark results.

Every addition, update, or deletion of dataset items creates a new dataset version identified by timestamp. Includes item-level versioning with full history and diffs, and dataset-level metadata tracking high-level changes.
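An item-level diff between two versions, as described above, can be sketched as three set operations (the version states here are made up for illustration):

```python
# Sketch of an item-level diff between two dataset version states.
def diff_versions(old, new):
    return {
        "added":   sorted(set(new) - set(old)),
        "deleted": sorted(set(old) - set(new)),
        "updated": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }

v1 = {"a": {"expected": "x"}, "b": {"expected": "y"}}
v2 = {"a": {"expected": "x2"}, "c": {"expected": "z"}}
print(diff_versions(v1, v2))
# {'added': ['c'], 'deleted': ['b'], 'updated': ['a']}
```
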

Coming soon: API support for fetching datasets at specific version timestamps and SDK support for running experiments on specific dataset versions.

Dec 12, 2025
OpenAI GPT-5.2 support

Langfuse supports OpenAI GPT-5.2 from day one across all major features. OpenAI has released GPT-5.2, bringing enhanced capabilities and performance improvements to its model lineup.

Start experimenting with GPT-5.2 in the Langfuse LLM playground, leverage it to power LLM-as-a-judge evaluations, and monitor usage with comprehensive cost tracking capabilities.

Dec 11, 2025
Bulk Add Observations to Datasets

Select multiple observations from the observations table and add them to a new or existing dataset with flexible field mapping.

Key Features:

  • Bulk add observations to datasets directly from the observations table
  • Build test datasets from production data by selecting multiple observations
  • Flexible field mapping system to control how observation data transforms into dataset items
  • Support for mapping entire fields as-is, extracting specific values using JSON path expressions, or building custom objects from multiple fields
  • Background processing with ability to continue working while operations complete
  • Partial success support: valid observations are added even if some fail schema validation
  • Monitor progress and view results in batch actions settings page
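
The field-mapping idea can be sketched with dotted paths as a stand-in for the JSON path expressions supported in the UI (the observation shape below is illustrative only):

```python
# Sketch: transform an observation into a dataset item via a mapping
# of target fields to simple dotted paths.
def get_path(obj, path):
    for key in path.split("."):
        obj = obj[key]
    return obj

def map_observation(obs, mapping):
    return {field: get_path(obs, path) for field, path in mapping.items()}

obs = {"input": {"messages": [{"role": "user", "content": "Hi"}]},
       "output": {"content": "Hello!"},
       "metadata": {"lang": "en"}}

item = map_observation(obs, {
    "input": "input",                     # map an entire field as-is
    "expected_output": "output.content",  # extract a nested value
    "language": "metadata.lang",
})
print(item["expected_output"])  # Hello!
```
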

Dec 2, 2025
Pricing Tiers for Context-Dependent Pricing

Langfuse now supports pricing tiers for models with context-dependent pricing, enabling accurate cost calculation. Some model providers charge different rates depending on the number of input tokens used. For example, Anthropic's Claude Sonnet 4.5 (with the 1M context window) and Google's Gemini 2.5 Pro and Gemini 3 Pro Preview all apply higher pricing when more than 200K input tokens are used.

How it works: Pricing tiers allow multiple price points for a single model, each with conditions that determine when that tier applies. Tiers are evaluated in priority order, and the first matching tier is used for cost calculation.

Pre-configured models: The following models now have pricing tiers pre-configured:

  • claude-sonnet-4-5-20250929: Standard / Large Context tier with > 200K input tokens threshold
  • gemini-2.5-pro: Standard / Large Context tier with > 200K input tokens threshold
  • gemini-3-pro-preview: Standard / Large Context tier with > 200K input tokens threshold

Custom pricing tiers: You can also define pricing tiers for your own custom models via the Langfuse UI or API. Each tier includes a name, priority, conditions that determine when the tier applies, and prices for cost per usage type.
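The evaluation rule (tiers checked in priority order, first match wins) can be sketched in a few lines. The prices below are made up for illustration, not actual provider rates:

```python
# Sketch of tiered cost calculation: tiers are checked in priority
# order and the first matching tier's prices are used.
def calc_cost(input_tokens, output_tokens, tiers):
    for tier in sorted(tiers, key=lambda t: t["priority"]):
        if tier["condition"](input_tokens):
            return (input_tokens * tier["input_price"]
                    + output_tokens * tier["output_price"])
    raise ValueError("no pricing tier matched")

tiers = [
    {"name": "Large Context", "priority": 1,
     "condition": lambda n: n > 200_000,       # documented threshold
     "input_price": 6e-6, "output_price": 22.5e-6},   # illustrative prices
    {"name": "Standard", "priority": 2,
     "condition": lambda n: True,              # default catch-all tier
     "input_price": 3e-6, "output_price": 15e-6},
]

print(calc_cost(1_000, 500, tiers))    # priced at the Standard tier
print(calc_cost(300_000, 500, tiers))  # priced at the Large Context tier
```

Note the catch-all condition on the lowest-priority tier: without it, small requests would match nothing.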

Nov 20, 2025
Native MCP Server

Langfuse now includes a native Model Context Protocol (MCP) server with write capabilities, enabling AI agents to fetch and update prompts directly.

The native MCP server is available at /api/public/mcp and provides five tools for comprehensive prompt management:

Read Operations:

  • getPrompt – Fetch specific prompts by name, label, or version
  • listPrompts – Browse all prompts with filtering and pagination

Write Operations:

  • createTextPrompt – Create new text prompt versions
  • createChatPrompt – Create new chat prompt versions (OpenAI-style messages)
  • updatePromptLabels – Manage labels across prompt versions

The native server uses a stateless architecture built directly into the platform with BasicAuth credentials, replacing the previous node package version. Includes complete setup guide and prompt-specific workflows documentation.

Nov 14, 2025
OpenAI GPT-5.1 support

Langfuse supports OpenAI GPT-5.1 from day one: experiment with it in the LLM playground, use it to power LLM-as-a-judge evaluations, and monitor usage with comprehensive cost tracking.

Nov 8, 2025
Dataset Folders

Use slashes in dataset names to create folders for better organization. Putting a slash (/) in a dataset name creates a folder: for example, folder1/dataset1 creates a folder named folder1 containing a dataset named dataset1. Folders are virtual, so the full dataset name remains folder1/dataset1 when accessing it via the SDK or API.

JSON Schema Enforcement for Dataset Items

Define JSON schemas for your dataset inputs and expected outputs to ensure data quality and consistency across your test datasets.

Key Features:

  • Add JSON Schema validation to datasets to ensure all items conform to expected structure
  • Automatic validation of dataset items against defined schemas
  • Detailed error messages for validation failures
  • Support for validation on creation, updates, and CSV imports

Benefits:

  • Ensures data quality: All items must match defined structure
  • Catches errors early: Invalid data rejected before entering dataset
  • Improves collaboration: Shared schemas ensure consistent formatting
  • Type safety: Define exact structures for inputs and expected outputs
  • Prevents test failures: Experiments won't fail due to malformed test data

Use Cases:

  • Validate JSON responses from function calling or tool use
  • Ensure message arrays follow correct format in chat applications
  • Maintain consistency when multiple team members contribute
  • Validate CSV uploads before adding items to datasets
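
A minimal sketch of the enforcement idea follows. Langfuse uses full JSON Schema; this toy validator only checks required keys and types, purely to illustrate how invalid items are rejected with readable errors:

```python
# Toy validator: checks required keys and value types. A conceptual
# stand-in for JSON Schema validation, not the actual implementation.
def validate_item(item, schema):
    errors = []
    for key, expected_type in schema["properties"].items():
        if key not in item:
            if key in schema.get("required", []):
                errors.append(f"missing required field: {key}")
        elif not isinstance(item[key], expected_type):
            errors.append(f"{key}: expected {expected_type.__name__}")
    return errors

schema = {"required": ["input"],
          "properties": {"input": str, "expected_output": str}}

print(validate_item({"input": "What is 2+2?"}, schema))  # [] (valid)
print(validate_item({"expected_output": 4}, schema))
# ['missing required field: input', 'expected_output: expected str']
```
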
Latest version: v4 · Tracking since Nov 4, 2025 · Last fetched Apr 13, 2026