Langfuse Changelog
Nov 7, 2025

Validate evaluation reliability and uncover insights with score analysis. Score Analytics provides a comprehensive set of tools for analyzing and comparing evaluation scores across your LLM application.

Key Features:

  • Multi-Score Comparison: Compare any two scores of the same data type to validate evaluation reliability with correlation metrics, confusion matrices, and alignment patterns
  • Statistical Validation: Measure agreement with Pearson correlation, Cohen's Kappa, F1 scores, and other metrics, with badge indicators for quick interpretation (see the sketch below)
  • Multi-Data Type Support: Analyze numeric (continuous), categorical (discrete), or boolean (binary) scores with type-appropriate visualizations
  • Matched vs All Analysis: Toggle between matched data (to measure alignment) and all data (to see coverage and individual score distributions)
  • Temporal Insights: Track score evolution over time with configurable intervals to identify quality regressions or improvements

Use Cases: Validate LLM judge reliability, measure human-AI annotation agreement, identify coverage gaps, spot quality regressions, and discover feature relationships through score comparison.
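To make these agreement metrics concrete, here is a minimal offline sketch of the same statistics Score Analytics reports, computed with scipy and scikit-learn; the score values are hypothetical, not from Langfuse:

  from scipy.stats import pearsonr
  from sklearn.metrics import cohen_kappa_score, f1_score

  # Hypothetical paired binary scores for the same traces:
  # an LLM judge's verdicts versus human annotations.
  llm_judge = [1, 0, 1, 1, 0, 1, 0, 1]
  human     = [1, 0, 0, 1, 0, 1, 0, 0]

  kappa = cohen_kappa_score(llm_judge, human)  # chance-corrected agreement
  f1 = f1_score(human, llm_judge)              # overlap on positive labels
  r, p = pearsonr(llm_judge, human)            # linear correlation

  print(f"kappa={kappa:.2f}  f1={f1:.2f}  pearson r={r:.2f}")

High Kappa and F1 suggest the LLM judge can stand in for human review; low values flag the evaluator for recalibration.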

Nov 6, 2025

Add human annotations while reviewing experiment results side-by-side. You can now annotate traces directly from the experiment compare view, streamlining the workflow of running experiments and adding human feedback.

Key Features:

  • Select any cell in the compare view to open the annotation side panel
  • Assign scores and leave comments while maintaining full experiment context
  • Use annotation score data to compare experiment results across different prompt versions and model configurations
  • Optimistic UI updates provide immediate feedback while data is saved in the background
  • Summary metrics in the compare view reflect annotations as you work

Score Configurations: Support for numerical scores (with min/max ranges), categorical scores (custom classifications), and binary scores (pass/fail judgments).

Workflow:

  1. Run an experiment via UI or SDK
  2. Open the experiment comparison view
  3. Click any item to open the annotation panel
  4. Assign scores and add comments
  5. Move to the next item for review

Standardized score configs ensure consistency across experiments and team members.
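Score configs can also be created programmatically. As a minimal sketch using the public score-configs endpoint (the keys and config name below are placeholders):

  import requests

  LANGFUSE_HOST = "https://cloud.langfuse.com"
  auth = ("pk-lf-...", "sk-lf-...")  # project public key, secret key

  # Numeric config with a min/max range, e.g. a 0-1 relevance score
  resp = requests.post(
      f"{LANGFUSE_HOST}/api/public/score-configs",
      auth=auth,
      json={"name": "relevance", "dataType": "NUMERIC", "minValue": 0, "maxValue": 1},
  )
  resp.raise_for_status()

Categorical and boolean configs follow the same shape, with dataType set accordingly.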

Experiment compare view now supports baseline designation. Select two experiment runs, click Compare, and set one as baseline to enable side-by-side analysis of baseline versus candidate performance.

Key Features:

  • Matched rows: Each row displays baseline and candidate outputs for the same dataset item using stable identifiers for direct comparison
  • Visual indicators: Green/red deltas for scores, cost, and latency highlight item-level changes
  • Column headers: Summary stats show aggregate performance differences between baseline and candidate
  • Trace access: Click any row to open execution traces and debug behavioral changes
  • Regression hunting: Use column filters to build regression worklists (e.g., filter by score thresholds or performance deltas)
  • Aggregate metrics: Charts tab shows high-level metric summaries comparing quality scores, cost, and latency distributions
  • Annotation support: Classify failures with structured scores using annotation mode

Getting Started: Run two experiment versions using the same dataset, select both runs and click Compare, designate the production version as baseline, and review aggregate metrics in the Charts tab or drill into item-level differences in the Outputs tab.
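As a sketch of that first step with the Python SDK (the dataset name, model ids, and my_app are placeholders, and the dataset-run API shown assumes SDK v3):

  from langfuse import get_client

  langfuse = get_client()  # reads LANGFUSE_* env vars
  dataset = langfuse.get_dataset("qa-eval")  # hypothetical dataset

  for run_name, model in [("baseline", "gpt-4o"), ("candidate", "gpt-4o-mini")]:
      for item in dataset.items:
          # Links this trace to the dataset run so both runs appear in Compare
          with item.run(run_name=run_name) as root_span:
              output = my_app(item.input, model=model)  # your app logic
              root_span.update_trace(input=item.input, output=output)

  langfuse.flush()

Both runs then show up on the dataset, ready to select and compare.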

The experiment compare view now supports filtering. Use filters to narrow results by specific criteria such as evaluator scores, cost, latency, or other metrics. For instance, show only items where an evaluator score dropped below a threshold, making it easy to identify and address problematic cases and to drill into specific experiment items faster.

Nov 5, 2025

Launch Week 4 release introducing comprehensive agent tracing capabilities. Features include:

Agent Tools – Tool calls are now rendered at the top of each generation, showing all available tools to the LLM. Click any tool to see its full definition, description, and parameters. In the Chat UI, called tools are displayed alongside their arguments and call IDs with numbering that matches the available tools list.

Trace Log View – A new Log View for traces displays all observations as a single concatenated stream. Skim every agent step by scrolling, and use Cmd/Ctrl+F to search through an entire trace; this is particularly useful for complex looping agents.

Observation Types – Expanded observation types to bring more meaning to spans, allowing easy identification of action types such as tool calls, embeddings, and agents.
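As a hedged sketch, assuming an SDK version that exposes the expanded types on the observe decorator, spans can be typed at instrumentation time (the function names and logic are illustrative):

  from langfuse import observe

  @observe(as_type="agent")  # rendered with agent semantics in the trace tree
  def plan_and_act(query: str) -> str:
      return search(query)

  @observe(as_type="tool")   # marked as a tool call
  def search(q: str) -> str:
      return f"results for {q}"  # placeholder tool logic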

Agent Graphs GA – Agent graphs are now generally available and work with any agent framework or custom instrumentation. The system infers graph structure from observation timings and nesting to visualize true execution flow, especially useful for complex looping observations.

Nov 4, 2025

Day 2 of Launch Week 4 brings a new integration with Amazon Bedrock AgentCore, enabling comprehensive observability for AI agents deployed on AWS infrastructure.

Trace AI agents built with Amazon Bedrock AgentCore via OpenTelemetry and Langfuse. The integration supports distributed tracing to connect traces from local development to production AgentCore deployments.
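As a minimal sketch of pointing an OpenTelemetry exporter at Langfuse's OTLP ingest endpoint (the keys are placeholders; Langfuse accepts Basic auth built from a project's public and secret keys):

  import base64
  import os

  public_key = "pk-lf-..."  # placeholder project keys
  secret_key = "sk-lf-..."
  token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()

  # Any OTel-instrumented agent (including AgentCore runtimes) can export here
  os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://cloud.langfuse.com/api/public/otel"
  os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {token}"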

Key capabilities:

  • Monitor complete agent execution flows in production
  • Track LLM calls with token counts and costs
  • Debug tool usage and MCP interactions
  • Analyze latency metrics at each step
  • Maintain trace continuity across distributed systems

Includes a comprehensive example repository demonstrating a complete continuous evaluation loop with AgentCore and Langfuse, covering experimentation, QA testing, and production monitoring.

Comments now support @mentions and emoji reactions, making it easier to collaborate with your team directly in Langfuse: tag teammates to notify them instantly, and add emoji reactions to comments for quick acknowledgments.

New features:

  • @mention autocomplete: Type @ in any comment to see a list of your project members and tag them instantly
  • Email notifications: Mentioned teammates receive email notifications with context about the comment and object
  • Emoji reactions: Add quick reactions to comments to acknowledge feedback, agree with findings, or show appreciation
  • Notification preferences: Control when you receive email notifications for mentions on a per-project basis

Langfuse now supports IdP-initiated SSO, allowing users to start authentication directly from their identity provider (e.g., Okta, Azure AD, Keycloak, JumpCloud). This enables a seamless authentication experience where users can click on the Langfuse application tile in their identity provider's dashboard and be automatically authenticated.

How It Works: When configuring IdP-initiated SSO, you set up your identity provider to redirect users to <YOUR_LANGFUSE_INSTANCE_URL>/auth/sso-initiate?provider=<PROVIDER>. Langfuse automatically detects the provider and initiates the SSO authentication flow.
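For example, a hypothetical self-hosted instance at langfuse.example.com with an identity provider configured as okta would redirect users to:

  https://langfuse.example.com/auth/sso-initiate?provider=okta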

Getting Started: On Langfuse Cloud, contact support to configure IdP-initiated SSO. When self-hosting, see the SSO documentation for configuration instructions. IdP-initiated SSO is available in Langfuse v3.126.0 and later.

Langfuse now integrates with Mixpanel to send LLM-related product metrics into your existing Mixpanel dashboards. Configure the integration in project settings by selecting your Mixpanel region and providing your Project Token. When activated, Langfuse sends metrics related to traces, generations, and scores to Mixpanel.

Key features:

  • Combines regular product analytics with LLM-specific metrics from Langfuse
  • Historical data synced to Mixpanel with automatic hourly updates (30-minute delay)
  • Enables analysis of user engagement with LLM features, retention impact, and conversion correlation
  • Example dashboard available using Mixpanel's AI Company KPIs template
  • Similar integration also available for PostHog users