{"id":"src_QDiFJlMYU0vB47IZrVUsM","slug":"langfuse-changelog","name":"Langfuse Changelog","type":"scrape","url":"https://langfuse.com/changelog","orgId":"org_r6pDl7yj_IIcLhhqfS0nx","org":{"slug":"langfuse","name":"Langfuse"},"isPrimary":false,"metadata":"{\"evaluatedMethod\":\"scrape\",\"evaluatedAt\":\"2026-04-07T17:18:16.941Z\",\"noFeedFound\":true,\"lastCrawlJobId\":\"15ef2c78-babf-4a4e-8ae8-f426265c2473\",\"crawlEnabled\":true,\"crawlPattern\":\"https://langfuse.com/changelog/**\",\"lastCrawlAt\":\"2026-04-13T21:00:35.107Z\"}","releaseCount":29,"releasesLast30Days":3,"avgReleasesPerWeek":0.6,"latestVersion":"v4","latestDate":"2026-04-13T00:00:00.000Z","changelogUrl":null,"hasChangelogFile":false,"lastFetchedAt":"2026-04-13T21:01:30.289Z","trackingSince":"2025-11-04T00:00:00.000Z","releases":[{"id":"rel_lwhIl8PyMSxSh880ZUlmM","version":"v4","title":"Compare Experiments Faster","summary":"Rebuilt experiment screens with faster loading, standalone access, and enhanced filtering for efficient analysis.\n\n**Key improvements:**\n- Faster load...","content":"Rebuilt experiment screens with faster loading, standalone access, and enhanced filtering for efficient analysis.\n\n**Key improvements:**\n- Faster loading and filtering leveraging observation-centric data model\n- Standalone experiments that don't require linked datasets; SDK-based experiments now visible in UI\n- Polished UI with visual deltas on scores, cost, and latency, baseline comparison, and score threshold filtering\n\nRun A/B tests between model versions, compare evaluation scores across prompt variants, or triage regressions with quicker feedback loops. 
Currently in open beta on Langfuse Cloud only.","publishedAt":"2026-04-13T00:00:00.000Z","url":"https://langfuse.com/changelog/2026-04-13-experiments-rebuild","media":[]},{"id":"rel_UwwoPE0WHK7Wwh2UubsZT","version":null,"title":"Boolean LLM-as-a-Judge Scores","summary":"LLM-as-a-Judge evaluators can now return boolean scores for `true` / `false` decisions. This makes it easier to model simple binary decisions directly...","content":"LLM-as-a-Judge evaluators can now return boolean scores for `true` / `false` decisions. This makes it easier to model simple binary decisions directly as native boolean scores and analyze them across existing score tooling.\n\n**Key features:**\n* Choose `Boolean` when creating a custom LLM-as-a-Judge evaluator\n* Store `true` / `false` outcomes as native boolean scores\n* Analyze boolean evaluator outputs in dashboards, filters, and score analytics\n\n**Use cases:** Detect User Disagreement, Out-of-Scope Requests, or Insufficient Answers as true/false decisions. Boolean scores complement existing numeric and categorical score types.","publishedAt":"2026-04-08T00:00:00.000Z","url":"https://langfuse.com/changelog/2026-04-08-boolean-llm-as-a-judge-scores","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2026-04-08-boolean-llm-as-a-judge-scores.jpg","alt":"Boolean LLM-as-a-Judge Scores","r2Key":"sources/langfuse-changelog/9aa5056bdfa8ec69.jpg","r2Url":"https://media.releases.sh/sources/langfuse-changelog/9aa5056bdfa8ec69.jpg"}]},{"id":"rel_ko3g5QGDi9_m4pNfsP-BU","version":"4","title":"Updates to Dashboards","summary":"Detailed reference for how dashboards behave differently in Langfuse v4 with the new observation-centric data model. Key changes include:\n\n- **Trace c...","content":"Detailed reference for how dashboards behave differently in Langfuse v4 with the new observation-centric data model. 
Key changes include:\n\n- **Trace counts** are now computed as `uniq(trace_id)` from the wide observations table instead of counting rows in the traces table\n- **\"Traces by time\"** is replaced by **\"Observations by time\"**, reflecting the shift to observations as the primary unit\n- **Score histograms** are now computed server-side, including all scores regardless of dataset size\n- **Trace names** fall back to the root observation name for OTEL-native ingestion, so previously unnamed traces now display correctly\n- **High-cardinality dimensions** (like userId) now require top-N queries with explicit limits\n- **NULL and empty strings** are treated as equivalent in filters\n\n[Read the full reference page](https://langfuse.com/docs/metrics/v4-dashboard-changes) for all changes and expected numerical differences when switching to the new observation-centric data model.","publishedAt":"2026-03-23T00:00:00.000Z","url":"https://langfuse.com/changelog/2026-03-23-v4-dashboard-changes","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2026-03-23-v4-dashboard-changes.jpg","alt":"Updates to Dashboards","r2Key":"sources/langfuse-changelog/0af20b11f49e30c6.jpg","r2Url":"https://media.releases.sh/sources/langfuse-changelog/0af20b11f49e30c6.jpg"}]},{"id":"rel_j3aQvfkOQjlzQa2BcSuZQ","version":null,"title":"Categorical LLM-as-a-Judge Scores","summary":"LLM-as-a-Judge evaluators in Langfuse can now return categorical scores in addition to numeric ones. You can define a fixed set of allowed categories ...","content":"LLM-as-a-Judge evaluators in Langfuse can now return categorical scores in addition to numeric ones. 
You can define a fixed set of allowed categories in the evaluator template, have the judge choose from them, and store the result as a native categorical score in Langfuse.\n\nThis is especially useful when the right answer is a label instead of a gradient:\n* Classify answers as `correct`, `partially_correct`, or `incorrect`\n* Mark support replies as `resolved`, `needs_follow_up`, or `escalate`\n* Label safety outcomes as `safe`, `needs_review`, or `blocked`\n\n**What's New:**\n* Choose `Numeric` or `Categorical` when creating a custom LLM-as-a-Judge evaluator\n* Define the allowed category values directly in the evaluator template\n* Optionally allow multiple matches when more than one label applies; Langfuse creates one score per selected category\n* View categorical results in evaluator logs and reuse them across Langfuse's existing score tooling","publishedAt":"2026-03-20T00:00:00.000Z","url":"https://langfuse.com/changelog/2026-03-20-categorical-llm-as-a-judge-scores","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2026-03-20-categorical-llm-as-a-judge-scores.jpg","alt":"Categorical LLM-as-a-Judge Scores","r2Key":"sources/langfuse-changelog/b25f029d4d2a395d.jpg","r2Url":"https://media.releases.sh/sources/langfuse-changelog/b25f029d4d2a395d.jpg"}]},{"id":"rel_1GPGeQwxFlaGJXbL-c2sG","version":"v4","title":"Simplify Langfuse for Scale (Fast Preview v4)","summary":"Langfuse rolled out a simplified architecture with significantly faster product performance at scale. The main table is now **observations** — every L...","content":"Langfuse rolled out a simplified architecture with significantly faster product performance at scale. 
The main table is now **observations** — every LLM call, tool execution, and agent step is a row you can query directly, replacing the previous trace-centric view.\n\n**Key Changes:**\n- Unified observations table: all inputs, outputs, and context attributes live directly on observations for faster filtering and aggregation\n- Observations API v2 and Metrics API v2 designed for faster querying at scale\n- Observation-level evaluations now execute in seconds\n- Charts, filters, and APIs are much faster across Langfuse Cloud\n- trace_id works like any other filter column (session_id, user_id, score) to group related observations\n\n**What's Faster:**\n- Chart loading time significantly shorter; confident loading over large time ranges\n- Faster browsing of traces, users, and sessions\n- Quicker filter responses in large projects\n- Faster evaluation workflows\n\n**Migration Required:**\n- Upgrade to Python SDK v4 and JS/TS SDK v5 to avoid delays and see data in real time\n- Use saved views and filters to focus on key operations across your observations table\n\nSee [Fast Preview Docs](https://langfuse.com/docs/v4) for rollout details, access, and migration steps.","publishedAt":"2026-03-10T00:00:00.000Z","url":"https://langfuse.com/changelog/2026-03-10-simplify-for-scale","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2026-03-10-v4-preview/v4-banner.png","alt":"Simplify Langfuse for Scale","r2Key":"sources/langfuse-changelog/3037a8d85c902dec.png","r2Url":"https://media.releases.sh/sources/langfuse-changelog/3037a8d85c902dec.png"}]},{"id":"rel_p6wjiQyngICb1kn4m5KpT","version":null,"title":"Langfuse CLI","summary":"Fully use Langfuse from the CLI. Built for AI agents and power users.\n\nA new CLI that wraps the entire Langfuse API, allowing agents and humans to int...","content":"Fully use Langfuse from the CLI. 
Built for AI agents and power users.\n\nA new CLI that wraps the entire Langfuse API, allowing agents and humans to interact with Langfuse directly from the terminal. Covers all API functionality including traces, observations, prompts, datasets, scores, sessions, metrics, and more.\n\n**Key capabilities:**\n- Get started with `npx langfuse-cli api <resource> <action>`\n- Supports 26 resources with dynamic commands wrapping the full OpenAPI spec\n- Progressive disclosure for discovery and usage\n- Authentication via environment variables or .env file\n- Designed to work with Agent Skills for seamless agent integration\n\n**Use cases:**\n- Let coding agents manage Langfuse directly from the editor\n- Script workflows and automate tasks in CI/CD and bash scripts\n- Faster than the UI for quick lookups and queries","publishedAt":"2026-02-17T00:00:00.000Z","url":"https://langfuse.com/changelog/2026-02-17-langfuse-cli","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2026-02-17-langfuse-cli.jpg","alt":"Langfuse CLI","r2Key":"sources/langfuse-changelog/b4837d6f3b4e29c3.jpg","r2Url":"https://media.releases.sh/sources/langfuse-changelog/b4837d6f3b4e29c3.jpg"}]},{"id":"rel_eIfE7jfvrT6oEAtupfcZW","version":null,"title":"Evaluate Individual Operations: Faster, More Precise LLM-as-a-Judge","summary":"Observation-level evaluations enable precise operation-specific scoring for production monitoring. LLM-as-a-Judge evaluations can now be run on indivi...","content":"Observation-level evaluations enable precise operation-specific scoring for production monitoring. LLM-as-a-Judge evaluations can now be run on individual observations—LLM calls, retrievals, tool executions, or any operation within your traces. 
Previously, evaluations could only be run on entire traces.\n\n**Key Features:**\n- Evaluate only specific operations (e.g., final generation for helpfulness, retrieval step for relevance) rather than entire workflows\n- Each operation gets its own score attached to specific observations in the trace tree\n- Filter by observation type, name, metadata, or trace-level attributes to target exactly what matters\n- Run different evaluators on different operations simultaneously\n\n**Benefits:**\n- **Operation-level precision**: Target final LLM responses, retrieval steps, or specific tool calls, reducing evaluation volume and cost\n- **Compositional evaluation**: Stack observation filters (type, name, metadata) with trace attributes filters (userId, sessionId, tags)\n- **Scalable architecture**: Built for high-volume workloads with fast, reliable evaluation at scale\n\n**Requirements:**\n- SDK version: Python v3+ (OTel-based) or JS/TS v4+ (OTel-based)\n- Use propagate_attributes() in instrumentation to filter observations by trace-level attributes\n\nExisting trace-level evaluators continue to work, with an upgrade guide available for migration.","publishedAt":"2026-02-13T00:00:00.000Z","url":"https://langfuse.com/changelog/2026-02-13-observation-level-evals","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2026-02-13-observation-evals.jpg","alt":"Evaluate Individual Operations: Faster, More Precise LLM-as-a-Judge","r2Key":"sources/langfuse-changelog/5b653e758460316c.jpg","r2Url":"https://media.releases.sh/sources/langfuse-changelog/5b653e758460316c.jpg"},{"type":"image","url":"https://langfuse.com/images/changelog/2026-02-13-observation-level-evals.png","alt":"Observation-level evaluation in trace tree","r2Key":"sources/langfuse-changelog/26dcfe649d1be13a.png","r2Url":"https://media.releases.sh/sources/langfuse-changelog/26dcfe649d1be13a.png"}]},{"id":"rel_vLWDvmrOkwR2dqzeHi7S6","version":null,"title":"Run Experiments on Versioned 
Datasets","summary":"Fetch datasets at specific version timestamps and run experiments directly on versioned datasets across UI, API, and SDKs for full reproducibility.\n\n#...","content":"Fetch datasets at specific version timestamps and run experiments directly on versioned datasets across UI, API, and SDKs for full reproducibility.\n\n## Why versioned experiments matter\n\n* **Full reproducibility**: Re-run experiments on the exact dataset state from any point in time, even after items are updated or deleted. Reproduce results from weeks or months ago with complete confidence.\n* **A/B testing with confidence**: Compare model performance before and after dataset refinements. Test new prompts against the same baseline dataset version that your production model was evaluated on.\n* **Regression testing**: Run experiments on a specific dataset version while your team continues improving the dataset. Ensure new model versions don't regress on established benchmarks.\n\n## Fetch datasets at specific versions\n\nRetrieve datasets as they existed at any timestamp via Python SDK, JS/TS SDK, or Langfuse UI. By default, APIs return the latest version. Navigate to **Datasets** → Select dataset → **Items Tab** → Toggle **Version view** to browse all historical versions.\n\n## Run experiments on versioned datasets\n\nExecute experiments against specific dataset versions using the experiment runner or via UI. 
When running experiments in the UI:\n\n1. Navigate to **Run Prompt Experiment**\n2. Select your dataset\n3. Choose a version from the **Dataset Version** dropdown\n4. The experiment runs against that specific dataset state; if no version is selected, it runs against the latest version\n\nThis completes the dataset versioning feature released in December.","publishedAt":"2026-02-11T00:00:00.000Z","url":"https://langfuse.com/changelog/2026-02-11-versioned-dataset-experiments","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2026-02-11-experiments-versioned-datasets.jpg","alt":"Run Experiments on Versioned Datasets","r2Key":"sources/langfuse-changelog/0f232ff3dcd1a541.jpg","r2Url":"https://media.releases.sh/sources/langfuse-changelog/0f232ff3dcd1a541.jpg"}]},{"id":"rel_ctH9GIODQ020bBhODs1Os","version":null,"title":"Corrected Outputs for Traces and Observations","summary":"Added the ability to capture improved versions of LLM outputs directly in trace views. Domain experts can now add corrected outputs to traces and obse...","content":"Added the ability to capture improved versions of LLM outputs directly in trace views. Domain experts can now add corrected outputs to traces and observations, with diff views comparing original and corrected outputs. Corrections are accessible via the API as scores with `dataType: \"CORRECTION\"`. 
Key capabilities include:\n\n- **Human-in-the-loop improvement**: Domain experts review production outputs and provide corrections\n- **Fine-tuning data at scale**: Export corrected outputs alongside original inputs to create high-quality training datasets from real production data\n- **Quality benchmarking**: Compare actual vs expected outputs across production traces\n- **JSON validation and plain text modes**: Toggle between modes to match your data format\n- **Use cases**: Customer support agent responses, content generation preferences, code generation fixes, structured extraction examples","publishedAt":"2026-01-14T00:00:00.000Z","url":"https://langfuse.com/changelog/2026-01-14-corrected-outputs","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2026-01-15-corrected-output.jpg","alt":"Corrected Outputs for Traces and Observations","r2Key":"sources/langfuse-changelog/4fda34ca8fc0eee1.jpg","r2Url":"https://media.releases.sh/sources/langfuse-changelog/4fda34ca8fc0eee1.jpg"}]},{"id":"rel_4U9MHKnYXXZ9tS0Ca6kv5","version":null,"title":"Inline Comments on Observation I/O","summary":"Anchor comments to specific text selections within trace and observation input, output, and metadata fields. Similar to Google Docs, you can now selec...","content":"Anchor comments to specific text selections within trace and observation input, output, and metadata fields. Similar to Google Docs, you can now select text in input, output, or metadata fields, click the hovering comment button, and add comments anchored to that selection. Text-specific comments make it easy to discuss parts of LLM responses and assign them to people via mentions. 
This feature is available in the \"JSON Beta\" view.","publishedAt":"2026-01-07T00:00:00.000Z","url":"https://langfuse.com/changelog/2026-01-07-inline-comments-on-trace-io","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2026-01-07-inline-io-comments.jpg","alt":"Inline Comments on Observation I/O","r2Key":"sources/langfuse-changelog/46e41c43a82e866b.jpg","r2Url":"https://media.releases.sh/sources/langfuse-changelog/46e41c43a82e866b.jpg"}]},{"id":"rel_AbbfM2FATLkeDo5NDliTx","version":null,"title":"Filter Observations by Tool Calls and add Tool Calls to Dashboard Widgets","summary":"Add filtering, table columns, and dashboard widgets for analyzing tool usage in your LLM applications. Tool calls are now a distinct data structure in...","content":"Add filtering, table columns, and dashboard widgets for analyzing tool usage in your LLM applications. Tool calls are now a distinct data structure in Langfuse, allowing you to filter observations by available and called tools and build dashboards for tool calls over time.\n\n**Tool Call Filters on the Observations Table:**\n- Filter by Available Tools (count of tools available to an LLM)\n- Filter by Tool Calls (count of tools invoked by an LLM)\n- Filter by Available Tools Names (list of specific tool names available)\n- Filter by Tool Calls Names (list of specific tool names invoked)\n\n**Dashboard Widgets for Tool Calls:**\n- New Metrics: `toolDefinitions` and `toolCalls` with user-selectable aggregation (SUM, AVG, MAX, MIN, percentiles)\n- New Filters: `toolNames` to break down metrics by tool name\n\n**Supported Frameworks:** OpenAI, Langchain/LangGraph, Vercel AI SDK, Google ADK/Vertex AI, Microsoft Agent Framework. 
Pydantic AI and additional frameworks coming in future releases.\n\n**Note:** This feature only works with traces ingested after this release and does not apply to historical data.","publishedAt":"2025-12-22T00:00:00.000Z","url":"https://langfuse.com/changelog/2025-12-22-tool-calls-filtering-visualization","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2025-12-22-tool-calls-tables.jpg","alt":"Filter Observations by Tool Calls and add Tool Calls to Dashboard Widgets","r2Key":"sources/langfuse-changelog/3c2598e64b750dde.jpg","r2Url":"https://media.releases.sh/sources/langfuse-changelog/3c2598e64b750dde.jpg"}]},{"id":"rel_NyTFRbnoG07I7NigfJ57b","version":"2","title":"v2 Metrics and Observations API (Beta)","summary":"New high-performance v2 APIs for metrics and observations with cursor-based pagination, selective field retrieval, and optimized data architecture.\n\n*...","content":"New high-performance v2 APIs for metrics and observations with cursor-based pagination, selective field retrieval, and optimized data architecture.\n\n**Why v2?** The v1 `/public/traces` and `/public/observations` endpoints have been among the most resource-intensive APIs to serve. v2 addresses performance issues: no single way to request partial data, offset pagination doesn't scale, and JSON parsing overhead. v2 is built on an optimized immutable data model that requires fewer joins and eliminates deduplication at query time.\n\n**Metrics API v2** (`GET /api/public/v2/metrics`) - Built on optimized data model with significantly faster query performance. The `traces` view is no longer available in v2; use the `observations` view instead. Default limit of 100 rows per query. 
High-cardinality dimensions like `id`, `traceId`, `userId`, and `sessionId` can no longer be used for grouping in v2, though they remain available for filtering.\n\n**Observations API v2** (`GET /api/public/v2/observations`) - Redesigned endpoint with selective field retrieval (specify field groups: `core`, `basic`, `io`, `usage`), cursor-based pagination, optimized I/O handling (returns strings by default; set `parseIoAsJson: true` when needed), and stricter limits (default 50, max 1,000).\n\n**Migration Notes:** v2 APIs are additive - v1 endpoints remain available. Update API calls to use the `/api/public/v2/` prefix, use the `fields` parameter to specify field groups, replace page-based pagination with cursor-based pagination, and note that `parseIoAsJson` defaults to `false` in v2 (v1 always parsed as JSON).\n\n**Status:** Currently in beta on Langfuse Cloud only. Self-hosted migration path in development. With current SDK versions, data may take ~5 minutes to appear on v2 endpoints; updated SDK versions coming soon.","publishedAt":"2025-12-17T00:00:00.000Z","url":"https://langfuse.com/changelog/2025-12-17-v2-metrics-and-observations-api","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2025-12-17-v2-Metrics-and-Observations-API-Betacl.jpg","alt":"v2 Metrics and Observations API (Beta)","r2Key":"sources/langfuse-changelog/a51e469c8b15b12a.jpg","r2Url":"https://media.releases.sh/sources/langfuse-changelog/a51e469c8b15b12a.jpg"}]},{"id":"rel_EqP49W4dZQKSmKhxASNV9","version":"2025-12-15","title":"Dataset Item Versioning","summary":"Track dataset changes over time with automatic versioning on every addition, update, or deletion of dataset items.\n\n**Complete audit trail**: View ful...","content":"Track dataset changes over time with automatic versioning on every addition, update, or deletion of dataset items.\n\n**Complete audit trail**: View full history of changes at item level to understand what changed and when. 
Identify unintended edits and revert problematic changes.\n\n**Experiment reproducibility**: Experiments are automatically tied to the exact dataset state at run time. When dataset items are modified after running an experiment, previous results remain tied to the dataset version they actually ran against.\n\n**Dataset evolution**: Track how gold-label datasets improve as domain experts refine expected outputs. See exactly what changed and how it affects benchmark results.\n\nEvery addition, update, or deletion of dataset items creates a new dataset version identified by timestamp. Includes item-level versioning with full history and diffs, and dataset-level metadata tracking high-level changes.\n\nComing soon: API support for fetching datasets at specific version timestamps and SDK support for running experiments on specific dataset versions.","publishedAt":"2025-12-15T00:00:00.000Z","url":"https://langfuse.com/changelog/2025-12-15-dataset-versioning","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2025-12-15-dataset-versioning.jpg","alt":"Dataset Item Versioning","r2Key":"sources/langfuse-changelog/83cade38b470b980.jpg","r2Url":"https://media.releases.sh/sources/langfuse-changelog/83cade38b470b980.jpg"}]},{"id":"rel_piuxyN3YqEYl-IQHj8m0b","version":"GPT-5.2","title":"OpenAI GPT-5.2 support","summary":"Langfuse now supports OpenAI GPT-5.2 with day 1 support across all major features. OpenAI has released GPT-5.2, bringing enhanced capabilities and per...","content":"Langfuse now supports OpenAI GPT-5.2 with day 1 support across all major features. 
OpenAI has released GPT-5.2, bringing enhanced capabilities and performance improvements to their model lineup.\n\nStart experimenting with GPT-5.2 in the Langfuse LLM playground, leverage it to power LLM-as-a-judge evaluations, and monitor usage with comprehensive cost tracking capabilities.","publishedAt":"2025-12-12T00:00:00.000Z","url":"https://langfuse.com/changelog/2025-12-12-openai-gpt-5-2-support","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2025-12-12-gpt-52.jpg","alt":"OpenAI GPT-5.2 support","r2Key":"sources/langfuse-changelog/67f61899d5cb016a.jpg","r2Url":"https://media.releases.sh/sources/langfuse-changelog/67f61899d5cb016a.jpg"}]},{"id":"rel_bBbI4VEzX-iroIHvuYylm","version":null,"title":"Batch Add Observations to Datasets","summary":"Select multiple observations from the observations table and add them to a new or existing dataset with flexible field mapping.\n\n**Key Features:**\n- B...","content":"Select multiple observations from the observations table and add them to a new or existing dataset with flexible field mapping.\n\n**Key Features:**\n- Bulk add observations to datasets directly from the observations table\n- Build test datasets from production data by selecting multiple observations\n- Flexible field mapping system to control how observation data transforms into dataset items\n- Support for mapping entire fields as-is, extracting specific values using JSON path expressions, or building custom objects from multiple fields\n- Background processing with ability to continue working while operations complete\n- Partial success support: valid observations are added even if some fail schema validation\n- Monitor progress and view results in batch actions settings 
page","publishedAt":"2025-12-11T00:00:00.000Z","url":"https://langfuse.com/changelog/2025-12-11-batch-add-observations-to-dataset","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2025-12-11-batch-add-observations-to-datasets.jpg","alt":"Batch Add Observations to Datasets","r2Key":"sources/langfuse-changelog/cd61b1a9eba3c85e.jpg","r2Url":"https://media.releases.sh/sources/langfuse-changelog/cd61b1a9eba3c85e.jpg"},{"type":"image","url":"https://langfuse.com/images/changelog/2025-12-11-fieldmapping.png","alt":"Field mapping","r2Key":"sources/langfuse-changelog/663c7d1fe63d37ee.png","r2Url":"https://media.releases.sh/sources/langfuse-changelog/663c7d1fe63d37ee.png"}]},{"id":"rel_azhX7lW4wjNRNC2LD0Hd1","version":null,"title":"Pricing Tiers for Accurate Model Cost Tracking","summary":"Langfuse now supports pricing tiers for models with context-dependent pricing, enabling accurate cost calculation. Some model providers charge differe...","content":"Langfuse now supports pricing tiers for models with context-dependent pricing, enabling accurate cost calculation. Some model providers charge different rates depending on the number of input tokens used. For example, Anthropic's Claude Sonnet 4.5 with 1M context window, Google's Gemini 2.5 Pro and Gemini 3 Pro Preview all apply higher pricing when more than 200K input tokens are used.\n\n**How it works:** Pricing tiers allow multiple price points for a single model, each with conditions that determine when that tier applies. 
Tiers are evaluated in priority order, and the first matching tier is used for cost calculation.\n\n**Pre-configured models:** The following models now have pricing tiers pre-configured:\n- claude-sonnet-4-5-20250929: Standard / Large Context tier with > 200K input tokens threshold\n- gemini-2.5-pro: Standard / Large Context tier with > 200K input tokens threshold\n- gemini-3-pro-preview: Standard / Large Context tier with > 200K input tokens threshold\n\n**Custom pricing tiers:** You can also define pricing tiers for your own custom models via the Langfuse UI or API. Each tier includes a name, priority, conditions that determine when the tier applies, and prices for cost per usage type.","publishedAt":"2025-12-02T00:00:00.000Z","url":"https://langfuse.com/changelog/2025-12-02-model-pricing-tiers","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2025-12-02-model-pricing-tiers/tiered-model-cost.jpg","alt":"Pricing Tiers for Accurate Model Cost Tracking","r2Key":"sources/langfuse-changelog/7d69bd5f091314da.jpg","r2Url":"https://media.releases.sh/sources/langfuse-changelog/7d69bd5f091314da.jpg"}]},{"id":"rel_QuO-iyHE_gQXgX3GFuOpH","version":null,"title":"Hosted MCP Server for Langfuse Prompt Management","summary":"Langfuse now includes a native Model Context Protocol (MCP) server with write capabilities, enabling AI agents to fetch and update prompts directly.\n\n...","content":"Langfuse now includes a native Model Context Protocol (MCP) server with write capabilities, enabling AI agents to fetch and update prompts directly.\n\nThe native MCP server is available at `/api/public/mcp` and provides five tools for comprehensive prompt management:\n\n**Read Operations:**\n- `getPrompt` – Fetch specific prompts by name, label, or version\n- `listPrompts` – Browse all prompts with filtering and pagination\n\n**Write Operations:**\n- `createTextPrompt` – Create new text prompt versions\n- `createChatPrompt` – Create new chat prompt versions (OpenAI-style 
messages)\n- `updatePromptLabels` – Manage labels across prompt versions\n\nThe native server uses a stateless architecture built directly into the platform with BasicAuth credentials, replacing the previous node package version. Includes complete setup guide and prompt-specific workflows documentation.","publishedAt":"2025-11-20T00:00:00.000Z","url":"https://langfuse.com/changelog/2025-11-20-native-mcp-server","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2025-11-20-native-mcp-server.png","alt":"Hosted MCP Server for Langfuse Prompt Management","r2Key":"sources/langfuse-changelog/5af30546381eacfd.png","r2Url":"https://media.releases.sh/sources/langfuse-changelog/5af30546381eacfd.png"}]},{"id":"rel_yBPayLjlwsoD7tvgWxT8p","version":null,"title":"OpenAI GPT-5.1 support","summary":"Langfuse now supports OpenAI GPT-5.1 with day 1 support for the LLM playground, LLM-as-a-judge evaluations, and comprehensive cost tracking. Start exp...","content":"Langfuse now supports OpenAI GPT-5.1 with day 1 support for the LLM playground, LLM-as-a-judge evaluations, and comprehensive cost tracking. Start experimenting with GPT-5.1 in the Langfuse LLM playground, leverage it for LLM-as-a-judge evaluations, and monitor usage with comprehensive cost tracking capabilities.","publishedAt":"2025-11-14T00:00:00.000Z","url":"https://langfuse.com/changelog/2025-11-14-openai-gpt-5-1-support","media":[]},{"id":"rel_HvKRQ0dRVjYpL0K8LYzx-","version":null,"title":"Organize Your Datasets in Folders","summary":"Use slashes in dataset names to create folders for better organization. Put a slash (/) in the dataset name to create a folder. For example, `folder1/...","content":"Use slashes in dataset names to create folders for better organization. Put a slash (/) in the dataset name to create a folder. For example, `folder1/dataset1` will create a folder named `folder1` and a dataset named `dataset1` inside it. 
The folders are virtual so the full dataset name remains `folder1/dataset1` if you want to access it via the SDK or API.","publishedAt":"2025-11-08T00:00:00.000Z","url":"https://langfuse.com/changelog/2025-10-27-dataset-folders","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2025-10-27-dataset-folders.png","alt":"Organize Your Datasets in Folders","r2Key":"sources/langfuse-changelog/b3f19a4ffe700a9d.png","r2Url":"https://media.releases.sh/sources/langfuse-changelog/b3f19a4ffe700a9d.png"}]},{"id":"rel_Fod6H7eElHq_X10FnL4qX","version":"Launch Week 4","title":"JSON Schema Enforcement for Dataset Items","summary":"Define JSON schemas for your dataset inputs and expected outputs to ensure data quality and consistency across your test datasets.\n\n**Key Features:**\n...","content":"Define JSON schemas for your dataset inputs and expected outputs to ensure data quality and consistency across your test datasets.\n\n**Key Features:**\n- Add JSON Schema validation to datasets to ensure all items conform to expected structure\n- Automatic validation of dataset items against defined schemas\n- Detailed error messages for validation failures\n- Support for validation on creation, updates, and CSV imports\n\n**Benefits:**\n- Ensures data quality: All items must match defined structure\n- Catches errors early: Invalid data rejected before entering dataset\n- Improves collaboration: Shared schemas ensure consistent formatting\n- Type safety: Define exact structures for inputs and expected outputs\n- Prevents test failures: Experiments won't fail due to malformed test data\n\n**Use Cases:**\n- Validate JSON responses from function calling or tool use\n- Ensure message arrays follow correct format in chat applications\n- Maintain consistency when multiple team members contribute\n- Validate CSV uploads before adding items to 
datasets","publishedAt":"2025-11-08T00:00:00.000Z","url":"https://langfuse.com/changelog/2025-11-06-dataset-schema-enforcement","media":[{"type":"image","url":"https://langfuse.com/images/changelog/2025-11-06-dataset-schema-enforcement.png","alt":"JSON Schema Enforcement for Dataset Items","r2Key":"sources/langfuse-changelog/7db617f4cc9c6de8.png","r2Url":"https://media.releases.sh/sources/langfuse-changelog/7db617f4cc9c6de8.png"}]}],"pagination":{"page":1,"pageSize":20,"totalPages":2,"totalItems":29},"summaries":{"rolling":{"windowDays":90,"summary":"Langfuse shipped a major architectural shift toward observation-centric data modeling while expanding evaluation capabilities. The v4 release rebuilt the core table structure to treat individual operations—LLM calls, tool executions, agent steps—as first-class query units instead of nesting them under traces, unlocking significantly faster filtering and aggregation at scale. This foundation enabled observation-level evaluations that score specific operations rather than entire workflows, and boolean and categorical LLM-as-a-Judge scores that move beyond numeric gradients. Experiments got a speed overhaul with faster UI loading and the ability to run against versioned datasets for reproducibility, while new tools like the CLI and corrected outputs for human-in-the-loop workflows expanded how teams interact with production data.","releaseCount":9,"generatedAt":"2026-04-13T21:01:33.729Z"},"monthly":[{"year":2026,"month":3,"summary":"Langfuse shipped its v4 architecture redesign centered on observations as the primary data unit instead of traces, enabling significantly faster performance at scale. The unified observations table replaced the previous trace-centric model, with Observations API v2 and Metrics API v2 designed for quicker filtering and aggregation across dashboards, evaluations, and large datasets. 
LLM-as-a-Judge evaluators graduated to support categorical scores alongside numeric ones, letting you classify outcomes like `correct`/`partially_correct`/`incorrect` directly in templates, while dashboard computation shifted to observation-level metrics with server-side score histograms and high-cardinality dimension queries now requiring explicit top-N limits.","releaseCount":3,"generatedAt":"2026-04-13T21:01:36.217Z"}]}}