Connect Bedrock-hosted OpenAI models to Langfuse LLM Connections with OpenAI Responses API support.
Langfuse Changelog
The hosted Langfuse MCP server now includes new tools for working with observations, metrics, scores, score configs, datasets, dataset items, dataset runs, dataset run items, comments, annotation queues, models, media, and health. Fifteen tool categories total.
Agents such as Claude Code, Linear Agents, or custom internal tools can now investigate a production issue, pull the relevant observation, query metrics, drop a comment for the team, create a score, or stage a dataset item for regression testing, all without leaving the chat.
The MCP server complements the Langfuse Skill (Day 2 of Launch Week 5) and the Langfuse CLI. Use the CLI when your agent can run bash and pre-filter data; use the MCP server when it cannot. To keep the server read-only, allow-list only the lookup tools.
Run deterministic Python or TypeScript checks on observations and experiments in Langfuse.
Not every evaluation needs an LLM. JSON parseability, schema validation, exact match, required tool arguments, and custom business rules are things you would rather verify with code than ask an LLM to "rate this 1–5". Code checks are deterministic and reproducible, and they have no token cost.
Write a small evaluate function in Python or TypeScript directly in the Langfuse UI, attach it to live observations or to a dataset experiment, and the result lands as a native Langfuse score. It shows up in trace views, experiment comparisons, filters, dashboards, and Score Analytics next to your existing scores.
Code evaluators sit alongside LLM-as-a-Judge rather than replacing it. Code wins for objective checks. A judge wins for semantic quality, tone, helpfulness, or rubric reasoning.
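As a rough illustration of the kind of objective check described above, here is a minimal Python sketch that verifies an output is a JSON object containing a set of required keys. The function signature, the `required_keys` parameter, and the returned dictionary shape are assumptions made for this example; the interface the Langfuse UI actually expects may differ.

```python
import json


# Hypothetical evaluator: the signature and the returned dict shape are
# assumptions for illustration, not the documented Langfuse interface.
def evaluate(output: str, required_keys: list[str]) -> dict:
    """Return value 1 if `output` is a JSON object with every required key, else 0."""
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return {"value": 0, "comment": "output is not valid JSON"}

    if not isinstance(parsed, dict):
        return {"value": 0, "comment": "output is valid JSON but not an object"}

    missing = [key for key in required_keys if key not in parsed]
    if missing:
        return {"value": 0, "comment": f"missing keys: {', '.join(missing)}"}
    return {"value": 1, "comment": "all required keys present"}
```

The same input always produces the same score, which is the property that makes a check like this a better fit for code than for a judge model.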
Langfuse Cloud is rolling out ClickHouse full-text search, improving UI search and adding the matches operator to Observations API v2.
Large input/output searches that took 18 seconds and scanned 494 GB now return in under half a second and read less than a gigabyte. Metadata-heavy queries dropped from 1.6s to 0.2s. The UI gets faster for humans hunting a bug, and the new matches operator on Observations API v2 gives agents and scripts the same token-based search programmatically.
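To make "token-based" concrete, the sketch below contrasts token matching with plain substring matching in Python. It is a conceptual illustration only: the tokenizer (lowercased alphanumeric runs) and the helper names are assumptions, and the matches operator and ClickHouse's index may tokenize and match differently.

```python
import re


def tokenize(text: str) -> set[str]:
    """Split text into lowercase alphanumeric tokens (an assumed, simplified tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_all_tokens(document: str, query: str) -> bool:
    """True when every token of the query appears as a whole token in the document."""
    return tokenize(query) <= tokenize(document)


doc = "Payment gateway timeout error"
whole_tokens = matches_all_tokens(doc, "timeout payment")  # both are whole tokens in doc
partial_token = matches_all_tokens(doc, "time")            # "time" is only part of a token
substring_hit = "time" in doc.lower()                      # substring search still finds it
```

Under these assumptions, word order does not matter, and a partial word that a substring search would find is not a token match.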
Built on top of ClickHouse's new full-text search release.
Your agent's playbook for production-ready LLM apps.
The Langfuse Skill lets you hand your AI coding agent a playbook for working with Langfuse. It teaches Claude Code, Cursor, Codex, etc. how to instrument an app, query traces, manage prompts, and set up evaluators. Drop it into your editor, then describe the job in plain language and the agent runs with it.
Used with Codex, the LLM-as-a-Judge calibration skill can produce a full analysis with accuracy, F1, precision, recall, and cost, all graphed directly in the new Langfuse Experiments view.
Five days. Five drops. New building blocks for taking AI applications from prototype to production. Unveiled live at ClickHouse OpenHouse (May 25–29, 2026).
- Day 01 (Mon May 25): Experiments in CI/CD — Run Langfuse experiments inside GitHub Actions, fail workflows on score regressions, post results to PRs.
- Day 02 (Tue May 26): Langfuse agent skill — A playbook for coding agents (Claude Code, Cursor, Codex) to instrument apps, query traces, manage prompts, and set up evaluators.
- Day 03 (Wed May 27): Full-Text Search — ClickHouse-powered full-text search on Langfuse Cloud; 18s queries now return in <0.5s.
- Day 04 (Thu May 28): Code evaluators — Write Python or TypeScript evaluate functions in the UI for deterministic, reproducible checks.
- Day 05 (Fri May 29): Langfuse MCP, expanded — MCP server now covers observations, metrics, scores, datasets, comments, annotation queues, and more (15 tool categories).
Run Langfuse experiments in GitHub Actions to catch quality regressions before releasing changes to production.
The new GitHub Action tests every pull request against a Langfuse dataset, fails the workflow when scores drop below the threshold you set, and posts the result back to the PR as a comment. Every run is tracked in Langfuse so you can dig into regressions later.
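The gating step described above can be sketched in Python as follows. This is not the GitHub Action's implementation; the function name, the use of a per-score mean, and the input shapes are assumptions chosen to illustrate failing a run when a score drops below a configured threshold.

```python
from statistics import mean


def check_thresholds(
    scores: dict[str, list[float]],
    thresholds: dict[str, float],
) -> list[str]:
    """Return one failure message per score whose mean is below its threshold.

    `scores` maps a score name to its per-item values for one experiment run;
    `thresholds` maps a score name to the minimum acceptable mean.
    """
    failures = []
    for name, minimum in thresholds.items():
        values = scores.get(name, [])
        if not values:
            # Treat a missing score as a failure rather than silently passing.
            failures.append(f"{name}: no scores recorded")
            continue
        average = mean(values)
        if average < minimum:
            failures.append(f"{name}: mean {average:.3f} is below threshold {minimum:.3f}")
    return failures
```

In a CI job, a non-empty result would typically be printed and turned into a non-zero exit code so that the workflow fails.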
Blob storage, PostHog, and Mixpanel exports default to enriched observations for new Cloud projects
New Langfuse Cloud projects use enriched observations for blob storage, PostHog, and Mixpanel exports. The legacy traces/observations sources stay available on existing projects and self-hosted deployments.
Use your ClickHouse Cloud account to sign in to Langfuse Cloud, and link it to an existing Langfuse account.
Pick which field groups are written to each row and enable gzip compression in scheduled S3, GCS, and Azure exports. Shrink files and drop fields you don't want to land in your warehouse.
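For downstream consumers, a gzip-compressed export can be read in a streaming fashion with the Python standard library. The sketch below assumes the export is written as JSON Lines (one JSON object per line); that file format and the `read_export` helper name are assumptions for illustration, so adjust the parsing to whatever format your export is configured to produce.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def read_export(path: Path) -> Iterator[dict]:
    """Yield one record per non-empty line of a gzip-compressed JSON Lines file.

    Assumes one JSON object per line; this is an illustrative assumption,
    not a statement about the export's actual format.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the file is decompressed line by line, memory use stays flat even for large exports.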
Fetch a trace's tags, release, and trace name directly on each observation row via the v2 observations API.
Langfuse Academy is an open explanation of the AI engineering lifecycle: tracing, monitoring, datasets, experiments, and evaluation, and how the pieces fit together. The structure treats the lifecycle as a continuous loop, explaining why each step exists and how the steps connect.
Organization admins on Langfuse Cloud can now verify domains and configure Enterprise SSO directly in settings.
A new dedicated cloud region in Tokyo keeps traces, prompts, and evaluation data inside Japan.
Use Amazon Bedrock API keys to connect Bedrock models to Langfuse for Playground, LLM-as-a-judge evaluations, and prompt experiments.
Experiments now live alongside Datasets as their own top-level feature—run them with or without datasets, compare across runs, and track progress over time.
Rebuilt experiment screens with faster loading, standalone access, and enhanced filtering for efficient analysis.
Key improvements:
- Faster loading and filtering, built on the observation-centric data model
- Standalone experiments that don't require linked datasets; SDK-based experiments now visible in UI
- Polished UI with visual deltas on scores, cost, and latency, baseline comparison, and score threshold filtering
Run A/B tests between model versions, compare evaluation scores across prompt variants, or triage regressions with quicker feedback loops. Currently in open beta on Langfuse Cloud only.
Capture open-ended feedback and qualitative annotations with the new TEXT score type.
LLM-as-a-Judge evaluators can now return boolean scores for true / false decisions. This makes it easier to model simple binary decisions directly as native boolean scores and analyze them across existing score tooling.
Key features:
- Choose Boolean when creating a custom LLM-as-a-Judge evaluator
- Store true/false outcomes as native boolean scores
- Analyze boolean evaluator outputs in dashboards, filters, and score analytics
Use cases: Detect User Disagreement, Out-of-Scope Requests, or Insufficient Answers as true/false decisions. Boolean scores complement existing numeric and categorical score types.

