Estimated end-of-life date, accurate to within three months: 05-2027. See the support level definitions for more information.
Span.parent_id will change from Optional[int] to int in v5.0.0.
Experiments now report their execution status to the backend. The status transitions to running when execution starts, completed on success, failed when tasks or evaluators error with raise_errors=False, and interrupted when the experiment is stopped by an exception. #16713
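The status lifecycle above can be sketched as a small state machine. This is illustrative only: the states mirror the release note, but the ExperimentStatus enum and finish_status helper are hypothetical and not part of ddtrace.

```python
from enum import Enum

class ExperimentStatus(Enum):
    # Hypothetical states mirroring the statuses reported to the backend.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    INTERRUPTED = "interrupted"

def finish_status(task_errors: int, interrupted: bool, raise_errors: bool) -> ExperimentStatus:
    """Pick the terminal status for a run, per the rules described above."""
    if interrupted:
        # The experiment was stopped by an exception mid-run.
        return ExperimentStatus.INTERRUPTED
    if task_errors and not raise_errors:
        # Tasks/evaluators errored but errors were swallowed (raise_errors=False).
        return ExperimentStatus.FAILED
    return ExperimentStatus.COMPLETED
```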
Adds LLMObs.publish_evaluator() to sync a locally-defined LLMJudge evaluator to the Datadog UI as a custom LLM-as-Judge evaluation.
Adds support for DeepEval evaluations in LLM Observability Experiments: users can pass a DeepEval metric (which inherits from either BaseMetric or BaseConversationalMetric) directly as an evaluator in an LLM Obs experiment.
Example:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

from ddtrace.llmobs import LLMObs

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK",
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    async_mode=True,
)

dataset = LLMObs.create_dataset(
    dataset_name="<DATASET_NAME>",
    description="<DATASET_DESCRIPTION>",
    records=[RECORD_1, RECORD_2, RECORD_3, ...],
)

def my_task(input_data, config):
    return input_data["output"]

def my_summary_evaluator(inputs, outputs, expected_outputs, evaluators_results):
    return evaluators_results["Correctness"].count(True)

experiment = LLMObs.experiment(
    name="<EXPERIMENT_NAME>",
    task=my_task,
    dataset=dataset,
    evaluators=[correctness_metric],
    summary_evaluators=[my_summary_evaluator],  # optional, used to summarize the experiment results
    description="<EXPERIMENT_DESCRIPTION>",
)
result = experiment.run()
Adds experiment summary logging after run() with row count, run count, per-evaluator stats, and error counts.
Adds max_retries and retry_delay parameters to experiment.run() for retrying failed tasks and evaluators. Example: experiment.run(max_retries=3, retry_delay=lambda attempt: 2 ** attempt).
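The retry semantics can be illustrated with a standalone helper. This is only a sketch of exponential backoff under the release note's calling convention; retry_with_backoff is a hypothetical name, not the actual experiment runner.

```python
import time

def retry_with_backoff(fn, max_retries=3, retry_delay=lambda attempt: 2 ** attempt):
    """Call fn, retrying up to max_retries additional times.

    retry_delay maps the attempt number (starting at 0) to a sleep in
    seconds, matching the retry_delay=lambda attempt: 2 ** attempt
    example in the note above.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                # Out of retries: surface the last failure to the caller.
                raise
            time.sleep(retry_delay(attempt))
```

Passing a callable for the delay lets callers choose constant, linear, or exponential backoff without the runner hard-coding a policy.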
This introduces LLMObs.get_prompt() to retrieve managed prompts from Datadog's Prompt Registry. The method returns a ManagedPrompt object with a format()
method for variable substitution. Prompt updates propagate to running applications within the cache TTL (default: 60 seconds).
Use with annotation_context or annotate to correlate prompts with LLM spans:
prompt = LLMObs.get_prompt("greeting")
variables = {"user": "Alice"}

with LLMObs.annotation_context(prompt=prompt.to_annotation_dict(**variables)):
    openai.chat.completions.create(messages=prompt.format(**variables))
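The cache-TTL behavior described above can be sketched with a minimal time-based cache. This is illustrative only: TTLCache is a hypothetical class, not ddtrace's implementation; it merely shows why a server-side prompt edit becomes visible within one TTL window.

```python
import time

class TTLCache:
    """Caches fetched values for ttl seconds per key.

    A refetch after expiry picks up server-side updates, which is why
    prompt edits propagate to running applications within the cache TTL
    (default 60 seconds in the note above).
    """
    def __init__(self, fetch, ttl=60.0):
        self._fetch = fetch
        self._ttl = ttl
        self._store = {}  # key -> (value, fetched_at)

    def get(self, key):
        hit = self._store.get(key)
        if hit is not None and time.monotonic() - hit[1] < self._ttl:
            return hit[0]  # still fresh: no network round trip
        value = self._fetch(key)
        self._store[key] = (value, time.monotonic())
        return value
```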
Experiments propagate canonical_ids from dataset records to the corresponding experiment spans when present. The canonical_ids are only guaranteed to be available after calling pull_dataset.
LLMObs.create_dataset supports a bulk_upload parameter to control data-upload behavior. Both LLMObs.create_dataset and LLMObs.create_dataset_from_csv support a deduplicate parameter.
A subset of dataset records can now be pulled by tag using the tags argument to LLMObs.pull_dataset, provided as a list of key:value strings: LLMObs.pull_dataset(dataset_name="my-dataset", tags=["env:prod", "version:1.0"]).
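The tag filter can be illustrated with plain-Python matching. This is a sketch only: matches_tags is a hypothetical helper, and the all-tags-must-match (AND) semantics shown here are an assumption, not something the release note confirms.

```python
def matches_tags(record_tags, wanted):
    """True if the record carries every requested "key:value" tag.

    AND semantics across the listed tags are an assumption for this sketch.
    """
    return set(wanted).issubset(set(record_tags))

records = [
    {"id": 1, "tags": ["env:prod", "version:1.0"]},
    {"id": 2, "tags": ["env:staging", "version:1.0"]},
]
# Keep only records matching both env:prod and version:1.0.
pulled = [r for r in records if matches_tags(r["tags"], ["env:prod", "version:1.0"])]
```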
Fixes an issue with LLMObs.create_dataset.
Fixes an AttributeError on openai-agents >= 0.8.0 caused by the removal of AgentRunner._run_single_turn.
Fixes an issue where the gevent module was imported unnecessarily even when the profiler was not enabled.
An issue involving <module> in flame graphs has been fixed.
Fixes a TypeError raised at import time for libraries such as kopf that use union type annotations (for example, asyncio.Condition | None) at class definition time.
Adds the kafka_cluster_id tag to Kafka offset/backlog tracking for confluent-kafka. Previously, the cluster ID was only included in DSM checkpoint edge tags (produce/consume) but missing from offset commit and produce offset backlogs. This ensures correct attribution of backlog data to specific Kafka clusters when multiple clusters share topic names.
Fixes an issue where concurrent WAF calls on the same context (via contextvars) could cause use-after-free or double-free crashes (SIGSEGV) inside libddwaf. A per-context lock now serializes WAF calls on the same context.
Fixes an issue involving ddtrace.internal.wrapping.context.BaseWrappingContext.
Fixes an incompatibility with pytest-html and other third-party reporting plugins caused by the ddtrace pytest plugin using a non-standard dd_retry test outcome for retry attempts. The outcome is now set to rerun, the standard value used by pytest-rerunfailures and recognized by reporting plugins.
Fixes a RuntimeError: generator didn't yield in the Symbol DB remote config subscriber when the process has no writable temporary directory.
Fixes a RuntimeError during forks.
Promotes LLMJudge, BooleanStructuredOutput, ScoreStructuredOutput, and CategoricalStructuredOutput to the public ddtrace.llmobs module level.
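The per-context lock fix follows a general pattern: serialize calls that share mutable native state behind a lock scoped to that state, so different contexts still run in parallel. A minimal sketch of the pattern, with a hypothetical WafContext class that is not ddtrace's actual implementation:

```python
import threading

class WafContext:
    """Each context owns one lock, so concurrent calls on the same
    context are serialized while calls on different contexts remain
    parallel (the pattern behind the libddwaf crash fix above)."""
    def __init__(self):
        self._lock = threading.Lock()
        self.calls = 0

    def run(self, payload):
        with self._lock:  # serializes access to per-context state
            self.calls += 1
            return len(payload)
```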