LLM Observability: Adds support for DeepEval evaluations in LLM Observability Experiments. Users can now pass a DeepEval evaluation (a metric that inherits from either BaseMetric or BaseConversationalMetric) as an evaluator in an LLM Observability Experiment.
Example:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
from ddtrace.llmobs import LLMObs

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    async_mode=True
)

dataset = LLMObs.create_dataset(
    dataset_name="<DATASET_NAME>",
    description="<DATASET_DESCRIPTION>",
    records=[RECORD_1, RECORD_2, RECORD_3, ...]
)

def my_task(input_data, config):
    return input_data["output"]

def my_summary_evaluator(inputs, outputs, expected_outputs, evaluators_results):
    return evaluators_results["Correctness"].count(True)

experiment = LLMObs.experiment(
    name="<EXPERIMENT_NAME>",
    task=my_task,
    dataset=dataset,
    evaluators=[correctness_metric],
    summary_evaluators=[my_summary_evaluator],  # optional, used to summarize the experiment results
    description="<EXPERIMENT_DESCRIPTION>."
)

result = experiment.run()
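To make the summary evaluator in the example easier to follow, here is a standalone sketch that calls it directly, without ddtrace or DeepEval. It assumes that evaluators_results maps each evaluator name to a list with one result per record, which is inferred from the .count(True) call in the example rather than confirmed by these notes. All sample values are invented for illustration.

```python
def my_summary_evaluator(inputs, outputs, expected_outputs, evaluators_results):
    # Count the records for which the "Correctness" evaluator returned True.
    return evaluators_results["Correctness"].count(True)


# Hypothetical data for three records (assumed shape, invented values).
inputs = [{"question": "q1"}, {"question": "q2"}, {"question": "q3"}]
outputs = ["a", "b", "c"]
expected_outputs = ["a", "x", "c"]
evaluators_results = {"Correctness": [True, False, True]}

passed = my_summary_evaluator(inputs, outputs, expected_outputs, evaluators_results)
print(passed)
```

Note that list.count compares by equality, so values such as 1 or 1.0 are also counted as True. If an evaluator reports numeric scores rather than booleans, an explicit threshold comparison may be a clearer choice.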
[...] run() with row count, run count, per-evaluator stats, and error counts.
[...] max_retries and retry_delay parameters to experiment.run() for retrying failed tasks and evaluators. Example: experiment.run(max_retries=3, retry_delay=lambda attempt: 2 ** attempt).
[...] contextvars) could cause use-after-free or double-free crashes (SIGSEGV) inside libddwaf. A per-context lock now serializes WAF calls on the same context.
[...] ddtrace.internal.wrapping.context.BaseWrappingContext.
[...] pytest-html and other third-party reporting plugins caused by the ddtrace pytest plugin using a non-standard dd_retry test outcome for retry attempts. The outcome is now set to rerun, which is the standard value used by pytest-rerunfailures and recognized by reporting plugins.
[...] RuntimeError: generator didn't yield in the Symbol DB remote config subscriber when the process has no writable temporary directory.
[...] <module> in flame graphs has been fixed.
Fetched March 26, 2026
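The retry_delay=lambda attempt: 2 ** attempt example in the retry note above describes an exponential backoff schedule. The sketch below shows what such a schedule could look like in a generic retry loop. It is not ddtrace code: run_with_retries and its sleep parameter are hypothetical, and numbering attempts from 0 is an assumption, since the notes do not say how experiment.run() numbers them.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_with_retries(
    fn: Callable[[], T],
    max_retries: int,
    retry_delay: Callable[[int], float],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Hypothetical helper (not part of ddtrace): call fn, retrying on failure."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            # Assumption: attempts are numbered from 0.
            sleep(retry_delay(attempt))
            attempt += 1


# Demo: a task that fails twice and then succeeds.
# Delays are recorded in a list instead of actually sleeping.
calls = {"n": 0}
delays: List[float] = []


def flaky_task() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = run_with_retries(
    flaky_task,
    max_retries=3,
    retry_delay=lambda attempt: 2 ** attempt,
    sleep=delays.append,
)
print(result, delays)
```

Under these assumptions the recorded delays double with each failed attempt. Passing the delay as a callable rather than a fixed number leaves the choice of schedule (constant, linear, exponential) to the caller.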