v4.6.0rc2 — Datadog dd-trace-py

New Features

AI Guard: Adds SDS (Sensitive Data Scanner) findings to AI Guard spans, enabling visibility into sensitive data detected in LLM inputs and outputs.

LLM Observability: Adds support for DeepEval evaluations in LLM Observability Experiments. Users can now pass a DeepEval metric (one that inherits from either BaseMetric or BaseConversationalMetric) as an evaluator in an LLM Observability experiment.

Example:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

from ddtrace.llmobs import LLMObs

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    async_mode=True
)

dataset = LLMObs.create_dataset(
    dataset_name="<DATASET_NAME>",
    description="<DATASET_DESCRIPTION>",
    records=[RECORD_1, RECORD_2, RECORD_3, ...]
)

def my_task(input_data, config):
    return input_data["output"]

def my_summary_evaluator(inputs, outputs, expected_outputs, evaluators_results):
    return evaluators_results["Correctness"].count(True)

experiment = LLMObs.experiment(
    name="<EXPERIMENT_NAME>",
    task=my_task,
    dataset=dataset,
    evaluators=[correctness_metric],
    summary_evaluators=[my_summary_evaluator],  # optional, used to summarize the experiment results
    description="<EXPERIMENT_DESCRIPTION>."
)

result = experiment.run()

LLM Observability: Adds experiment summary logging after run(), reporting the row count, run count, per-evaluator stats, and error counts.

LLM Observability: Adds max_retries and retry_delay parameters to experiment.run() for retrying failed tasks and evaluators. Example: experiment.run(max_retries=3, retry_delay=lambda attempt: 2 ** attempt).
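
As an illustration of what a retry_delay callable such as lambda attempt: 2 ** attempt produces, here is a minimal, self-contained sketch of retry-with-backoff. It is not ddtrace's implementation: run_with_retries and its sleep parameter are hypothetical, and the 0-based attempt index is an assumption (the release note does not say whether counting starts at 0 or 1).

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_with_retries(
    fn: Callable[[], T],
    max_retries: int = 3,
    retry_delay: Callable[[int], float] = lambda attempt: 2 ** attempt,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Hypothetical helper: call fn, retrying up to max_retries times on failure."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            sleep(retry_delay(attempt))  # assumed 0-based attempt index
            attempt += 1


# Usage: a task that fails twice, then succeeds. Delays are recorded
# instead of slept so the example finishes instantly.
delays: List[float] = []
calls = {"n": 0}


def flaky_task() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("transient failure")
    return "ok"


result = run_with_retries(flaky_task, max_retries=3, sleep=delays.append)
```

Under these assumptions the task runs three times and the recorded waits grow exponentially. Passing a callable rather than a fixed number lets callers choose constant, linear, or exponential backoff without changing the retry loop.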

Bug Fixes

AAP: Fixes a memory corruption issue where concurrent calls to the WAF on the same request context from multiple threads (e.g. an asyncio event loop and a thread pool executor inheriting the same context via contextvars) could cause use-after-free or double-free crashes (SIGSEGV) inside libddwaf. A per-context lock now serializes WAF calls on the same context.
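
The per-context locking idea can be sketched in plain Python as follows. This is an illustration only, not the ddtrace or libddwaf code: RequestContext, run_waf, and the counters are hypothetical names, and the native call is replaced by a comment.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class RequestContext:
    """Hypothetical stand-in for a per-request WAF context."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # one lock per context, not a global lock
        self.active_calls = 0
        self.max_concurrent = 0
        self.total_calls = 0

    def run_waf(self, payload: dict) -> None:
        # Only one thread at a time may enter the section that would
        # touch the native handle belonging to this context.
        with self._lock:
            self.active_calls += 1
            self.max_concurrent = max(self.max_concurrent, self.active_calls)
            self.total_calls += 1
            # ... the call into the native library would happen here ...
            self.active_calls -= 1


# Many threads share the same context, as can happen when a thread pool
# inherits the request context from the event loop.
ctx = RequestContext()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(ctx.run_waf, ({"i": i} for i in range(200))))
```

Because the lock lives on the context object, calls on different request contexts can still run in parallel. Only calls that share a context are serialized.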

tracing: Avoids pickling wrappers in ddtrace.internal.wrapping.context.BaseWrappingContext.

CI Visibility: Fixes an incompatibility with pytest-html and other third-party reporting plugins caused by the ddtrace pytest plugin using a non-standard dd_retry test outcome for retry attempts. The outcome is now set to rerun, which is the standard value used by pytest-rerunfailures and recognized by reporting plugins.

dynamic instrumentation: Fixes a RuntimeError: generator didn't yield in the Symbol DB remote config subscriber when the process has no writable temporary directory.
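
For readers unfamiliar with this error: contextlib raises RuntimeError: generator didn't yield when a @contextmanager generator finishes before reaching its yield. The sketch below reproduces that failure mode in isolation. It is not the Symbol DB subscriber code, and it does not claim to show the actual fix; scratch_dir_unsafe, scratch_dir_safe, and the available flag (a stand-in for "a writable temporary directory exists") are hypothetical.

```python
import contextlib
import tempfile


@contextlib.contextmanager
def scratch_dir_unsafe(available: bool):
    """Hypothetical context manager that bails out before yielding."""
    if not available:
        return  # generator ends without yielding -> RuntimeError on enter
    with tempfile.TemporaryDirectory() as path:
        yield path


@contextlib.contextmanager
def scratch_dir_safe(available: bool):
    """Always yields exactly once; yields None when no directory is usable."""
    if not available:
        yield None
        return
    with tempfile.TemporaryDirectory() as path:
        yield path


# The unsafe variant fails as soon as the with statement is entered.
error_message = None
try:
    with scratch_dir_unsafe(available=False):
        pass
except RuntimeError as exc:
    error_message = str(exc)

# The safe variant lets the caller decide what to do without a directory.
with scratch_dir_safe(available=False) as path:
    fallback_path = path
```

One way to avoid this class of bug is to make sure every code path through the generator reaches exactly one yield, or to raise an explicit, descriptive exception instead of returning early.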

profiling: Fixes a bug that caused certain function names to be displayed as <module> in flame graphs.