Estimated end-of-life date, accurate to within three months: 05-2027. See the support level definitions for more information.
Span.parent_id will change from Optional[int] to int in v5.0.0.
Experiments now report their execution status to the backend. The status transitions to running when execution starts, completed on success, failed when tasks or evaluators error with raise_errors=False, and interrupted when the experiment is stopped by an exception. #16713
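The status lifecycle above can be sketched as a small state machine. This is illustrative only: the states mirror the release note, but the ExperimentStatus enum and finish_status helper are hypothetical and not part of ddtrace.

```python
from enum import Enum

class ExperimentStatus(Enum):
    # Hypothetical states mirroring the statuses reported to the backend.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    INTERRUPTED = "interrupted"

def finish_status(task_errors: int, interrupted: bool, raise_errors: bool) -> ExperimentStatus:
    """Pick the terminal status for a run, per the rules described above."""
    if interrupted:
        # The experiment was stopped by an exception mid-run.
        return ExperimentStatus.INTERRUPTED
    if task_errors and not raise_errors:
        # Tasks/evaluators errored but errors were swallowed (raise_errors=False).
        return ExperimentStatus.FAILED
    return ExperimentStatus.COMPLETED
```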
Adds LLMObs.publish_evaluator() to sync a locally-defined LLMJudge evaluator to the Datadog UI as a custom LLM-as-Judge evaluation.
Adds support for DeepEval evaluations in LLM Observability Experiments: users can pass a DeepEval metric (which inherits from either BaseMetric or BaseConversationalMetric) directly as an evaluator in an LLM Obs experiment.
Example:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

from ddtrace.llmobs import LLMObs

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK",
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    async_mode=True,
)

dataset = LLMObs.create_dataset(
    dataset_name="<DATASET_NAME>",
    description="<DATASET_DESCRIPTION>",
    records=[RECORD_1, RECORD_2, RECORD_3, ...],
)

def my_task(input_data, config):
    return input_data["output"]

def my_summary_evaluator(inputs, outputs, expected_outputs, evaluators_results):
    return evaluators_results["Correctness"].count(True)

experiment = LLMObs.experiment(
    name="<EXPERIMENT_NAME>",
    task=my_task,
    dataset=dataset,
    evaluators=[correctness_metric],
    summary_evaluators=[my_summary_evaluator],  # optional, used to summarize the experiment results
    description="<EXPERIMENT_DESCRIPTION>",
)
result = experiment.run()
Adds experiment summary logging after run() with row count, run count, per-evaluator stats, and error counts.
Adds max_retries and retry_delay parameters to experiment.run() for retrying failed tasks and evaluators. Example: experiment.run(max_retries=3, retry_delay=lambda attempt: 2 ** attempt).
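The retry semantics can be illustrated with a standalone helper. This is only a sketch of exponential backoff under the release note's calling convention; retry_with_backoff is a hypothetical name, not the actual experiment runner.

```python
import time

def retry_with_backoff(fn, max_retries=3, retry_delay=lambda attempt: 2 ** attempt):
    """Call fn, retrying up to max_retries additional times.

    retry_delay maps the attempt number (starting at 0) to a sleep in
    seconds, matching the retry_delay=lambda attempt: 2 ** attempt
    example in the note above.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                # Out of retries: surface the last failure to the caller.
                raise
            time.sleep(retry_delay(attempt))
```

Passing a callable for the delay lets callers choose constant, linear, or exponential backoff without the runner hard-coding a policy.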
This introduces LLMObs.get_prompt() to retrieve managed prompts from Datadog's Prompt Registry. The method returns a ManagedPrompt object with a format()
method for variable substitution. Prompt updates propagate to running applications within the cache TTL (default: 60 seconds).
Use with annotation_context or annotate to correlate prompts with LLM spans:
prompt = LLMObs.get_prompt("greeting")
variables = {"user": "Alice"}

with LLMObs.annotation_context(prompt=prompt.to_annotation_dict(**variables)):
    openai.chat.completions.create(messages=prompt.format(**variables))
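The cache-TTL behavior described above can be sketched with a minimal time-based cache. This is illustrative only: TTLCache is a hypothetical class, not ddtrace's implementation; it merely shows why a server-side prompt edit becomes visible within one TTL window.

```python
import time

class TTLCache:
    """Caches fetched values for ttl seconds per key.

    A refetch after expiry picks up server-side updates, which is why
    prompt edits propagate to running applications within the cache TTL
    (default 60 seconds in the note above).
    """
    def __init__(self, fetch, ttl=60.0):
        self._fetch = fetch
        self._ttl = ttl
        self._store = {}  # key -> (value, fetched_at)

    def get(self, key):
        hit = self._store.get(key)
        if hit is not None and time.monotonic() - hit[1] < self._ttl:
            return hit[0]  # still fresh: no network round trip
        value = self._fetch(key)
        self._store[key] = (value, time.monotonic())
        return value
```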
Experiments propagate canonical_ids from dataset records to the corresponding experiment spans when present. The canonical_ids are only guaranteed to be available after calling pull_dataset.
LLMObs.create_dataset supports a bulk_upload parameter to control data-upload behavior. Both LLMObs.create_dataset and LLMObs.create_dataset_from_csv support a deduplicate parameter.
A subset of dataset records can now be pulled by tag using the tags argument to LLMObs.pull_dataset, provided as a list of key:value strings: LLMObs.pull_dataset(dataset_name="my-dataset", tags=["env:prod", "version:1.0"]).
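The tag filter can be illustrated with plain-Python matching. This is a sketch only: matches_tags is a hypothetical helper, and the all-tags-must-match (AND) semantics shown here are an assumption, not something the release note confirms.

```python
def matches_tags(record_tags, wanted):
    """True if the record carries every requested "key:value" tag.

    AND semantics across the listed tags are an assumption for this sketch.
    """
    return set(wanted).issubset(set(record_tags))

records = [
    {"id": 1, "tags": ["env:prod", "version:1.0"]},
    {"id": 2, "tags": ["env:staging", "version:1.0"]},
]
# Keep only records matching both env:prod and version:1.0.
pulled = [r for r in records if matches_tags(r["tags"], ["env:prod", "version:1.0"])]
```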
Fixes an issue with LLMObs.create_dataset.
Fixes an AttributeError on openai-agents >= 0.8.0 caused by the removal of AgentRunner._run_single_turn.
Fixes an issue where the gevent module was imported unnecessarily even when the profiler was not enabled.
An issue involving <module> in flame graphs has been fixed.
Fixes a TypeError raised at import time for libraries such as kopf that use union type annotations (for example, asyncio.Condition | None) at class definition time.
Adds the kafka_cluster_id tag to Kafka offset/backlog tracking for confluent-kafka. Previously, the cluster ID was only included in DSM checkpoint edge tags (produce/consume) but missing from offset commit and produce offset backlogs. This ensures correct attribution of backlog data to specific Kafka clusters when multiple clusters share topic names.
Fixes an issue where concurrent WAF calls on the same context (via contextvars) could cause use-after-free or double-free crashes (SIGSEGV) inside libddwaf. A per-context lock now serializes WAF calls on the same context.
Fixes an issue involving ddtrace.internal.wrapping.context.BaseWrappingContext.
Fixes an incompatibility with pytest-html and other third-party reporting plugins caused by the ddtrace pytest plugin using a non-standard dd_retry test outcome for retry attempts. The outcome is now set to rerun, the standard value used by pytest-rerunfailures and recognized by reporting plugins.
Fixes a RuntimeError: generator didn't yield in the Symbol DB remote config subscriber when the process has no writable temporary directory.
Fixes a RuntimeError during forks.
Promotes LLMJudge, BooleanStructuredOutput, ScoreStructuredOutput, and CategoricalStructuredOutput to the public ddtrace.llmobs module level.
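The per-context lock fix follows a general pattern: serialize calls that share mutable native state behind a lock scoped to that state, so different contexts still run in parallel. A minimal sketch of the pattern, with a hypothetical WafContext class that is not ddtrace's actual implementation:

```python
import threading

class WafContext:
    """Each context owns one lock, so concurrent calls on the same
    context are serialized while calls on different contexts remain
    parallel (the pattern behind the libddwaf crash fix above)."""
    def __init__(self):
        self._lock = threading.Lock()
        self.calls = 0

    def run(self, payload):
        with self._lock:  # serializes access to per-context state
            self.calls += 1
            return len(payload)
```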