
Datadog Blog


In AWS environments, a data perimeter is a set of preventative controls that help ensure that your trusted cloud identities (principals or AWS services acting on your behalf) are accessing trusted resources from authorized networks. You can apply these controls at various levels of your infrastructure, such as per resource or across all resources in your AWS account.

The ability to apply controls at different levels creates an effective defense-in-depth approach to protecting data, but it also makes it hard to know where gaps exist. Datadog's 2025 Cloud Security report found that approximately 40% of organizations use data perimeters, with most applying them per resource. Of that group, [fewer than 1% use recommended organization-level solutions](https://www.datadoghq.com/state-of-cloud-security/#2:~:text=SCPs%20(0.6%25%20of%20organizations), such as resource control policies (RCPs) and service control policies (SCPs).

In this post, we'll walk through examples of data perimeters configured per resource, since that's where most organizations apply them. Then we'll look at the security gaps that resource-level controls can create. In each section, we'll simulate an attack against each gap by using Stratus Red Team, an open source threat emulation tool, and then apply an organization-level policy that closes the gap.

We'll cover four scenarios:

- Visibility: ensuring that you have visibility into data perimeter activity, an important first step before applying any controls

- Identity perimeters: preventing identities outside your AWS account from accessing your data

- Network perimeters: validating that identities can only access resources from authorized networks

- Resource perimeters: ensuring that identities are not able to transfer data to unauthorized resources

All of the attack techniques described in this post require Stratus Red Team to be installed and configured with a compromised-role profile. We'll walk through configuring the necessary roles and AWS profiles later.

Ensure that you have visibility into data perimeter activity across your AWS infrastructure

While not strictly a data perimeter objective, AWS security benchmarks require that CloudTrail is enabled and configured to log read and write management events. CloudTrail trails can capture when a particular policy blocked or allowed access as well as any changes to the controls themselves, such as a bucket policy modification. Because cloud logs provide valuable insights into how your data perimeters respond to requests, tampering with logging is a popular defense evasion technique.

The Stratus Red Team techniques for deleting AWS CloudTrail trails and stopping logging altogether serve two purposes when you validate account-level logging scenarios that fall outside of established organization trails. First, they confirm whether the calls are blocked by the policies you have in place. Second, they confirm whether the attempts are visible in your account's cloud logs. When attempts are visible, your logs capture those calls the moment they are made, regardless of whether they succeed or are blocked.

The following commands set up the defense evasion scenario for a compromised role:

export AWS_PROFILE=compromised-role
stratus detonate aws.defense-evasion.cloudtrail-delete
stratus detonate aws.defense-evasion.cloudtrail-stop

The Stratus Red Team techniques for deleting or stopping cloud logging include a warmup step that uses the cloudtrail:CreateTrail permission to create new trails within the profile's associated AWS account before detonating attacks against them. CreateTrail has legitimate uses, such as centralizing logging pipelines and managing audit logs for incident response, so it's not uncommon to find it granted in real environments.

Confirm that calls are blocked by available SCPs

Most roles, such as those for application services and CI/CD pipelines, have no legitimate reason to modify CloudTrail trails, so those calls are good candidates for an SCP. You can attach SCPs to an organization root, an organizational unit (OU), or a member account, creating the perimeter boundary for control plane actions such as disabling logging. That means that an appropriate SCP will block an identity (in this case, the compromised-role profile) from executing cloudtrail:DeleteTrail and cloudtrail:StopLogging calls.

The following example SCP denies modifications to a member account's CloudTrail trails:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Deny",
         "Action":[
            "cloudtrail:DeleteTrail",
            "cloudtrail:PutEventSelectors",
            "cloudtrail:StopLogging",
            "cloudtrail:UpdateTrail"
         ],
         "Resource":"*",
         "Condition":{
            "ArnNotLike":{
               "aws:PrincipalARN":"arn:aws:iam::123456789012:role/AWSCloudTrailAdmin"
            }
         }
      }
   ]
}

The cloudtrail:PutEventSelectors action is included in the deny policy because that call would enable an attacker to narrow which events get logged without disabling logging entirely, which minimizes the likelihood of detection. The cloudtrail:UpdateTrail action accounts for an attacker redirecting log delivery to an external S3 bucket, which makes the logs inaccessible to your team. The ArnNotLike condition ensures that a specific privileged role still has access to modify trails. For the account with this SCP attached, the Stratus Red Team techniques for cloudtrail:StopLogging and cloudtrail:DeleteTrail actions would be denied.
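To reason about which callers the statement above affects, the following minimal Python sketch mimics how the Deny statement and its ArnNotLike condition combine. It is a simplification, not IAM's actual policy evaluation logic: it handles only this one statement shape, approximates ARN wildcard matching with `fnmatch`, and the `compromised-role` ARN is a hypothetical example.

```python
from fnmatch import fnmatchcase

# The Deny statement from the SCP above, reduced to the fields this sketch needs.
DENIED_ACTIONS = {
    "cloudtrail:DeleteTrail",
    "cloudtrail:PutEventSelectors",
    "cloudtrail:StopLogging",
    "cloudtrail:UpdateTrail",
}
EXEMPT_PRINCIPAL_PATTERN = "arn:aws:iam::123456789012:role/AWSCloudTrailAdmin"


def scp_denies(principal_arn: str, action: str) -> bool:
    """Return True if the simplified SCP statement would deny the call."""
    if action not in DENIED_ACTIONS:
        return False
    # ArnNotLike: the Deny applies only when the principal does NOT match the pattern.
    return not fnmatchcase(principal_arn, EXEMPT_PRINCIPAL_PATTERN)


# Hypothetical ARN for the role behind the compromised-role profile.
compromised = "arn:aws:iam::123456789012:role/compromised-role"
admin = "arn:aws:iam::123456789012:role/AWSCloudTrailAdmin"

print(scp_denies(compromised, "cloudtrail:StopLogging"))  # True
print(scp_denies(admin, "cloudtrail:StopLogging"))        # False
print(scp_denies(compromised, "cloudtrail:CreateTrail"))  # False
```

A `False` result only means that this SCP does not block the call; the principal still needs an IAM policy that allows the action. The last line also shows why the warmup step's cloudtrail:CreateTrail call is unaffected by this policy.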

Note that SCPs do not apply to your organization's management account, where you create and apply organization-level policies. A compromised principal in that account can call cloudtrail:StopLogging regardless of your SCPs, which is why access to the high-privilege management account needs to be treated as a separate security risk.

Confirm that attempts are visible in CloudTrail logs

Even if an attempt to modify cloud logging is blocked, you want to confirm that CloudTrail recorded each event. The attempt itself is suspicious since logging is a required security benchmark. Because the Stratus Red Team techniques act on the trails they create within a member account, your organization-level trail should capture the detonation events, whether the attempt succeeds or is blocked by an SCP.
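As a rough illustration of that check, the sketch below scans a list of CloudTrail-style records for trail-tampering calls and notes whether each one was blocked. The field names (`eventSource`, `eventName`, `errorCode`, `userIdentity.arn`) follow the CloudTrail record format, but the sample records are hypothetical, and treating an `AccessDenied` error code as "blocked" is an assumption you should verify against your own logs.

```python
TAMPERING_EVENTS = {"DeleteTrail", "StopLogging", "UpdateTrail", "PutEventSelectors"}


def find_tampering_attempts(records):
    """Return one summary per trail-tampering call found in the records."""
    attempts = []
    for record in records:
        if record.get("eventSource") != "cloudtrail.amazonaws.com":
            continue
        if record.get("eventName") not in TAMPERING_EVENTS:
            continue
        attempts.append({
            "eventName": record["eventName"],
            "principal": record.get("userIdentity", {}).get("arn", "unknown"),
            # Assumption: a denied call is recorded with errorCode "AccessDenied".
            "blocked": record.get("errorCode") == "AccessDenied",
        })
    return attempts


# Hypothetical records, trimmed to the fields this sketch reads.
role_arn = "arn:aws:sts::123456789012:assumed-role/compromised-role/stratus"
sample_records = [
    {"eventSource": "cloudtrail.amazonaws.com", "eventName": "StopLogging",
     "errorCode": "AccessDenied", "userIdentity": {"arn": role_arn}},
    {"eventSource": "cloudtrail.amazonaws.com", "eventName": "DeleteTrail",
     "userIdentity": {"arn": role_arn}},
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
     "userIdentity": {"arn": role_arn}},
]

for attempt in find_tampering_attempts(sample_records):
    print(attempt["eventName"], "blocked" if attempt["blocked"] else "succeeded")
```

Both outcomes are worth alerting on: a blocked call shows that the SCP held, while a successful one shows a gap in coverage.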

You can read AWS's guide on managing CloudTrail costs and our guide on configuring AWS CloudTrail logs for more information about how to collect them and which events are important to capture.

Enforcing an identity perimeter on your AWS account

One of the control objectives for the identity perimeter is ensuring that only your organization's principals and the AWS services acting on their behalf can access your resources. According to our report, the majority of organizations with data perimeters enforce this control primarily through policies on individual Amazon S3 buckets. Per-resource policies offer granular control, but they introduce gaps in policy coverage as your environment grows. They also raise concerns about policy durability, because any principal with the right permissions can modify a resource's policy.

The following bucket policy illustrates how organizations typically start building per-resource identity controls for a log bucket:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"AllowCloudTrailWrite",
         "Effect":"Allow",
         "Principal":{
            "Service":"cloudtrail.amazonaws.com"
         },

Azure Managed Redis is a Microsoft first-party, fully managed in-memory data store, replacing Azure Cache for Redis tiers. It includes Redis Enterprise features such as RediSearch for vector search and full-text search, in addition to RedisJSON, RedisTimeSeries, and Active Geo-Replication. As Azure Cache for Redis reaches end of life, more teams are planning migrations to Azure Managed Redis in search of better performance, lower cost, and modern capabilities for AI and real-time workloads.

Cache migrations are deceptively risky. Redis often handles latency-sensitive user requests, so an undersized cache, a misconfigured client, or a forgotten dependent service can result in immediate user-visible slowness or outages. Two things make the difference between a clean migration and an unsuccessful one: a single observability layer covering both sides of the cutover, and a reliable way to move data and traffic. Together, Datadog and Eden meet both requirements.

Datadog provides unified observability across the legacy cache and the new Managed Redis deployment, helping teams capture premigration performance, monitor the cutover, and validate performance after migration. Eden provides the app-aware execution layer for data replication, traffic routing, dual-write consistency, and instant rollback.

In this post, we'll explore how you can use Datadog and Eden to:

  • Establish performance baselines for the legacy cache before you migrate

  • Move the data and shift traffic with zero downtime

  • Monitor the cutover with side-by-side dashboards

  • Validate parity between the legacy cache and the new cache

  • Keep your Azure Managed Redis cache healthy after the migration

Establish performance baselines for the legacy cache before you migrate

The first migration question is also the hardest: Which Managed Redis configuration should you choose, and how many shards do you actually need? Guessing can lead to an underprovisioned cache that regresses application latency or an overprovisioned cache that wastes the cost savings the migration was supposed to deliver.

Datadog's existing Azure Cache for Redis integration and Redis integration capture the data needed to size the new cache. Use the out-of-the-box (OOTB) dashboards to extract baselines from a representative window, ideally one that includes a peak traffic event. Focus on the following metrics:

  • Peak operations per second: indicates the throughput tier you need

  • p99 cache latency: establishes the latency target that the new cache should match or improve on

  • Working set size (used memory at steady state): tells you the minimum memory configuration for the new instance

  • Hit rate: confirms how dependent your application is on cache effectiveness, which sets the bar for postmigration validation

  • Connection counts and connection churn: guide cluster sizing and reveal client pools that might need tuning before cutover

Save the baselines as Datadog notebooks so that you can refer back to them during and after the migration. You can use the same numbers to size the new instance and choose the right migration strategy. For example, a high-traffic, customer-facing system often benefits from a gradual rollout rather than a quick cutover.
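To make the baseline step concrete, here is a minimal sketch of how the listed metrics could be summarized once they are exported from a representative window. The sample data and field names (`ops_per_sec`, `latency_ms`, `used_memory_mb`, `hits`, `misses`) are hypothetical rather than actual Datadog metric names, and using the median of used memory as the steady-state working set is an assumption.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def summarize_baseline(samples):
    """Reduce per-interval samples to the numbers used for sizing."""
    hits = sum(s["hits"] for s in samples)
    misses = sum(s["misses"] for s in samples)
    return {
        "peak_ops_per_sec": max(s["ops_per_sec"] for s in samples),
        "p99_latency_ms": percentile([s["latency_ms"] for s in samples], 99),
        # Assumption: the median of used memory approximates the steady-state working set.
        "working_set_mb": percentile([s["used_memory_mb"] for s in samples], 50),
        "hit_rate": hits / (hits + misses),
    }


# Hypothetical samples standing in for an exported baseline window.
samples = [
    {"ops_per_sec": 1000 + 10 * i, "latency_ms": i + 1,
     "used_memory_mb": 4000 + (i % 5), "hits": 95, "misses": 5}
    for i in range(100)
]

baseline = summarize_baseline(samples)
print(baseline)
```

Keeping the summary as a plain dictionary makes it easy to paste the same numbers into a notebook and reuse them as targets during validation.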

You should also create an anomaly-based monitor on the legacy cache during the planning phase. If a load test or a misbehaving deployment spikes traffic in a way that would invalidate your sizing assumptions, you want to know before you begin the migration.

Move the data and shift traffic with zero downtime

With your baselines captured, the next step is to use Eden to handle the data-movement aspect of the migration. Eden's migration layer, Exodus, sits in front of both caches. You don't need to write replication code, modify your applications, or schedule a maintenance window, so you can migrate with zero downtime. With Exodus, you can:

  • Replicate data from the legacy cache to the new Managed Redis instance so that the new cache is ready to serve traffic

  • Choose a cutover strategy (canary, blue/green, tenant-by-tenant, or big bang) and let Exodus shift traffic on your schedule

  • Mirror writes to both caches during cutover so that they stay in sync

  • Throttle adaptively so that neither the legacy cache nor the new cache is overwhelmed during peak load

  • Compare responses in replicated read mode to catch feature or module incompatibilities before users notice them

  • Roll back instantly at any stage if a regression occurs

Monitor the cutover with side-by-side dashboards

While Exodus runs the cutover, you can use Datadog to catch regressions early and confirm that dependent services stay healthy as traffic shifts. Exodus runs both caches in parallel and shifts traffic from the legacy cache to the new Managed Redis instance gradually, so you can watch the new cache pick up real production traffic before you fully commit to it.

The risk in this stage is not the data movement; it's the applications that depend on the cache. In a microservices environment, dozens of services often share a single cache. A forgotten configuration change can leave one service pointed at the legacy endpoint while everything else has shifted to the new cache, producing inconsistent behavior that is difficult to diagnose later.

Datadog's Software Catalog and APM traces help you identify dependent services and verify that each one migrates correctly. Filter the Software Catalog by your existing cache resources to see the upstream and downstream services that connect to it, along with the request volume that each service generates. Use this list as your cutover checklist. As you point each service to the new Managed Redis endpoint, watch for trace metrics from the new cache resource to appear in APM. Confirm that the legacy resource's traffic drains toward zero.

During cutover, display the metrics from both caches side by side in the same Datadog dashboard. Tag each environment clearly (for example, cache:legacy and cache:managed-redis) so that a single widget can show the new cache's hit rate climbing as the legacy cache's traffic drains. If the new cache begins to diverge from the legacy cache's baselines during cutover, the side-by-side view helps you identify regressions quickly and decide whether to roll back the migration.

To make rollback decisions automatic, configure short-lived cutover monitors specifically for the migration window:

  • Latency divergence monitor: Alert if p99 latency on the new cache trends meaningfully higher than the legacy baseline. Catches sizing or networking issues early enough to roll back.

  • Error-rate divergence monitor: Alert if the new cache's error rate exceeds the legacy cache's error rate by more than a small margin. Often signals a client compatibility issue.

  • Hit-rate divergence monitor: Alert if the new cache's hit rate falls materially below the legacy baseline after a representative volume of traffic has shifted. Catches working-set or time-to-live (TTL) drift before users feel it.

Delete these monitors after the cutover is complete.
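The comparison logic behind those three monitors can be sketched as follows. This is not the Datadog monitor API; it only illustrates the divergence checks, and the threshold values are hypothetical placeholders for whatever "meaningfully higher", "small margin", and "materially below" mean for your workload.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    p99_latency_ms: float
    error_rate: float  # fraction of requests that failed
    hit_rate: float    # fraction of reads served from the cache


# Hypothetical thresholds; tune them against your own baselines.
MAX_LATENCY_RATIO = 1.2        # new p99 may be at most 20% above legacy
MAX_ERROR_RATE_MARGIN = 0.005  # new error rate may exceed legacy by 0.5 points
MAX_HIT_RATE_DROP = 0.05       # new hit rate may trail legacy by 5 points


def cutover_alerts(legacy: CacheStats, new: CacheStats) -> list:
    """Return the names of the divergence checks that the new cache fails."""
    alerts = []
    if new.p99_latency_ms > legacy.p99_latency_ms * MAX_LATENCY_RATIO:
        alerts.append("latency_divergence")
    if new.error_rate > legacy.error_rate + MAX_ERROR_RATE_MARGIN:
        alerts.append("error_rate_divergence")
    if new.hit_rate < legacy.hit_rate - MAX_HIT_RATE_DROP:
        alerts.append("hit_rate_divergence")
    return alerts


legacy = CacheStats(p99_latency_ms=2.0, error_rate=0.001, hit_rate=0.95)
healthy = CacheStats(p99_latency_ms=2.1, error_rate=0.001, hit_rate=0.94)
regressed = CacheStats(p99_latency_ms=3.0, error_rate=0.02, hit_rate=0.80)

print(cutover_alerts(legacy, healthy))    # []
print(cutover_alerts(legacy, regressed))  # all three checks fail
```

Comparing against the legacy values rather than fixed numbers keeps the checks meaningful even if overall traffic changes during the migration window.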

Validate parity between the legacy cache and the new cache

The migration is not finished the moment that traffic shifts. The real test is whether the new cache holds up across a full traffic cycle, including peak hours, batch jobs, and any usage patterns that the legacy cache was tuned for.

Compare the metrics from the Azure Managed Redis Overview dashboard against the baseline values that you captured in the planning stage:

  • Latency at p95 and p99 should match or improve on the legacy baselines. If it regresses, check server_load and percent_processor_time to confirm whether the gap is saturation or configuration.

  • Hit rate should be at or above the baseline after the new cache has warmed for one or two traffic cycles. A persistent gap usually means that TTL values need tuning or that a working set has grown beyond the cache configuration.

  • Used memory percentage and eviction rate should both be well below the legacy baselines if you sized the new cache correctly. A high eviction rate is the clearest sign that you need to scale up before regressions cascade.

Once you validate that the new cache matches the legacy deployment in performance and behavior, you can decommission the legacy cache. If you want a final side-by-side check before you decommission the cache, Datadog dashboards show both of the cache environments and their dependent applications in the same view throughout the validation window.

Keep your Azure Managed Redis cache healthy after migration

With the migration complete, the focus shifts to keeping the new cache healthy as it absorbs the full weight of production traffic. Dashboards are useful while a human is watching them, but monitors protect the cache the rest of the time. Configure monitors that cover the four most common issues that affect Redis cache performance:

  • Saturation: Alert on sustained high server load or CPU utilization, the clearest leading indicator of latency regressions.

  • Efficiency: Alert when the hit-to-miss ratio drops below your target, which usually means that TTL values need tuning or a working set has outgrown the cache configuration.

  • Memory pressure: Alert when memory utilization climbs and evictions begin so that you can investigate before the cache evicts frequently accessed data.

  • Availability: Alert on geo-replication health for any cache in an Active Geo-Replication group so that you can address issues before a failover exposes them.

When an alert fires, Datadog provides the troubleshooting surface to resolve it. The Managed Redis dashboard, APM traces from the applications calling the cache, and logs from surrounding services all live in the same workspace. As a result, you can cross-reference a saturation alert against the deployments, traffic spikes, or upstream issues that might have caused it.

Rely on Datadog and Eden for your Azure Managed Redis migration

For teams moving from Azure Cache for Redis to Azure Managed Redis, Datadog and Eden together support the full migration process. Datadog provides observability across the legacy cache, the new instance, and the applications that depend on either. Eden, meanwhile, moves the data and the traffic with zero downtime. As an official Microsoft Cloud Adoption Framework partner, Datadog supports the full Azure migration life cycle by helping teams apply consistent observability practices across Redis, compute, storage, and other Azure services.

To learn more, read Datadog's Azure Managed Redis integration documentation. To start a Redis migration with Eden, see Eden's Redis migration page.

If you're new to Datadog, you can sign up for a 14-day free trial to start monitoring your Azure Managed Redis caches.

Spark jobs only get more expensive and harder to debug as they scale. It's a problem we've run into ourselves. Our Referential Data Platform team builds and maintains the knowledge graph that maps relationships between customers' observability entities. ServiceQueryEdge is at the center of that graph, mapping service entities to their associated metric and log queries. It runs daily across seven data centers, with individual partitions processing up to 27 TB of input and 16 billion records. At that scale, we were averaging $1.5k of infrastructure costs daily, with each run taking over 17 hours.

AI agents seemed like a natural fit for this problem. They're good at reasoning over code, connecting symptoms to root causes, and generating hypotheses quickly. But an agent working from code alone is still guessing. It needs to know what's actually slow.

In this post, we'll walk through how we used Datadog's Data Observability Jobs Monitoring and an AI agent built on Claude to debug and optimize ServiceQueryEdge. We'll cover what worked, what didn't, and the specific changes that cut our daily compute costs by 44% and reduced run duration by 60% in US1, our largest data center.

Closing the gap between Jobs Monitoring and the codebase

To understand where inefficiencies are, we rely on Jobs Monitoring with the Spark SQL Plan to get a visual, interactive representation of the full execution plan. However, even with that visibility, correlating a slow operator in the SQL Plan back to the relevant section of application code can still take time, particularly for a large, complex job like ServiceQueryEdge.

To speed up debugging, we built an AI agent to surface bottlenecks across the execution graph and suggest fixes. We created a custom prompt structure that ingests the same data shown in Jobs Monitoring, such as stage metrics, the SQL execution plan, and telemetry data, alongside the source code. This allows the agent to perform correlation work that would usually fall on one of the team's engineers, saving hours of manual investigation. For every issue the agent flags, the engineer lands directly at the relevant node with context on why it matters.

Getting signal from noise: scoping data for AI-assisted debugging

At first, we ran into problems with Claude depleting its context while making Model Context Protocol (MCP) calls through our Datadog MCP Server to collect Spark data from Jobs Monitoring. The agent pulled job run telemetry data, represented as traces, using the get_datadog_trace, apm_search_spans, and apm_explore_trace tools. Multiple runs made the problem worse. The agent exhausted its context window before completing meaningful analysis. Suggestions became incomplete or incoherent.

We alleviated this by using subagents that delegated the acquisition of specific information into targeted tasks, preserving context for the analysis work that actually mattered. Agent output quality depended less on data volume than on how precisely that data was scoped.

However, the agent's initial suggestions didn't work. Many recommendations were either off target or addressed symptoms rather than root causes. For example, the agent suggested pruning column reads to reduce the amount of data read, which was redundant because Spark had already handled that optimization.

Our first instinct was to try reducing the noise by feeding deeper embeddings of Spark runtime info and metrics from Jobs Monitoring into the agent. The additional context helped the agent identify more issues and generate more recommendations. But false positives also increased significantly.

Our next step was to add another subagent that filtered out issues we deemed irrelevant or perfunctory. The validator works as a grader rather than a generator. It pulls the same context as the main agent (job health and the SQL plan), but its job is to find reasons a proposed fix won't work.

Optimization #2: Apply spam filter before the main join (line 766)

  Contraindication Checks:
  1. ❌ Does the spam filter actually reduce significant rows? — The LeftAnti join outputs 3,406,939,340 rows — same as the left input (3,406,939,340 from the Exchange). This means the 628 spammy metrics contribute near-zero rows to the 3.4B total. The spam filter removes metrics by (org_id, metric_id) but the main join key is (org_id, metric_name). Even if these 628 metrics were removed earlier, the 3.4B metrics-side of the main join would barely decrease.
  2. ✅ Is the filter already applied before the main join? — Looking at the data flow: metricsWithTagsRaw → spam filter → metricsWithTags → salt → metricsWithTagsSalted → main join at line 766. The spam filter IS already applied before the main join in the code.
  3. ❌ Would earlier filtering help? — The spam filter requires a groupBy($"org_id", $"metric_id").agg(countDistinct(...)) over the full metricsWithTagsRaw. This aggregation itself is expensive. Moving it "earlier" doesn't change the fact that you need the full dataset to compute the counts.

The validator's output looks like a peer review. For each suggestion, it lists checks using metrics from Jobs Monitoring and the code. If most checks fail, we discard the suggestion before wasting engineering time. The validator caught that many of the agent's top suggestions came down to implementing optimizations Spark was already doing automatically.
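The discard rule described above can be expressed in a few lines. The sketch below is an assumption about how such a rule might look, not our validator's actual implementation: the `Check` and `Suggestion` types are hypothetical, and "most checks fail" is interpreted as a strict majority.

```python
from dataclasses import dataclass


@dataclass
class Check:
    question: str
    passed: bool


@dataclass
class Suggestion:
    title: str
    checks: list


def should_discard(suggestion: Suggestion) -> bool:
    """Discard a suggestion when a strict majority of its checks fail."""
    failed = sum(1 for check in suggestion.checks if not check.passed)
    return failed > len(suggestion.checks) / 2


# The three checks from the validator output above, with their pass/fail marks.
spam_filter = Suggestion(
    title="Apply spam filter before the main join",
    checks=[
        Check("Does the spam filter actually reduce significant rows?", False),
        Check("Is the filter already applied before the main join?", True),
        Check("Would earlier filtering help?", False),
    ],
)

print(should_discard(spam_filter))  # True: two of three checks failed
```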

From theory to improvement: What actually worked

With the validator in place, we concentrated on the issues that passed scrutiny. Three optimizations proved most impactful: salting, join reordering, and broadcast hints.

Optimization 1: Salting

The SQL plan showed a 65.5% skew ratio on a large join. Salting is a technique to artificially add randomness to join keys to distribute data more evenly. It's a well-known mitigation but can introduce overhead.

The agent and validator had a back-and-forth about this one. The agent correctly identified skew on the service-metrics join. The validator confirmed that this specific join was neither broadcast nor pre-partitioned, so skew was a real problem. The implementation was standard: add salt values to both sides of the join on the service side, which already had ~1TB of memory at that stage of execution.

Result: 24% reduction in executor time on that stage
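To show the mechanics of salting without a Spark cluster, the following pure-Python sketch joins a skewed dataset with and without salted keys. The data, the `NUM_SALTS` value, and the helper names are hypothetical, and the largest key group stands in for the slowest task in a shuffle join. It is not the ServiceQueryEdge implementation.

```python
from collections import Counter
from itertools import product

NUM_SALTS = 8  # hypothetical salt factor; tune to the observed skew

# Hypothetical skewed input: one hot service key owns 900 of 1,000 rows.
metrics = (
    [("svc-hot", f"metric-{i}") for i in range(900)]
    + [("svc-a", f"metric-a{i}") for i in range(50)]
    + [("svc-b", f"metric-b{i}") for i in range(50)]
)
services = [("svc-hot", "team-1"), ("svc-a", "team-2"), ("svc-b", "team-3")]


def hash_join(left, right):
    """Inner join two lists of (key, value) pairs on key."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in lookup.get(key, [])]


def largest_key_group(rows):
    """Rows sharing the most common join key: a proxy for the slowest task."""
    return max(Counter(key for key, _ in rows).values())


# Skewed side: append a salt to each key. Round-robin keeps the sketch
# deterministic; a random salt per row is the more common choice.
salted_metrics = [((key, i % NUM_SALTS), value) for i, (key, value) in enumerate(metrics)]
# Other side: replicate every row once per salt so each salted key finds its match.
salted_services = [((key, salt), value)
                   for (key, value), salt in product(services, range(NUM_SALTS))]

plain = hash_join(metrics, services)
salted = [(key, lv, rv)
          for (key, _salt), lv, rv in hash_join(salted_metrics, salted_services)]

print(largest_key_group(metrics))         # 900
print(largest_key_group(salted_metrics))  # 113
print(sorted(plain) == sorted(salted))    # True
```

The join result is unchanged, but the hot key's rows are spread across eight smaller groups. The cost is that the non-skewed side grows by a factor of `NUM_SALTS`, which is the overhead that makes salting worth validating before you apply it.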

Optimization 2: Join reordering

The execution plan had a chain of joins where smaller intermediate results were being joined last, after building up massive datasets first. The agent suggested reordering to put smaller-cardinality joins earlier in the chain.

Here the validator was skeptical. It checked whether the join keys changed and whether reordering would trigger repartitions. Since these joins shared keys, reordering was safe.

Result: 15% reduction in overall execution time

Optimization 3: Broadcast hints

For one join involving a 500 MB mapping table, the agent recommended a broadcast hint to send the small side to all executors instead of shuffling both sides. The validator confirmed the size threshold was appropriate and the join type supported broadcasting.

Result: 8% reduction in executor time on that specific stage

Combined, these three changes cut daily infrastructure costs by 44% (from $1.5k to roughly $840) and reduced job runtime from 17 hours 20 minutes to about 7 hours in our largest data center. Other improvements came incrementally but summed meaningfully.
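As a quick check on the arithmetic, the snippet below recomputes both percentages from the figures in the preceding paragraph. Because the after values are approximate ("roughly $840" and "about 7 hours"), the results are approximate as well.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


cost_reduction = percent_reduction(1500, 840)               # daily cost in dollars
runtime_reduction = percent_reduction(17 * 60 + 20, 7 * 60) # runtime in minutes

print(f"cost: {cost_reduction:.1f}%")        # cost: 44.0%
print(f"runtime: {runtime_reduction:.1f}%")  # runtime: 59.6%
```

The runtime figure rounds to the 60% reduction in run duration cited for US1.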

What we learned

Data scope matters more than volume. At the beginning, we pumped massive amounts of trace data into the agent, and it wasn't helping. Once we narrowed the scope to slow operators, stage metrics, and the surrounding code, suggestions became concrete and actionable.

Validation is essential. The biggest breakthrough came from adding a second agent to validate claims. Engineers spend a lot of time down rabbit holes, chasing optimizations that don't help. A validator that reasons backward from "would this fix actually improve the metric?" saves that time.

Jobs Monitoring provides the critical context. SQL plans, stage metrics, and executor metrics are what let an agent make informed recommendations. Without them, an agent is guessing based on code patterns alone. With them, it can connect symptoms in the runtime to their causes in the codebase.

Agentic AI is most useful for the correlation work. We didn't use the agent to write optimized code. We used it to surface possibilities that the team then evaluated and refined. That's where the leverage is: finding the high-value problems and putting the right person in front of the data to solve them.

For organizations running large-scale Spark workloads, the combination of Jobs Monitoring and agentic AI can compress what would normally be hours or days of manual profiling and debugging into a workflow where an engineer stays focused on evaluation and impact.

If you're running Spark jobs and want to see bottlenecks the way we do, get started with Datadog's Data Observability or explore how Bits AI Agents can help with troubleshooting.

Latest
Jun 1, 2026
Tracking since Jul 9, 2015