Pulumi Insights gives you visibility and governance across your entire cloud footprint: discovery scans catalog every resource in your cloud accounts, and policy evaluations continuously enforce compliance against those resources. Until now, Insights workflows ran exclusively on Pulumi-hosted infrastructure. That works well for many teams, but enterprises with strict data residency requirements, private network constraints, or regulatory obligations need to run this work in their own environments. Today, Pulumi Insights supports customer-managed workflow runners for both SaaS Pulumi Cloud and self-hosted Pulumi Cloud installations.
Insights provides two complementary capabilities that together form a governance lifecycle for your cloud infrastructure.
Discovery scans cloud accounts across AWS, Azure, GCP, and more to catalog every resource regardless of how it was provisioned: Pulumi, Terraform, CloudFormation, or manual creation. Once cataloged, you can search, filter, group, and export your resource data. You can also import unmanaged resources into Pulumi to bring them under IaC management.
Policy enforces compliance with policy-as-code written in TypeScript or Python. Pulumi ships pre-built compliance packs for CIS, NIST, PCI DSS, HITRUST, and other frameworks so you can start evaluating without writing any code. Audit policy groups continuously evaluate all discovered resources and IaC stacks, while preventative policies block non-compliant deployments before they reach production.
This enables you to map out your cloud estate, evaluate compliance, and then remediate any issues uncovered by policy.
Running Insights on your own infrastructure with customer-managed workflow runners gives you:
Data residency: Scan execution and policy evaluation run entirely within your private network.
Private infrastructure access: Scan resources in VPCs and environments that are not accessible from the public internet.
Compliance: Cloud provider credentials can stay internal to your network, meeting regulatory requirements for credential handling.
Flexible hosting: Run workflow runners on any environment that meets your needs, including Linux, macOS, Docker, and Kubernetes.
Customer-managed workflow runners are lightweight agents that poll Pulumi Cloud for pending work, execute it locally, and report results back. You can configure runners to handle specific workflow types: discovery scans, policy evaluations, deployments, or all three.
This works identically whether you use SaaS Pulumi Cloud or a self-hosted installation. The runner communicates with the Pulumi Cloud API over HTTPS, so no inbound connectivity is required, making it well suited to run in restricted network environments.
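To make the pull-only model concrete, here is a minimal sketch of an agent-initiated polling loop in Python. The client methods (`poll_for_work`, `execute`, `report_result`) are hypothetical stand-ins for the runner's HTTPS calls, not the actual Pulumi runner API:

```python
import time

# Illustrative sketch of a runner's outbound-only polling loop.
# All communication is agent-initiated; nothing connects inbound.

def run_agent(client, handled_types, poll_interval=5.0, max_iterations=None):
    """Poll for pending work, execute it locally, report results back."""
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        work = client.poll_for_work(types=handled_types)  # outbound HTTPS only
        if work is None:
            time.sleep(poll_interval)  # nothing pending; back off and retry
            continue
        try:
            result = client.execute(work)        # run the scan/evaluation locally
            client.report_result(work, result)   # push the result back
        except Exception as exc:
            client.report_result(work, {"error": str(exc)})
```

Because the loop only ever opens outbound connections, it works unchanged behind NATs, proxies, and restricted egress rules.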
Under the hood, this is powered by a distributed work scheduling system that routes activities to the right runner pool, handles lease-based execution, and recovers automatically from failures. For a deep dive on the architecture, see How We Built a Distributed Work Scheduling System for Pulumi Cloud.
If your team already uses customer-managed workflow runners for Pulumi Deployments, your existing runner pools can handle Insights workflows with no additional infrastructure.
Self-hosted Insights is available on the Business Critical edition of Pulumi Cloud. To learn more or get set up:
Self-hosted Insights documentation — configuration and setup for discovery scans and audit policy evaluations on your own infrastructure
Customer-managed workflow runners — runner installation, configuration reference, and pool management
Insights & Governance overview — full documentation for discovery and policy capabilities
Contact sales to enable self-hosted Insights for your organization
Pulumi Cloud orchestrates a growing number of workflow types: Deployments, Insights discovery scans, and policy evaluations. Some of that work runs on Pulumi’s infrastructure, and some of it runs on yours via customer-managed workflow runners. We needed a scheduling system that could handle all of these workflow types reliably across both environments. In this post, we’ll take a look at the system we built.
For our first workflow integration, Deployments, scheduling wasn’t too complicated. A deployment was queued, a worker picked it up, and it ran. The queue was purpose-built for deployments, and it worked well for that single use case. Over time, we added more sophisticated logic to handle retries, ordering, rate limiting, observability, and more.
With the launch of Insights, the number of workflow types grew. Now Pulumi Cloud manages discovery scans to catalog cloud resources and runs audit policy evaluations to continuously verify compliance. While these workflows share similarities, each type needed its own scheduling, retry logic, and failure handling.
Later we added the option for customers to run workflows on their own infrastructure using customer-managed workflow runners. As the complexity of these requirements grew, we knew that our initial approach for Deployments wasn’t going to scale. We needed a single system that could schedule any type of work, route it to the right place, and handle the messy reality of distributed execution: crashes, network failures, rate limits, and retries.
We call this the background activity system.
Why build this instead of using Amazon SQS, RabbitMQ, or one of the many existing queue libraries? We considered these options but chose to build our own for a few reasons.
Pulumi Cloud supports self-hosted installations, including air-gapped environments. We intentionally minimize external dependencies so that self-hosted customers don’t have to stand up additional infrastructure. A system built on an external queue works fine for our hosted service, but it means self-hosted customers would need to provide a compatible backend. By building on top of the database we already require, we avoid adding another system to maintain.
More importantly, queueing is only part of the problem. What we actually need is scheduling with durability. This means ensuring that remote workers don’t lose activities on restart, priority so that urgent work gets compute resources first, constraints like “only so many scans per org at a time,” structured logging for observability, and checkpointing so that long-running operations can resume after a failure.
These features can be layered onto a generic queue library, but doing so often takes more code than building them into the scheduler directly. For example, priority queues are often implemented with multiple ranked queues, but this breaks single-activity-at-a-time constraints: a second queue wouldn't see a job already running in the first one, and there's no way for producers in a distributed system to coordinate across the queues without support in the queuing system itself.
Capacity management is another area where generic queues fall short. Distributed systems need to respond dynamically to slowdowns, network interruptions, and rate limits from downstream services. These are common low-level details that every workflow type needs, and building them into the scheduling layer means individual handlers don’t have to solve them independently.
We also need structured logging that works everywhere, including on customer-managed runners behind firewalls where centralized logging services aren’t accessible.
Building this ourselves gave us a system that works with existing infrastructure and handles these requirements natively.
With that context, here are some of the constraints that shaped the design:
Pull-only agents. Customer-managed workflow runners live behind NATs, corporate proxies, and air-gapped networks. They can’t accept inbound connections, so all communication has to be agent-initiated.
Mixed execution environments. The same system needs to work for Pulumi-hosted workers (with direct access to internal systems) and customer-managed runners (communicating entirely over REST). We didn’t want to maintain two separate code paths.
Different workflow types. Deployments, Insights scans, and audit policy evaluations have different payloads and execution semantics, but they all need the same scheduling guarantees: exactly-once execution, automatic retries, failure recovery, and observability.
Automatic fault tolerance. Agents crash, networks drop, and machines get recycled by autoscalers. The system needs to detect these failures and recover without needing a person to step in.
Extensibility. We knew we’d keep adding workflow types. Adding a new one should mean writing a handler and registering it, not building new infrastructure.
At the center of the system is the background activity, a persistent, typed work unit. Each activity includes:
A type discriminator that identifies what kind of work it represents (e.g., “insights-discovery” or “policy-evaluation”)
A payload specific to that type, containing whatever data the handler needs
A routing context that determines which runner pool should execute it
Scheduling metadata like priority, activation time, and retry configuration
A status tracking where the activity is in its lifecycle
The type discriminator makes this system polymorphic. The scheduling engine doesn’t need to know what’s inside the payload. It moves activities through their lifecycle and delegates the actual work to a type-specific handler.
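As a rough illustration of that polymorphic dispatch (the field names and registry here are hypothetical, not Pulumi Cloud's actual schema):

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hedged sketch of the activity shape and handler dispatch described above.

@dataclass
class BackgroundActivity:
    type: str                  # type discriminator, e.g. "insights-discovery"
    payload: dict[str, Any]    # opaque to the scheduling engine
    context: str               # routing context, e.g. "pool-abc/insights"
    priority: int = 0
    status: str = "Ready"

HANDLERS: dict[str, Callable[[dict[str, Any]], Any]] = {}

def register_handler(activity_type: str):
    """Adding a workflow type = writing a handler and registering it."""
    def decorator(fn):
        HANDLERS[activity_type] = fn
        return fn
    return decorator

def dispatch(activity: BackgroundActivity):
    # The engine never inspects the payload; it routes purely on the type.
    return HANDLERS[activity.type](activity.payload)

@register_handler("insights-discovery")
def discover(payload):
    return f"scanned account {payload['account']}"
```

The scheduling engine's code path is identical for every workflow type; only the registered handler differs.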
Every activity follows the same lifecycle regardless of type:
```mermaid
graph LR
    Start(( )) -->|Created| Ready
    Ready -->|Leased| Pending
    Pending -->|Started| Executing
    Executing -->|Success| Completed
    Completed --> End(( ))
    Executing -->|Error| Failed
    Failed --> End
    Executing -->|Canceled| Canceled
    Canceled --> End
    Executing -->|Dependencies| Waiting
    Waiting -->|Unblocked| Ready
    Pending -->|Lease expired| Restarting
    Executing -->|Lease expired| Restarting
    Restarting -->|Re-lease| Ready
    style Start fill:#000,stroke:#000,color:#000
    style End fill:#000,stroke:#000,color:#000
```
The states fall into two groups:
Running states (work is in flight or can be resumed):
Ready: queued and eligible to be claimed by a worker
Pending: claimed by a worker, execution about to start
Executing: actively running on a worker
Waiting: parked, blocked on one or more dependency activities
Restarting: recovered after a worker failure, ready to be re-claimed
Terminal states (work is done):
Completed: finished successfully
Failed: ended with an error
Canceled: stopped before completion
New workflow types get these features automatically: scheduling, retries, dependency management, and observability.
A central challenge of any distributed work queue is preventing double-execution. If two agents try to execute the same activity simultaneously, you get duplicate work and data corruption. A central coordinator can solve this, but it becomes a single point of failure.
We use lease-based optimistic concurrency instead. This is a well-known pattern, adapted here for long-running, stateful workflows.
When an agent is ready for new work, it asks the service to lease an activity. The service atomically selects the highest-priority ready activity, assigns a lease token with an expiration time, and transitions the activity to Pending. No other agent can claim the same activity.
```mermaid
sequenceDiagram
    participant Agent
    participant Service
    Agent->>Service: Poll for work
    Note right of Service: Select highest-priority Ready activity
    Note right of Service: Atomically set lease token + expiration
    Service->>Agent: Lease (token, expiration)
    Agent->>Service: Begin execution
    Note right of Service: Transition to Executing
    Note over Agent: Work in progress...
    Agent->>Service: Renew lease
    Service->>Agent: New expiration
    Note over Agent: Work continues...
    Agent->>Service: Complete (token, result)
    Note right of Service: Transition to Completed
    Note right of Service: Archive activity
    Service->>Agent: Acknowledged
```
While executing, the agent periodically renews its lease to signal that it’s still working. If the agent crashes, loses network connectivity, or is terminated, it stops renewing. Once the lease expires, the service transitions the activity to Restarting, making it available for another agent to claim.
```mermaid
sequenceDiagram
    participant A as Agent A
    participant S as Service
    participant B as Agent B
    A->>S: Lease activity
    S->>A: Token + expiration
    Note over A: Executing...
    A--xS: Agent A crashes
    Note right of S: Lease expires
    Note right of S: Transition → Restarting
    B->>S: Poll for work
    Note right of S: Lease to Agent B
    Note right of S: Transition → Pending
    S->>B: Token + expiration
    Note over B: Agent B continues execution
```
The service doesn’t need to explicitly coordinate between workers because leases are acquired using atomic database operations. The lease expiration is the failure detector; if a lease expires, then the work needs to be rescheduled.
Pulumi Cloud supports multiple workflow runner pools. An organization might have one pool for production in us-east-1, another for staging in eu-west-1, and use Pulumi-hosted runners for development. Work needs to reach the right pool.
Each activity carries a routing context that identifies which runner pool should execute it. When a runner polls for work, it filters by its own pool identifier so that it only sees activities meant for it.
We use prefix matching for this filtering. A runner matches activities whose context starts with its pool’s identifier. This means the service can use hierarchical contexts (e.g., pool-abc/insights/scan-123) and runners will still match on the pool prefix. Cleanup is also straightforward; when a runner pool is deleted, all activities with that context prefix are bulk-canceled.
This routing mechanism works the same way regardless of workflow type, and adding a new workflow type doesn’t require changes to the routing layer.
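The prefix-matching rule itself is small enough to sketch directly (context strings are illustrative):

```python
# A runner sees an activity when the activity's routing context equals its
# pool identifier or starts with that identifier followed by "/".

def matches_pool(activity_context: str, pool_id: str) -> bool:
    return (activity_context == pool_id
            or activity_context.startswith(pool_id + "/"))

def activities_for_pool(activities, pool_id):
    """Filter a runner's view of the queue down to its own pool."""
    return [a for a in activities if matches_pool(a["context"], pool_id)]
```

Requiring the separator after the prefix avoids false matches between similarly named pools (e.g., `pool-abc` vs. `pool-abcd`), and bulk cancellation on pool deletion is the same predicate run as a delete.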
Some workflows are naturally multi-step. An Insights discovery scan might discover resources that then need policy evaluation. Rather than building a separate orchestration engine, we built dependency management into the activity system.
An activity can declare a dependency set: a list of other activities that must complete before it can run. A dependent activity enters the Waiting state when created. As its dependencies complete, the system checks whether all prerequisites are satisfied. When the last one finishes, the waiting activity transitions to Ready and enters the scheduling queue.
```mermaid
graph TD
    A["Insights Discovery<br/>Executing"] -->|depends on| B["Policy Evaluation<br/>Waiting"]
    A -->|completes| C["Insights Discovery<br/>Completed"]
    C -->|triggers| D["Policy Evaluation<br/>Ready (auto-scheduled)"]
```
This gives us a lightweight DAG of work without requiring a separate workflow engine. Dependent activities get the same guarantees as any other activity: lease-based execution, automatic recovery, and observability.
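The unblocking step can be sketched as follows (the dictionary shape is illustrative; in practice this runs against the activity records in the database):

```python
# When an activity completes, re-check every Waiting activity's dependency
# set; any whose prerequisites are all Completed transitions to Ready.

def on_completed(activities, completed_id):
    """Mark one activity Completed and unblock any dependents."""
    activities[completed_id]["status"] = "Completed"
    unblocked = []
    for aid, activity in activities.items():
        if activity["status"] != "Waiting":
            continue
        if all(activities[dep]["status"] == "Completed"
               for dep in activity["deps"]):
            activity["status"] = "Ready"  # enters the scheduling queue
            unblocked.append(aid)
    return unblocked
```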
This is where the design really pays off for customer-managed runners. The system supports two execution modes:
Direct mode runs in-process alongside the Pulumi Cloud service. Workers have low-latency access to internal systems and can process activities with minimal overhead. This is what Pulumi-hosted runners use.
Remote mode communicates over REST APIs. The runner polls for activities, leases them, executes work locally, and reports results back over HTTP. This is what customer-managed runners use. No database access, no internal network access, no inbound connectivity required.
Both modes share the same handler interface so that a workflow handler doesn’t need to know where it’s running. Whether it’s running on Pulumi’s hosted infrastructure or on a customer’s Kubernetes cluster, the handler simply processes the payload and reports a result.
Let’s walk through a concrete example. A user wants to run an Insights discovery scan on an AWS account using a customer-managed workflow runner.
```mermaid
sequenceDiagram
    participant User
    participant PC as Pulumi Cloud
    participant Runner as Workflow Runner (Customer Infrastructure)
    User->>PC: 1. Configure Insights scan for runner pool
    Note right of PC: 2. Create background activity<br/>type: insights-discovery<br/>context: runner-pool-xyz<br/>status: Ready
    Runner->>PC: 3. Poll for work (filtered by pool)
    PC->>Runner: 4. Lease activity (token + expiration)
    Runner->>PC: 5. Initialize workflow
    PC->>Runner: Return cloud credentials + job token
    Note over Runner: 6. Execute scan locally<br/>(talks directly to cloud APIs —<br/>credentials are used only on the runner)
    Runner->>PC: 7. Renew lease
    Note right of PC: Extend expiration
    Runner->>PC: 8. Report completion
    Note right of PC: 9. Mark completed<br/>Archive activity
    Note right of PC: 10. Unblock dependent activities<br/>(e.g., policy evaluation)
```
A user configures an AWS account for Insights scanning in Pulumi Cloud and assigns it to a workflow runner pool.
Pulumi Cloud creates a background activity with the type set to insights discovery, the routing context set to the runner pool, and the payload containing the account configuration.
A customer-managed workflow runner polling that pool detects new work.
The runner leases the activity, acquiring an exclusive lock via the lease token.
The runner initializes the workflow, receiving any required cloud provider credentials (e.g., resolved from Pulumi ESC (Environments, Secrets, and Configuration)) and a job token from Pulumi Cloud.
The runner executes the scan locally on the customer’s infrastructure, talking directly to the cloud provider APIs.
During execution, the runner periodically renews its lease to signal liveness.
The scan completes, and the runner reports the result back to Pulumi Cloud.
The service marks the activity as completed and archives it.
If a dependent policy evaluation activity was waiting on this scan, it automatically transitions to Ready and enters the scheduling queue, where another runner in the pool can pick it up.
This flow works the same way whether the runner is hosted by Pulumi or by the customer. The only difference is whether the execution mode is direct or remote.
Failures are expected in distributed systems. The background activity system handles them at several levels:
Lease expiration covers hard failures like agent crashes, network partitions, and machine terminations. If a lease expires, the activity moves to Restarting and becomes available for another agent to pick up.
Handler-controlled retries cover soft failures like transient API errors and rate limits. A handler can request a reschedule with a delay, putting the activity back in Ready with a future activation time.
Automatic retries provide a configurable retry budget per activity. Each activity can specify how many times it should be retried and the delay between attempts, preventing runaway retry loops.
Priority scheduling ensures urgent work gets processed first. Higher-priority activities are leased before lower-priority ones, even if the lower-priority activity has been waiting longer.
Lease renewal during slowdowns keeps the activity alive without blocking other work, even if a downstream service is slow. The agent continues renewing its lease while it waits, and the scheduler remains free to assign other activities to other agents.
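The retry-budget behavior can be sketched as a small state transition (the field names `attempts`, `max_retries`, and `activate_at` are hypothetical, not Pulumi Cloud's schema):

```python
import time

# A handler that hits a transient failure asks for a reschedule with a
# delay; the scheduler re-queues the activity with a future activation
# time until the per-activity retry budget runs out.

def reschedule(activity, delay, now=None):
    """Put an activity back in Ready with a future activation time,
    or fail it once its retry budget is exhausted."""
    now = time.time() if now is None else now
    if activity["attempts"] >= activity["max_retries"]:
        activity["status"] = "Failed"  # budget spent: no runaway retry loops
        return activity
    activity["attempts"] += 1
    activity["status"] = "Ready"
    activity["activate_at"] = now + delay  # not leasable before this time
    return activity
```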
Every activity generates a structured log of its execution, including timestamps, severity levels, and code context. Logs are stored with the activity record and are accessible via an API and admin tooling.
This is especially useful for customer-managed runners, where the service can’t directly observe the execution environment. The structured log gives operators visibility into the execution context, even when the runner is behind a firewall. Handlers can also use these logs as a progress journal, encoding checkpoints that allow a restarted activity to pick up where it left off rather than starting from scratch.
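The progress-journal idea can be illustrated with a toy handler that checkpoints each unit of work in its log and skips already-checkpointed units on restart (the log entry shape here is made up for the example):

```python
# Checkpoints are appended to the activity's structured log; a restarted
# handler replays them to resume where the previous attempt left off.

def run_scan(regions, log):
    """Scan regions, skipping any already checkpointed in the log."""
    done = {entry["region"] for entry in log
            if entry.get("kind") == "checkpoint"}
    scanned = []
    for region in regions:
        if region in done:
            continue  # this region finished before the crash; skip it
        scanned.append(region)        # ... do the actual scan work here ...
        log.append({"kind": "checkpoint", "region": region})
    return scanned
```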
Retention policies are configurable per organization and per workflow type. Completed activities can be retained for auditing or purged to manage storage, and failed activities are typically retained longer for debugging.
A generic system pays off quickly. Our initial instinct was to build targeted solutions for each workflow type. Investing in a generic activity system required more upfront design work, but now adding a new workflow type requires a fraction of the effort it would take otherwise. New workflows ship with full scheduling, retry, and observability support from day one.
Leases handle many failure modes. We evaluated several approaches for distributed work coordination, including message queues with explicit acknowledgment and coordinator-based assignment. The lease model works well because all failure modes are handled through timeouts. If an agent is running as expected, it renews. If it isn’t, the lease expires.
Keeping the execution paths symmetric requires discipline. Making the hosted and self-hosted paths share the same handler interface was a deliberate choice. It would be easy to add shortcuts for the hosted path that bypass the remote API, but resisting that temptation means that features work for both cases automatically.
The hard part isn’t running the work. Running a scan or a deployment is straightforward once you have the right credentials. The real complexity is in everything around the execution: scheduling, routing, leasing, retrying, resolving dependencies, and cleaning up. These operational concerns aren’t visible to users, but they are essential to providing a reliable experience.
Today this system powers deployments, Insights discovery scans, and policy evaluations across both Pulumi Cloud and customer-managed infrastructure. The architecture is general enough that every new workflow type we add inherits the full scheduling, routing, retry, and observability stack without additional plumbing.
If you’re interested in running workflows on your own infrastructure, check out customer-managed workflow runners. To see how Insights can help you understand and manage your cloud infrastructure, get started with Pulumi Insights.
When you manage dozens of data-loading pipelines, copying and pasting IaC configurations between them is a recipe for mishap. IAM policies can drift, naming conventions diverge, and every new source is a new opportunity to make a mistake — not to mention compound the problem of duplication. In this post, we’ll show you how you can identify and encapsulate common patterns into composable components and walk through the production lessons we’ve learned running 25+ pipelines for over three years.
If you’re loading data into Snowflake and want reusable, composable infrastructure, this post is for you. Here’s what we’ll cover:
Handling and validating GitHub webhooks with AWS Lambda
Streaming webhook payloads directly into Snowflake with Amazon Data Firehose
Wiring it all up with a reusable Pulumi ComponentResource
The companion template also includes S3 auto-ingest and batch loading patterns, which we’ll cover in upcoming posts. We also use Pulumi ESC to handle authentication to both AWS and Snowflake using OpenID Connect.
Our own Josh Kodroff wrote an excellent introduction to Snowpipe with Pulumi. This post builds on his work using the newest Snowflake and AWS provider APIs and the direct Firehose-to-Snowflake destination, which wasn’t available when Josh wrote his post. Some resource names and grant patterns will also differ if you’re comparing the two.
The following diagram shows the architecture in more detail:
GitHub sends webhook events to a Lambda Function URL.
Lambda validates the HMAC (Hash-based Message Authentication Code) signature and forwards the payload to Amazon Data Firehose.
Firehose streams records directly into Snowflake via the Snowpipe Streaming API. Data appears in Snowflake within seconds.
S3 is used only as a backup destination for failed records.
The direct Firehose-to-Snowflake destination is an AWS-native feature that works with any Snowflake account.
Start a new Pulumi Python project and choose uv for dependency management when prompted:
mkdir snowpipe-data-loading && cd snowpipe-data-loading
pulumi new python
Notice Pulumi.yaml shows uv as your selected toolchain:
name: snowpipe-data-loading
runtime:
name: python
options:
toolchain: uv
Add the provider dependencies for AWS, Snowflake, GitHub, and the Random and TLS providers:
uv add pulumi-aws pulumi-snowflake pulumi-github pulumi-random pulumi-tls
That’s it. uv creates the virtual environment and lockfile automatically, and Pulumi uses uv run under the hood to execute your program.
All examples in this post are in Python, but Pulumi supports multiple languages. You can implement the same components in TypeScript, Go, .NET, Java, or YAML.
This project needs credentials for AWS, Snowflake, and GitHub. Rather than managing long-lived secrets locally, we can use Pulumi ESC to obtain dynamic, short-lived OIDC credentials for AWS and Snowflake at runtime. When you run pulumi up, ESC exchanges a Pulumi-issued OIDC token for temporary credentials from each provider and injects them into your stack config automatically. If you prefer not to use ESC, you can set credentials directly with pulumi config set --secret.
Here is a single ESC environment that handles all three providers:
values:
aws:
login:
fn::open::aws-login:
oidc:
duration: 1h
roleArn: arn:aws:iam::123456789012:role/pulumi-esc-oidc
sessionName: pulumi-snowpipe-demo
snowflake:
login:
fn::open::snowflake-login:
oidc:
account:
user: ESC_SERVICE_USER
organizationName:
accountName:
environmentVariables:
AWS_ACCESS_KEY_ID: ${aws.login.accessKeyId}
AWS_SECRET_ACCESS_KEY: ${aws.login.secretAccessKey}
AWS_SESSION_TOKEN: ${aws.login.sessionToken}
SNOWFLAKE_USER: ${snowflake.login.user}
SNOWFLAKE_TOKEN: ${snowflake.login.token}
pulumiConfig:
aws:region: us-west-2
snowflake:organizationName: ${snowflake.organizationName}
snowflake:accountName: ${snowflake.accountName}
snowflake:authenticator: OAUTH
snowflake:role: PULUMI_DEPLOYER
github:token:
fn::secret:
github:owner:
Then reference the environment from your stack config:
# Pulumi.&lt;stack&gt;.yaml
environment:
- /my-snowpipe-env
config:
snowpipe-data-loading:database: LANDING_ZONE_WEBHOOKS
snowpipe-data-loading:environment: dev
snowpipe-data-loading:webhook-repo:
snowflake:previewFeaturesEnabled:
- snowflake_table_resource
Depending on your preferences, you can also split the credentials into separate per-provider environments, compose them with imports, and reuse them across stacks.
To set up OIDC trust for each provider, see the AWS OIDC guide and the Snowflake OIDC login guide. For GitHub authentication options (fine-grained PATs, classic PATs, or GitHub Apps), see the pulumi-github provider docs.
The direct streaming pipeline needs an S3 bucket for backup/errors, a Snowflake database, and a schema.
Amazon Data Firehose supports Snowflake as a native destination via the Snowpipe Streaming API. Firehose streams records directly into Snowflake.
The Lambda function is the entry point for GitHub webhooks. It validates the HMAC-SHA256 signature, wraps the payload in an envelope with the event type, and forwards it to Firehose. Create this as lambda/webhook_handler.py:
import hashlib
import hmac
import json
import os
import boto3
firehose = boto3.client("firehose")
STREAM_NAME = os.environ["FIREHOSE_STREAM_NAME"]
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"]
def handler(event, context):
body = event.get("body", "")
signature = (event.get("headers") or {}).get("x-hub-signature-256", "")
# Validate HMAC-SHA256 signature
expected = "sha256=" + hmac.new(
WEBHOOK_SECRET.encode(), body.encode(), hashlib.sha256
).hexdigest()
if not hmac.compare_digest(expected, signature):
return {"statusCode": 401, "body": "Invalid signature"}
github_event = (event.get("headers") or {}).get("x-github-event", "unknown")
# Wrap in envelope - newline-delimited for S3 backup where Firehose concatenates records
record = json.dumps({
"github_event": github_event,
"payload": json.loads(body),
}) + "\n"
firehose.put_record(
DeliveryStreamName=STREAM_NAME,
Record={"Data": record.encode()},
)
return {"statusCode": 200, "body": "OK"}
The envelope format {"github_event": "", "payload": {...}}\n is important. The github_event field (e.g., push, pull_request, star) comes from the x-github-event header and lets downstream queries filter by event type. The trailing newline delimits records in the S3 backup destination, where Firehose concatenates them into files.
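If you want to exercise the handler without waiting for GitHub, you can compute the same `x-hub-signature-256` value GitHub would send. This helper is not part of the pipeline; it just mirrors the validation logic for local testing:

```python
import hashlib
import hmac
import json

def sign_payload(secret: str, body: str) -> str:
    """Compute the x-hub-signature-256 header GitHub sends for a body."""
    digest = hmac.new(secret.encode(), body.encode(), hashlib.sha256).hexdigest()
    return "sha256=" + digest

# Build a fake Lambda Function URL event with a valid signature.
body = json.dumps({"action": "opened"})
event = {
    "body": body,
    "headers": {
        "x-hub-signature-256": sign_payload("my-webhook-secret", body),
        "x-github-event": "pull_request",
    },
}
```

Passing this event to the handler (with `WEBHOOK_SECRET` set to the same value) should return a 200; any tampering with the body or secret yields a 401.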
Another notable part of this setup: instead of manually creating a secret that would have to be copy-pasted into both your webhook configuration and your Lambda environment, we use random.RandomPassword to generate it and store it securely in Pulumi state.
The secret is automatically wired to both the Lambda env var and the GitHub webhook config, and it rotates cleanly if you ever need to replace it.
Data like this is normally written to and loaded from Amazon S3. But with an S3 intermediate path, you must wait for Firehose to buffer records (60 seconds), then for Snowpipe to detect the new file and load it. Total latency: about two minutes. With the direct Snowflake destination, Firehose uses the Snowpipe Streaming API to insert records as soon as they arrive, in seconds.
The direct path also removes the need for S3 event notifications, SQS queues, external stages, and pipe resources. S3 is still used, but only as a backup destination for failed records.
The component creates everything needed for the direct path: a TLS key pair for Snowflake authentication, a Snowflake service user with least-privilege grants, the landing table, a Firehose delivery stream with destination="snowflake", and a Lambda function with a public URL. Create an empty components/__init__.py so that Python treats the directory as a package.
The full component lives in components/direct_snowflake_ingestion.py:
import json
from dataclasses import dataclass
import pulumi
import pulumi_aws as aws
import pulumi_snowflake as snowflake
import pulumi_tls as tls
# In the example repository, you will find this class imported from a common library file instead
@dataclass
class ColumnDef:
"""Column definition for a Snowflake table."""
name: str
type: str
nullable: bool = True
def strip_pem_headers(pem: str) -> str:
"""Remove PEM header/footer lines, returning only the base64 content."""
lines = pem.strip().split("\n")
return "".join(lines[1:-1])
@dataclass
class DirectSnowflakeIngestionArgs:
    bucket_arn: pulumi.Input[str]
    bucket_name: pulumi.Input[str]
    database: pulumi.Input[str]
    schema_name: pulumi.Input[str]
    table_name: str
    table_columns: list[ColumnDef]
    snowflake_account_url: pulumi.Input[str]
    snowflake_role_name: str
    lambda_code: pulumi.Archive
    lambda_handler: str
    lambda_environment: dict[str, pulumi.Input[str]]
    table_comment: str = ""
    s3_prefix: str = "direct-webhooks"
    s3_backup_mode: str = "FailedDataOnly"
    buffering_interval: int = 0
    buffering_size: int = 1
    retry_duration: int = 60
    data_loading_option: str = "VARIANT_CONTENT_AND_METADATA_MAPPING"
    content_column_name: str = "CONTENT"
    metadata_column_name: str = "METADATA"


class DirectSnowflakeIngestion(pulumi.ComponentResource):
    function_url: pulumi.Output[str]
    firehose_stream_name: pulumi.Output[str]
    snowflake_user_name: pulumi.Output[str]

    def __init__(
        self,
        name: str,
        args: DirectSnowflakeIngestionArgs,
        opts: pulumi.ResourceOptions | None = None,
    ):
        super().__init__(
            "snowpipe:direct:DirectSnowflakeIngestion", name, {}, opts
        )

        # --- TLS key pair for Snowflake auth ---
        key_pair = tls.PrivateKey(
            f"{name}-keypair",
            algorithm="RSA",
            rsa_bits=2048,
            opts=pulumi.ResourceOptions(parent=self),
        )

        # --- Snowflake role, user, and grants ---
        sf_role = snowflake.AccountRole(
            f"{name}-sf-role",
            name=args.snowflake_role_name,
            opts=pulumi.ResourceOptions(parent=self),
        )
        user_name = f"FIREHOSE_{name.upper().replace('-', '_')}_USER"
        sf_user = snowflake.ServiceUser(
            f"{name}-sf-user",
            name=user_name,
            login_name=user_name,
            default_role=sf_role.name,
            rsa_public_key=key_pair.public_key_pem.apply(strip_pem_headers),
            opts=pulumi.ResourceOptions(parent=self),
        )

        # Landing table
        table = snowflake.Table(
            f"{name}-table",
            name=args.table_name,
            database=args.database,
            schema=args.schema_name,
            comment=args.table_comment,
            columns=[
                snowflake.TableColumnArgs(
                    name=col.name, type=col.type, nullable=col.nullable,
                )
                for col in args.table_columns
            ],
            opts=pulumi.ResourceOptions(parent=self),
        )

        # Grants: DB USAGE, schema USAGE, table INSERT+SELECT
        snowflake.GrantPrivilegesToAccountRole(
            f"{name}-grant-db-usage",
            account_role_name=sf_role.name,
            privileges=["USAGE"],
            on_account_object=snowflake.GrantPrivilegesToAccountRoleOnAccountObjectArgs(
                object_type="DATABASE",
                object_name=args.database,
            ),
            opts=pulumi.ResourceOptions(parent=self),
        )
        snowflake.GrantPrivilegesToAccountRole(
            f"{name}-grant-schema-usage",
            account_role_name=sf_role.name,
            privileges=["USAGE"],
            on_schema=snowflake.GrantPrivilegesToAccountRoleOnSchemaArgs(
                schema_name=pulumi.Output.all(
                    args.database, args.schema_name
                ).apply(lambda parts: f'"{parts[0]}"."{parts[1]}"'),
            ),
            opts=pulumi.ResourceOptions(parent=self),
        )
        table_name = args.table_name
        snowflake.GrantPrivilegesToAccountRole(
            f"{name}-grant-table",
            account_role_name=sf_role.name,
            privileges=["INSERT", "SELECT"],
            on_schema_object=snowflake.GrantPrivilegesToAccountRoleOnSchemaObjectArgs(
                object_type="TABLE",
                object_name=pulumi.Output.all(
                    args.database, args.schema_name
                ).apply(
                    lambda parts: f'"{parts[0]}"."{parts[1]}"."{table_name}"'
                ),
            ),
            opts=pulumi.ResourceOptions(parent=self, depends_on=[table]),
        )

        # --- Firehose IAM role (S3 backup write) ---
        firehose_role = aws.iam.Role(
            f"{name}-firehose-role",
            assume_role_policy=json.dumps({
                "Version": "2012-10-17",
                "Statement": [{
                    "Effect": "Allow",
                    "Action": "sts:AssumeRole",
                    "Principal": {"Service": "firehose.amazonaws.com"},
                }],
            }),
            opts=pulumi.ResourceOptions(parent=self),
        )
        aws.iam.RolePolicy(
            f"{name}-firehose-s3-policy",
            role=firehose_role.id,
            policy=args.bucket_arn.apply(
                lambda arn: json.dumps({
                    "Version": "2012-10-17",
                    "Statement": [{
                        "Effect": "Allow",
                        "Action": [
                            "s3:AbortMultipartUpload",
                            "s3:GetBucketLocation",
                            "s3:GetObject",
                            "s3:ListBucket",
                            "s3:ListBucketMultipartUploads",
                            "s3:PutObject",
                        ],
                        "Resource": [arn, f"{arn}/*"],
                    }],
                })
            ),
            opts=pulumi.ResourceOptions(parent=self),
        )

        # --- Firehose delivery stream (Snowflake destination) ---
        stream = aws.kinesis.FirehoseDeliveryStream(
            f"{name}-firehose",
            destination="snowflake",
            snowflake_configuration=aws.kinesis.FirehoseDeliveryStreamSnowflakeConfigurationArgs(
                account_url=args.snowflake_account_url,
                database=args.database,
                schema=args.schema_name,
                table=args.table_name,
                role_arn=firehose_role.arn,
                user=sf_user.name,
                private_key=key_pair.private_key_pem_pkcs8.apply(
                    strip_pem_headers
                ),
                data_loading_option=args.data_loading_option,
                content_column_name=args.content_column_name,
                metadata_column_name=args.metadata_column_name,
                s3_backup_mode=args.s3_backup_mode,
                buffering_size=args.buffering_size,
                buffering_interval=args.buffering_interval,
                retry_duration=args.retry_duration,
                snowflake_role_configuration=aws.kinesis.FirehoseDeliveryStreamSnowflakeConfigurationSnowflakeRoleConfigurationArgs(
                    enabled=True,
                    snowflake_role=args.snowflake_role_name,
                ),
                s3_configuration=aws.kinesis.FirehoseDeliveryStreamSnowflakeConfigurationS3ConfigurationArgs(
                    bucket_arn=args.bucket_arn,
                    role_arn=firehose_role.arn,
                    prefix=f"{args.s3_prefix}/backup/",
                    error_output_prefix=f"{args.s3_prefix}/errors/",
                ),
            ),
            opts=pulumi.ResourceOptions(parent=self, depends_on=[table]),
        )

        # --- Lambda function + Function URL ---
        lambda_role = aws.iam.Role(
            f"{name}-lambda-role",
            assume_role_policy=json.dumps({
                "Version": "2012-10-17",
                "Statement": [{
                    "Effect": "Allow",
                    "Action": "sts:AssumeRole",
                    "Principal": {"Service": "lambda.amazonaws.com"},
                }],
            }),
            opts=pulumi.ResourceOptions(parent=self),
        )
        aws.iam.RolePolicyAttachment(
            f"{name}-lambda-basic-execution",
            role=lambda_role.name,
            policy_arn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
            opts=pulumi.ResourceOptions(parent=self),
        )
        aws.iam.RolePolicy(
            f"{name}-lambda-firehose-policy",
            role=lambda_role.id,
            policy=stream.arn.apply(
                lambda arn: json.dumps({
                    "Version": "2012-10-17",
                    "Statement": [{
                        "Effect": "Allow",
                        "Action": ["firehose:PutRecord"],
                        "Resource": [arn],
                    }],
                })
            ),
            opts=pulumi.ResourceOptions(parent=self),
        )
        env_vars = {
            **args.lambda_environment,
            "FIREHOSE_STREAM_NAME": stream.name,
        }
        fn = aws.lambda_.Function(
            f"{name}-handler",
            runtime="python3.11",
            handler=args.lambda_handler,
            role=lambda_role.arn,
            timeout=30,
            code=args.lambda_code,
            environment=aws.lambda_.FunctionEnvironmentArgs(
                variables=env_vars,
            ),
            opts=pulumi.ResourceOptions(parent=self),
        )
        fn_url = aws.lambda_.FunctionUrl(
            f"{name}-function-url",
            function_name=fn.name,
            authorization_type="NONE",
            opts=pulumi.ResourceOptions(parent=self),
        )
        aws.lambda_.Permission(
            f"{name}-function-url-permission",
            action="lambda:InvokeFunctionUrl",
            function=fn.name,
            principal="*",
            function_url_auth_type="NONE",
            opts=pulumi.ResourceOptions(parent=self),
        )

        # --- Outputs ---
        self.function_url = fn_url.function_url
        self.firehose_stream_name = stream.name
        self.snowflake_user_name = sf_user.name
        self.register_outputs({
            "function_url": self.function_url,
            "firehose_stream_name": self.firehose_stream_name,
            "snowflake_user_name": self.snowflake_user_name,
        })
A few things to note:
TLS key pair for authentication. The component generates an RSA key pair using the pulumi-tls provider. The public key is assigned to the Snowflake service user; the private key (PKCS#8 format, base64-encoded) is passed to Firehose. No passwords or OAuth tokens are stored.
ServiceUser instead of User. Snowflake service users can’t log in interactively. They authenticate only via key pair, which is exactly what Firehose needs.
destination="snowflake" on Firehose. This tells Firehose to use the Snowpipe Streaming API rather than writing to S3. The s3_configuration block is still required, but only for backup/error records.
Immediate flushing. buffering_interval=0 and buffering_size=1 ensure records are sent to Snowflake as soon as they arrive, minimizing latency. Tune according to your needs.
Amazon Data Firehose does not connect from fixed IP addresses, so you cannot use Snowflake network policies to restrict access by IP. If your Snowflake account uses network policies, you have three options: use AWS PrivateLink (requires Snowflake Business Critical edition), allow public internet access for the Firehose service user, or switch to S3 auto-ingest via Snowpipe which does not require direct network access to Snowflake from Firehose.
With the component defined, __main__.py wires the direct ingestion pipeline:
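The ColumnDef type imported from components.snowpipe_pipeline is not shown in this post. Based on how the component consumes it (col.name, col.type, col.nullable), it is assumed to be a simple dataclass along these lines:

```python
from dataclasses import dataclass


@dataclass
class ColumnDef:
    """Assumed shape: the three attributes the component reads
    when building snowflake.TableColumnArgs."""
    name: str
    type: str
    nullable: bool = True
```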
import pulumi
import pulumi_aws as aws
import pulumi_github as github
import pulumi_random as random
import pulumi_snowflake as snowflake

from components.direct_snowflake_ingestion import (
    DirectSnowflakeIngestion, DirectSnowflakeIngestionArgs,
)
from components.snowpipe_pipeline import ColumnDef

config = pulumi.Config()
database_name = config.get("database") or "LANDING_ZONE_WEBHOOKS"
environment = config.get("environment") or "dev"

# --- Shared infrastructure ---

# S3 bucket for backup/errors
bucket = aws.s3.Bucket(
    "data-landing-bucket",
)

# Snowflake database and schema
database = snowflake.Database("demo-database", name=database_name)
schema = snowflake.Schema(
    "demo-schema",
    name="GITHUB",
    database=database.name,
)

# --- Direct ingestion pipeline ---
webhook_repo = config.require("webhook-repo")

# Snowflake account URL for Firehose configuration
snowflake_config = pulumi.Config("snowflake")
snowflake_account_url = (
    f"https://{snowflake_config.require('organizationName')}"
    f"-{snowflake_config.require('accountName')}"
    f".snowflakecomputing.com"
)

# Landing table columns: CONTENT (webhook JSON) + METADATA (Firehose metadata)
DIRECT_COLUMNS = [
    ColumnDef(name="CONTENT", type="VARIANT", nullable=True),
    ColumnDef(name="METADATA", type="VARIANT", nullable=True),
]

# Step 1: Generate webhook secret for HMAC validation
direct_webhook_secret = random.RandomPassword(
    "github-direct-webhook-secret", length=32, special=False
)

# Step 2: Direct ingestion pipeline - Lambda validates, Firehose streams to Snowflake
direct = DirectSnowflakeIngestion(
    "github-webhooks-direct",
    DirectSnowflakeIngestionArgs(
        bucket_arn=bucket.arn,
        bucket_name=bucket.bucket,
        database=database.name,
        schema_name=schema.name,
        table_name="REPOSITORY_EVENTS_DIRECT",
        table_columns=DIRECT_COLUMNS,
        table_comment="GitHub webhook events loaded via direct Firehose to Snowflake",
        snowflake_account_url=snowflake_account_url,
        snowflake_role_name="FIREHOSE_DIRECT_LOADER",
        lambda_code=pulumi.AssetArchive({
            "webhook_handler.py": pulumi.FileAsset("lambda/webhook_handler.py"),
        }),
        lambda_handler="webhook_handler.handler",
        lambda_environment={"WEBHOOK_SECRET": direct_webhook_secret.result},
    ),
)

# Step 3: GitHub webhook - sends events to the Lambda Function URL
github.RepositoryWebhook(
    "github-direct-webhook",
    repository=webhook_repo,
    configuration=github.RepositoryWebhookConfigurationArgs(
        url=direct.function_url,
        content_type="json",
        secret=direct_webhook_secret.result,
    ),
    events=["push", "pull_request", "issues", "star"],
)

# Exports
pulumi.export("webhook_url", direct.function_url)
pulumi.export("firehose_stream", direct.firehose_stream_name)
That’s the entire pipeline. One component, one GitHub webhook, one secret. The DirectSnowflakeIngestion component handles the TLS key pair, Snowflake service user, landing table, Firehose stream, and Lambda function internally, and now you can reuse this component for as many pipelines as you need.
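The lambda/webhook_handler.py referenced in the AssetArchive isn't reproduced here, but the validation it performs follows GitHub's standard scheme: GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the digest in the X-Hub-Signature-256 header. A minimal sketch of that check (the function name is illustrative, not from the repo):

```python
import hashlib
import hmac


def verify_github_signature(body: bytes, signature_header: str, secret: str) -> bool:
    """Compare GitHub's X-Hub-Signature-256 header against a locally computed HMAC."""
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information during the comparison
    return hmac.compare_digest(expected, signature_header or "")
```

The handler would reject any request where this returns False before forwarding the payload to Firehose.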
The full code for this example is available on GitHub:
[github.com/pulumi-demos/examples/tree/main/python/aws-snowflake-data-loading-real-time](https://github.com/pulumi-demos/examples/tree/main/python/aws-snowflake-data-loading-real-time)
Deploy the stack:
pulumi up
The entire stack deploys in about two minutes. Immediately after deployment, you’ll start seeing GitHub events flowing into Snowflake.
Before querying, grant the DATA_READER role to your Snowflake user:
GRANT ROLE DATA_READER TO USER <your-user>;
In production, you can manage this grant through Pulumi, manually, or automatically via SCIM provisioning from your identity provider.
No need to craft test payloads. Just interact with the test repo. Star it, push a commit, or open an issue, then wait about 30 seconds and query Snowflake using the least-privilege reader role:
USE ROLE DATA_READER;

SELECT CONTENT:github_event::STRING AS event_type,
       CONTENT:payload:repository:full_name::STRING AS repo,
       METADATA:IngestionTime::TIMESTAMP AS ingested_at
FROM LANDING_ZONE_WEBHOOKS.GITHUB.REPOSITORY_EVENTS_DIRECT
ORDER BY ingested_at DESC;
You should see rows with event types like star, push, or issues: real GitHub events flowing through the entire pipeline. The METADATA column includes Firehose metadata like IngestionTime, which you can use to track end-to-end latency.
Direct streaming is the fastest path, but two other patterns are available in the companion template for different requirements:
S3 auto-ingest via Snowpipe. Firehose buffers to S3, and Snowpipe auto-ingests new files. Latency is about two minutes. Best when you need S3 as the system of record or can’t use direct Snowpipe Streaming.
Batch loading. Your orchestrator (Airflow, Prefect, cron, etc.) runs COPY INTO on a schedule. Best for full control over timing and deduplication.
We’ll walk through both patterns in detail in upcoming posts.
Once your components are battle-tested, you can share them across teams and projects instead of copying files around.
The most straightforward approach: push your components to a Git repository with a PulumiPlugin.yaml file at the root:
runtime: python
name: snowpipe-components
version: 1.0.0
Consumers add the package to their project with pulumi package add:
pulumi package add github.com/your-org/pulumi-snowpipe@v1.0.0
Pulumi downloads the package and generates typed SDKs automatically. The consumer’s __main__.py imports your components as if they were local, but they’re versioned and pinned.
For organization-level discoverability, publish to the Pulumi Cloud Private Registry with pulumi package publish:
pulumi package publish ./schema.json
This gives you auto-generated API docs, usage tracking across teams, and cross-language SDK generation. Your Python components become usable from TypeScript, Go, and C# without any extra work. Teams browse available components in the Pulumi Cloud console, see who’s using what, and get notified when new versions are published. As an additional benefit, Neo will be able to use these components and build new pipelines in minutes from a natural language request.
ComponentResource is the key abstraction that makes this architecture scale. Instead of copying and pasting resources for each new data source, you instantiate a component with a handful of configuration parameters.
The DirectSnowflakeIngestion component in this post delivers data from GitHub webhooks into Snowflake in seconds: Lambda validates the HMAC signature, Firehose streams directly to Snowflake via the Snowpipe Streaming API, and the TLS key pair is managed entirely within Pulumi. No S3 intermediate, no SQS queues.
The component accepts pluggable Lambda handlers, so swapping GitHub for Stripe webhooks or any other source is just a matter of providing different lambda_code and lambda_environment arguments. We’ve been running this pattern in production for over three years across dozens of pipelines without significant changes to the infrastructure code.
You’ll find the complete example in the GitHub repository.
You can now control what happens when a resource fails during create, update, or delete—retry with backoff, fail fast, or handle errors in custom code. Last year, Pulumi IaC introduced the resource hooks feature, allowing you to run custom code at different points in the lifecycle of resources. Today we’re adding the onError hook so you can react when operations fail.
When a Pulumi program encounters an error while creating, updating, or deleting a resource, the operation halts and the error is reported back. Sometimes that’s not what we want—errors can be intermittent or temporary. If you’ve hit transient failures or resource-not-ready errors, the onError hook can help.
A common case is resource readiness: creating resources that depend on DNS propagation or the readiness of other servers. The program can fail simply because it ran too soon. Instead of failing, we can wait and retry. The example below shows how:
import * as pulumi from "@pulumi/pulumi";

const notStartedRetryHook = new pulumi.ErrorHook(
    "retry-when-not-started",
    async (args) => {
        const latestError = args.errors[0] ?? "";
        if (!latestError.includes("resource has not yet started")) {
            return false; // do not retry, this is another type of error
        }
        await new Promise((resolve) => setTimeout(resolve, 5000));
        return true; // retry
    },
);

const res = new MyResource("res", {}, {
    hooks: {
        onError: [notStartedRetryHook],
    },
});
import time

import pulumi


def retry_when_not_started(args: pulumi.ErrorHookArgs) -> bool:
    latest_error = args.errors[0] if args.errors else ""
    if "resource has not yet started" not in latest_error:
        return False  # do not retry, this is another type of error
    time.sleep(5)
    return True  # retry


not_started_retry_hook = pulumi.ErrorHook(
    "retry-when-not-started",
    retry_when_not_started,
)

res = MyResource(
    "res",
    opts=pulumi.ResourceOptions(
        hooks=pulumi.ResourceHookBinding(
            on_error=[not_started_retry_hook],
        ),
    ),
)
package main

import (
	"strings"
	"time"

	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		hook, err := ctx.RegisterErrorHook(
			"retry-when-not-started",
			func(args *pulumi.ErrorHookArgs) (bool, error) {
				latest := ""
				if len(args.Errors) > 0 {
					latest = args.Errors[0]
				}
				if !strings.Contains(latest, "resource has not yet started") {
					return false, nil // do not retry, this is another type of error
				}
				time.Sleep(5 * time.Second)
				return true, nil // retry
			},
		)
		if err != nil {
			return err
		}
		_, err = NewMyResource(ctx, "res", &MyResourceArgs{}, pulumi.ResourceHooks(&pulumi.ResourceHookBinding{
			OnError: []*pulumi.ErrorHook{hook},
		}))
		if err != nil {
			return err
		}
		return nil
	})
}
using System;
using System.Threading;
using System.Threading.Tasks;
using Pulumi;

class ErrorHookStack : Stack
{
    public ErrorHookStack()
    {
        var retryHook = new ErrorHook(
            "retry-when-not-started",
            async (args, cancellationToken) =>
            {
                var latestError = args.Errors.Count > 0 ? args.Errors[0] : "";
                if (!latestError.Contains("resource has not yet started"))
                {
                    return false; // do not retry, this is another type of error
                }
                await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
                return true; // retry
            });

        var res = new MyResource("res", new MyResourceArgs(), new CustomResourceOptions
        {
            Hooks = new ResourceHookBinding
            {
                OnError = { retryHook },
            },
        });
    }
}
Each time the operation fails, the hook receives the new error plus all previous attempts’ errors (newest first). The hook returns true or false to tell Pulumi whether to retry. If you return false, the program fails as normal with the most recent error. With that information, you can implement many failure models:
Use the number of errors to implement backoff—for example, wait one second on the first failure, two seconds on the second, and so on.
For known-intermittent resources, always retry once before failing.
Inspect error text to retry only for specific conditions (as in the example) and fail fast for others.
The callback runs in your language of choice, so you have full control over how failures are handled.
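For instance, the backoff idea in the first bullet can be sketched as a plain helper you would call from inside an onError hook; the function name and limits here are illustrative, not part of the SDK:

```python
import time


def should_retry_with_backoff(errors: list[str], max_attempts: int = 5) -> bool:
    """Given the accumulated error list (newest first), sleep and retry, or give up."""
    attempts = len(errors)
    if attempts >= max_attempts:
        return False  # give up; Pulumi reports the most recent error
    time.sleep(min(2 ** (attempts - 1), 30))  # 1s, 2s, 4s, ... capped at 30s
    return True
```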
This feature is fully supported in our Node, Python, Go, and .NET SDKs as of v3.219.0. For more information, see the hooks documentation.
Java and YAML do not support resource hooks.
Thanks for reading, and feel free to reach out with any questions via GitHub, X, or our Community Slack.
Getting started with GitOps can feel like trying to herd cats through a YAML factory while the factory is on fire. It’s one of those things that seems like it ought to be simple (just use Git!), but in practice is much more complex — and you may not realize how much more complex until you’re weeks or more into a project. After years of running GitOps workflows in production across dozens of clusters, I’ve collected a list of best practices that I’m hoping can save you from having to make many of the mistakes I’ve made. Think of it as the GitOps cheat sheet I wish I’d had from Day 1.
If you’re not familiar with the formal definition, the OpenGitOps project distills it into four principles:
Declarative desired state
Versioned and immutable storage
Automatic pulling
Continuous reconciliation
But those principles only define the what. This post is also about the how — the practical lessons that can make or break a GitOps implementation.
In this post, I’ll walk you through the GitOps best practices I’ve picked up from production experience, community talks, and more than a few late-night incident calls. Whether you’re just getting started with GitOps or looking to level up, these tips should help you avoid the potholes.
The GitOps mindset in a nutshell.
This is the bedrock that the principles of GitOps are built on: every piece of your environment’s desired state lives in a Git repository. No exceptions. No “I’ll just fix it real quick with kubectl edit.” No “let me patch this configmap by hand because the PR process takes too long.”
The moment you make a manual change to your cluster, you’ve created drift between what Git says the world should look like and what it actually looks like. Drift will quietly wreck your GitOps workflows if you let it.
Put everything in Git: Kubernetes manifests, Helm values, Kustomize overlays, policy rules, even your GitOps tool configuration itself.
No manual kubectl edits. If it’s not in Git, it doesn’t exist. Period. Train your team to treat direct cluster changes like touching a hot stove.
You get an audit trail for free. Git gives you a complete history of who changed what, when, and why. That’s your compliance audit trail baked right in.
Pro Tip: Enable branch protection rules on your GitOps repos from Day 1. This prevents anyone from pushing directly to main and bypassing the review process. Future you will thank present you during the next audit.
If you’re still running sequences of kubectl create, kubectl patch, and kubectl delete commands to manage your cluster, you’re not really doing GitOps yet. Declarative means you define what you want the end state to look like, not the step-by-step instructions to get there.
Think of it like ordering at a restaurant. Imperative: “Go to the kitchen, grab flour, knead dough, preheat the oven to 200 degrees, shape the pizza, add sauce…” Declarative: “One margherita pizza, please.” Let the system figure out how to make it happen.
Your manifests describe the end result. The GitOps operator reconciles reality to match.
It’s idempotent by design. Apply the same manifest ten times, get the same result. No side effects, no surprises.
Rollbacks are easier: just revert a commit. The operator sees the previous desired state and reconciles. A caveat though: this only works for stateless resources; database schema migrations, CRD version changes, persistent volume modifications, and rotated secrets don’t always revert cleanly by rolling back to a previous commit. Be sure to plan rollbacks involving stateful resources carefully.
Pro Tip: If you find yourself writing shell scripts that run a sequence of kubectl commands, stop and ask yourself: “Can I express this as a declarative manifest instead?” Nine times out of ten, the answer is yes.
Traditional CI/CD is push-based: your pipeline builds an artifact and then pushes it to the cluster. GitOps flips this. An agent running inside your cluster continuously polls a Git repository and pulls changes when it detects them.
Why does this matter? With push-based deployments, your CI system needs credentials to access your cluster. That’s a wide attack surface. With pull-based, the agent already lives inside the cluster and only needs read access to your Git repo.
Tools like ArgoCD and FluxCD run controllers inside your cluster that watch your Git repos.
Your CI pipeline never needs kubeconfig access. The agent handles deployment.
The agent doesn’t just deploy once; it continuously ensures the cluster matches the desired state in Git.
Pro Tip: You may be tempted to set your reconciliation interval to something aggressive like 1 minute so you always know exactly which version is deployed. That works for a while, but at scale (200+ applications polling every minute) you can blow through GitHub’s API rate limits (5,000 requests/hour for authenticated users) and put real pressure on the Kubernetes API server.
A better approach: set up webhook receivers. Both ArgoCD and FluxCD support incoming webhooks from GitHub, GitLab, and Bitbucket. Your Git provider pings the GitOps controller on every push, so reconciliation kicks off in seconds instead of waiting for the next poll. That alone kills most of the API rate limit pain at scale. You can then relax polling to a 5 or 10 minute fallback for the rare case where a webhook doesn’t fire.
Let me be upfront: if you’re a small team with one or two services, keeping your Kubernetes manifests next to your application source code in the same repo is fine. A monorepo is simpler to manage and one less thing to automate. Don’t split repos just because a blog post told you to.
That said, ArgoCD’s official best practices call separate repos “highly recommended,” and there are real reasons for that once you grow. Application code and deployment configuration have different lifecycles. You might bump resource limits in a Helm values file without touching a single line of app code. Or you might refactor your entire codebase without changing any deployment parameters. In a shared repo, every config tweak triggers your full CI pipeline, and ArgoCD invalidates the manifest cache for all applications in the repo on every commit, not just the ones that changed. That cache invalidation alone can become a performance problem with dozens of apps in one repo.
When config tweaks start blocking app code from getting through CI, it may be time to split.
When different teams need different access levels (not everyone who pushes app code should modify production deployment config), it’s time to split.
When you’re running microservices with independent release cadences that step on each other, it’s time to split.
If none of those apply yet, don’t split. You’ll know when the pain arrives.
Pro Tip: When you do split, a common pattern is two repos: one for application source code, one for deployment configuration (Helm charts, Kustomize overlays, environment-specific values). Some larger orgs go further with a third repo for environment overrides, following the fleet repo model from the FluxCD maintainers. Either way, pair it with image update automation (Flux Image Automation Controller or ArgoCD Image Updater) so image tag bumps across repos don’t turn into manual toil.
This is the number one anti-pattern I encounter in the wild. Teams create a dev branch, a staging branch, and a prod branch, then cherry-pick or merge between them for promotions. It sounds logical, but it’s a trap. Branches diverge, cherry-picks get missed, configs meant for dev sneak into staging or prod, and before you know it, your environments have drifted, and you can’t tell what’s actually different between them.
A folder structure like environments/dev/, environments/staging/, environments/prod/ makes differences visible in a single diff command.
Moving a change from dev to prod is just copying or updating files across directories, reviewed via a standard pull request.
You’ll never accidentally skip a commit or introduce an environment-specific change where it doesn’t belong. No more cherry-pick roulette.
One thing to note: the commonly used rendered manifests pattern actually does use per-environment branches, but not in a human-managed way. In that pattern, config changes are still committed to main (using named directories as above), and an automated CI process pushes environment YAML to read-only environment branches, which ArgoCD reads from. If you want to try it, ArgoCD’s built-in source hydrator and Kargo Render can automate the rendering for you.
To keep things DRY, you can look to a tool like Kustomize, which lets you use a shared base environment and then apply per-environment patches. Your base directory holds the common manifests, and each environment directory contains only the differences. For example:
gitops-repo/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
├── environments/
│   ├── dev/
│   │   ├── kustomization.yaml   # replicas: 1, limits: 256Mi
│   │   └── patches/
│   ├── staging/
│   │   ├── kustomization.yaml   # replicas: 2, limits: 512Mi
│   │   └── patches/
│   └── prod/
│       ├── kustomization.yaml   # replicas: 3, limits: 1Gi
│       └── patches/
This minimizes duplication while keeping environment boundaries clear.
Pro Tip: If you’re working at the IaC layer, Pulumi stack configuration solves a similar problem: each stack (dev, staging, prod) has its own config file with environment-specific values, while the program logic stays shared. It’s the same principle of “one source, per-environment overrides” applied above the Kubernetes manifest level.
Catching errors after they’ve been applied to your cluster is expensive. Catching them in CI before the merge is cheap. Shift left as hard as you can.
Your GitOps CI pipeline should validate everything it can before the change reaches your cluster: YAML syntax, Kubernetes schema validation, policy compliance, dry-run rendering.
Tools like yamllint and kubeconform catch syntax errors and schema violations before they become runtime failures. (Note: kubeval, which you’ll see in older guides, is no longer maintained — its own README points to kubeconform as the replacement.)
Run helm template or kustomize build in CI to verify that your templates render without errors.
Use OPA/Conftest or Kyverno to enforce organizational policies (no privileged containers, required labels, resource limits set) before merge.
Here’s a minimal GitHub Actions workflow that validates your GitOps manifests on every PR:
name: Validate GitOps Manifests

on:
  pull_request:
    paths: ["environments/**"]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install kubeconform
        run: |
          curl -sSL https://github.com/yannh/kubeconform/releases/latest/download/kubeconform-linux-amd64.tar.gz \
            | sudo tar -xz -C /usr/local/bin kubeconform
      - name: Validate YAML schemas
        run: |
          # find + xargs avoids relying on recursive globbing in the runner shell
          find environments -name '*.yaml' -print0 | xargs -0 kubeconform -strict
      - name: Build and verify Kustomize output
        run: |
          for env in environments/*/; do
            kustomize build "$env" > /dev/null
          done
      - name: Diff against live cluster
        run: |
          # Requires cluster credentials on the runner; `|| true` keeps the diff advisory
          kustomize build environments/staging/ | kubectl diff -f - || true
Pro Tip: Add a kubectl diff step in your CI pipeline that shows exactly what would change in the cluster if the PR were merged. This gives reviewers a concrete view of the impact, not just the YAML diff.
Using latest as your image tag in a GitOps workflow is playing fast and loose with your deployments. You have no idea what version is actually running, you can’t reliably roll back, and your Git history becomes meaningless because the same manifest could deploy completely different code depending on when it was synced.
Tag your container images with the Git commit SHA that produced them. This creates a direct link between your source code, your build, and your deployment.
Once a tag is pushed, it should never be overwritten. No re-tagging, no “oops let me push a fix to the same tag.”
Need to roll back? Just revert the commit that updated the image tag. The previous SHA still points to the exact same image it always did.
Pro Tip: Set up an admission controller or CI policy that rejects any manifest using latest or any other mutable tag. Make it impossible to deploy without a pinned version.
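Even before an admission controller is in place, a crude CI-side check helps. A sketch (illustrative, far from a full admission policy) that flags untagged or latest-tagged images in rendered manifests:

```python
import re


def mutable_image_refs(manifest_yaml: str) -> list[str]:
    """Return image references that are untagged or pinned to 'latest'."""
    refs = re.findall(r"image:\s*([^\s\"']+)", manifest_yaml)
    bad = []
    for ref in refs:
        name, _, tag = ref.rpartition(":")
        # No colon at all, a colon that belongs to a registry port (tag
        # contains '/'), or an explicit 'latest' tag: all resolve to a
        # mutable image reference.
        if not name or "/" in tag or tag == "latest":
            bad.append(ref)
    return bad
```

Failing the pipeline when this list is non-empty makes mutable tags visible at review time rather than at deploy time.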
Drift is when your cluster’s actual state doesn’t match the desired state in Git. It happens when someone runs a manual kubectl command, when an autoscaler changes a replica count, or when a CRD controller modifies a resource behind your back.
The whole point of GitOps is continuous reconciliation. Your operator should constantly compare the live state against Git and bring things back in line.
Configure your GitOps tool to automatically revert manual changes, but build your exclusion lists (the set of fields and resources you intentionally skip during reconciliation) before you turn on auto-sync, not after. Controllers like Istio, cert-manager, and Crossplane legitimately modify resources, and auto-remediating those changes creates reconciliation loops that can destabilize a cluster. Start with a conservative exclusion list and tighten it over time.
Even with auto-remediation, you want to know when drift happens. Set up alerts. It might indicate a process problem or a team member who needs coaching.
Pro Tip: Be precise about what you reconcile. HPAs, VPAs, cert-manager annotations, and external-dns records all legitimately modify resources. Use ignore rules to exclude fields that are intentionally dynamic. In ArgoCD, that looks like this:
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
    - group: admissionregistration.k8s.io
      kind: MutatingWebhookConfiguration
      jqPathExpressions:
        - .webhooks[]?.clientConfig.caBundle
Also, don’t overlook resource ordering. ArgoCD sync waves and Flux’s dependsOn let you enforce that CRDs are applied before custom resources and namespaces before workloads. Getting ordering wrong is one of the most common causes of failed syncs in production.
Progressive delivery lets you gradually roll out changes, verify they’re working, and automatically roll back if something goes wrong.
GitOps and progressive delivery fit well together. Your Git repo describes the desired rollout strategy, and controllers in the cluster execute it.
Canary deployments: send a small percentage of traffic to the new version. If error rates spike, automatically roll back.
Blue-green deployments: run two identical environments, switch traffic once the new one is verified healthy.
Argo Rollouts extends Kubernetes with canary, blue-green, and analysis-driven rollout strategies that integrate with ArgoCD.
Flagger is the Flux ecosystem equivalent, with automated canary analysis, A/B testing, and blue-green deployments using service mesh or ingress controllers.
Kargo takes it a step further by orchestrating promotions across multiple stages and environments. Instead of manually wiring up pipelines to move changes from dev to staging to prod, Kargo automates the entire promotion workflow on top of ArgoCD. I did a deep dive on GitOps promotion tools and why they belong in your toolkit in a separate video walkthrough.
**Pro Tip:** Combine progressive delivery with automated analysis. Tools like Argo Rollouts and Kargo can query Prometheus metrics during a canary and automatically promote or roll back based on error rates, latency percentiles, or custom metrics.
Reviews are great, but humans make mistakes. Policy-as-code adds an automated layer that enforces organizational rules before a change can land in your cluster. Think of it as guardrails on a mountain road: they don’t slow you down, they just keep you from going over the edge.
OPA/Gatekeeper lets you define policies in Rego that the admission controller enforces at deploy time (e.g., no containers running as root, all deployments must have resource limits).
Kyverno is a Kubernetes-native policy engine that uses familiar YAML to define and enforce policies, including mutation and generation.
Starting with Kubernetes 1.26+, you can write CEL-based ValidatingAdmissionPolicies natively without any external tooling. A lightweight option if you don’t need the full feature set of OPA or Kyverno.
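As a sketch of what that looks like, here is a minimal CEL policy requiring resource limits on Deployments (the name is illustrative, and a ValidatingAdmissionPolicyBinding is still needed to activate it):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-resource-limits   # illustrative name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    - expression: "object.spec.template.spec.containers.all(c, has(c.resources) && has(c.resources.limits))"
      message: "All containers must declare resource limits."
```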
Pulumi CrossGuard lets you write policy packs in TypeScript or Python that validate infrastructure before it’s deployed. If your IaC and GitOps layers are bridged (see the next section), CrossGuard can enforce rules across both.
Run policy checks in your CI pipeline so violations are caught during the PR, not after deployment.
**Pro Tip:** Start with a small set of high-impact policies (required labels, no privileged containers, resource limits) and expand over time. Trying to enforce 50 policies on Day 1 will create so much friction that teams will revolt.
I hear this all the time: “We use Terraform for infrastructure and ArgoCD for deployments, and they live in completely separate worlds.” Sound familiar? Most teams treat IaC and GitOps as independent workflows, but they’re really two halves of the same story.
IaC handles “Day 0,” creating the cloud resources your cluster depends on: VPCs, the Kubernetes cluster itself, IAM roles, databases, DNS zones. GitOps handles “Day 2,” managing what runs inside the cluster: applications, addons, configurations. The gap between them is where things get messy. Your Helm charts need IAM role ARNs. Your GitOps-deployed addons need to know the cluster’s OIDC provider endpoint. That metadata has to flow from the IaC layer to the GitOps layer somehow.
The gitops-bridge pattern solves this cleanly: IaC provisions cloud resources and writes metadata (role ARNs, account IDs, endpoints) into Kubernetes resources (like ConfigMaps or Secrets) that your GitOps tool can consume directly.
Keep each tool in its lane. Don’t force Terraform’s Helm provider to fight with ArgoCD over resource ownership.
Pulumi lets you define your cloud infrastructure in real programming languages (TypeScript, Python, Go), which makes it natural to compute and pass metadata to your GitOps layer. You can use stack outputs to expose values like role ARNs and cluster endpoints, then feed them into your GitOps manifests. The Pulumi Kubernetes Operator even lets ArgoCD reconcile Pulumi stacks via GitOps.
Statsig went from “1-2 devs clicking around cloud consoles with SEVs left and right” to fully self-service by using Pulumi to generate manifests and ArgoCD to deploy them.
As you grow beyond a handful of clusters, think multi-cluster from the start. Patterns like ArgoCD ApplicationSets with cluster generators and Flux’s Kustomization targeting become important for fleet management. The gitops-bridge pattern works well here because your IaC layer can register new clusters and write their metadata into Kubernetes resources, which ApplicationSets automatically pick up.
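A sketch of that pattern with an ApplicationSet cluster generator (the repo URL, path, and label are illustrative): any cluster secret labeled `gitops-bridge: "true"` is picked up automatically, and metadata written by the IaC layer flows in through annotations:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-addons
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            gitops-bridge: "true"   # registered by the IaC layer
  template:
    metadata:
      name: "{{name}}-addons"
    spec:
      project: default
      source:
        repoURL: https://github.com/example/addons   # illustrative repo
        path: charts/addons
        helm:
          values: |
            iamRoleArn: "{{metadata.annotations.iam_role_arn}}"
      destination:
        server: "{{server}}"
        namespace: addons
```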
Here’s how the gitops-bridge looks in practice. Pulumi provisions cloud resources and writes metadata into a ConfigMap that ArgoCD consumes:
```typescript
import * as k8s from "@pulumi/kubernetes";

const clusterMetadata = new k8s.core.v1.ConfigMap("cluster-metadata", {
    metadata: {
        name: "cluster-metadata",
        namespace: "argocd",
        labels: {
            "gitops-bridge": "true",
        },
    },
    data: {
        aws_account_id: "123456789012",
        cluster_name: "prod-us-east-1",
        iam_role_arn: "arn:aws:iam::123456789012:role/app-role",
        oidc_provider: "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
    },
});
```
```python
import pulumi_kubernetes as k8s

cluster_metadata = k8s.core.v1.ConfigMap(
    "cluster-metadata",
    metadata=k8s.meta.v1.ObjectMetaArgs(
        name="cluster-metadata",
        namespace="argocd",
        labels={
            "gitops-bridge": "true",
        },
    ),
    data={
        "aws_account_id": "123456789012",
        "cluster_name": "prod-us-east-1",
        "iam_role_arn": "arn:aws:iam::123456789012:role/app-role",
        "oidc_provider": "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
    },
)
```
Your ArgoCD Applications or Helm values can then reference this ConfigMap, closing the loop between IaC and GitOps without hardcoding cloud-specific values.
**Pro Tip:** If you’re using Terraform today, the gitops-bridge pattern works there too. But if you’re tired of wrangling HCL and want type-safe infrastructure code with real programming constructs, give Pulumi a look. The bridge from IaC to GitOps becomes much more natural when both sides speak a real programming language.
Michael Crenshaw from Intuit gave a talk titled “How GitOps Should I Be?” at a CNCF conference. Intuit runs 45 ArgoCD instances managing 20,000 applications across 200 clusters. The approach they took: “Be strategic, be pragmatic. GitOps most of the time.”
One of the largest GitOps adopters in the world explicitly deviates from pure GitOps when the situation calls for it. Dogmatic adherence to principles that don’t serve you is just another form of technical debt.
Intuit found that GitHub’s SLA of 20 minutes meant they couldn’t rely on GitOps for region evacuation, so they wrote a kubectl cron job that bypasses Git for failover. Pragmatic.
Massive Jenkins ecosystems don’t get rewritten overnight. Intuit’s compromise: the last step in the Jenkins pipeline calls argocd app sync. It’s push-based, yes, but the source of truth stays in Git. The Jenkins pipeline doesn’t own the desired state; it just tells ArgoCD “go reconcile now” instead of waiting for the next poll. It’s a stepping stone while teams migrate off Jenkins, not a permanent architecture.
Continuous reconciliation with self-heal (auto-reverting manual changes) is where most teams deviate. In practice, many teams start with automated drift detection and alerting (so you know when someone runs a manual kubectl command) but leave auto-remediation off until they’ve built confidence in their exclusion lists. That’s a reasonable middle ground.
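In ArgoCD terms, that middle ground is a single flag: automated sync without self-heal still applies Git changes and marks drift as OutOfSync, but doesn’t revert it (a sketch of the relevant Application fields; the prune setting is your call):

```yaml
syncPolicy:
  automated:
    prune: true
    selfHeal: false   # alert on drift; flip to true once exclusion lists mature
```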
**Pro Tip:** Document your intentional deviations. Create an ADR (Architecture Decision Record) for each case where you’ve chosen not to follow pure GitOps, explaining why and what the plan is to close the gap (if ever). This turns exceptions into conscious decisions rather than accidental shortcuts.
GitOps isn’t a silver bullet, but it’s one of the best approaches we have for managing Kubernetes at scale. The practices in this list aren’t theoretical. They’re what I’ve learned from running GitOps in production, watching talks from teams at Intuit and Statsig, and making every mistake in the book at least once.
If I had to boil it all down to one sentence: treat your GitOps repo like production code, because it is.
Start with the basics (Git as source of truth, declarative config, pull-based delivery) and layer on the advanced practices as your team matures. Don’t try to implement all 12 on Day 1. Pick the three that address your biggest pain points and iterate from there.
And remember, even the largest GitOps adopters in the world aren’t doing it perfectly. They’re being pragmatic. You should be too.
Several of the practices in this post touch on problems Pulumi is built to solve. If you’re bridging IaC and GitOps, managing environment-specific configuration, or enforcing policy across both layers, here’s where to start:
Pulumi IaC: Define cloud infrastructure in TypeScript, Python, or Go and use stack outputs to feed metadata into your GitOps manifests.
Pulumi ESC: Manage secrets and configuration across environments with fine-grained access controls, integrated with Kubernetes via the External Secrets Operator or the Secret Store CSI Driver.
Pulumi Kubernetes Operator: Let ArgoCD reconcile Pulumi stacks via GitOps, closing the loop between your IaC and Day 2 workflows.
Many organizations have years of infrastructure built and managed with Terraform. Outputs such as VPC IDs, subnet lists, database endpoints, and cluster names are the connective tissue between infrastructure layers. Getting those values into other tools and workflows often means manual copy-paste, wrapper scripts, or brittle glue code.
The terraform-state provider for Pulumi ESC helps bridge that gap.
It reads outputs directly from your Terraform state files and makes them available as first-class values in your ESC environments — no scripts, no duplication, no drift.
Any output marked as sensitive in your Terraform state is automatically treated as a secret in ESC.
If you’ve used pulumi-stacks to read outputs from Pulumi stacks, this is the same idea for Terraform.
The terraform-state provider uses fn::open::terraform-state to read from a Terraform state file and surface its outputs as ESC values.
Here’s an example that reads from an S3 backend, using the aws-login provider for credentials, and exports a KUBECONFIG for an EKS cluster managed by Terraform:
```yaml
values:
  terraform:
    fn::open::terraform-state:
      backend:
        s3:
          login:
            fn::open::aws-login:
              oidc:
                roleArn: arn:aws:iam::123456789012:role/esc-oidc
                sessionName: pulumi-environments-session
          bucket: my-terraform-state-bucket
          key: path/to/terraform.tfstate
          region: us-west-2
  files:
    KUBECONFIG: ${terraform.outputs.kubeconfig}
```
Once the environment is opened, terraform.outputs contains every output from the Terraform state.
In this example we take the kubeconfig output from a Terraform-managed EKS cluster and project it as a file, so any tool that reads KUBECONFIG (kubectl, helm, Pulumi) just works. You can also reference outputs in pulumiConfig to pass values like VPC IDs and subnet lists directly into Pulumi stacks.
If your state lives in Terraform Cloud (or any compatible remote backend), the provider supports that too:
```yaml
values:
  terraform:
    fn::open::terraform-state:
      backend:
        remote:
          organization: my-terraform-org
          workspace: my-workspace
          token:
            fn::secret: tfc-token-value
  pulumiConfig:
    vpcId: ${terraform.outputs.vpc_id}
    subnetIds: ${terraform.outputs.subnet_ids}
```
You can point it at any Terraform Cloud-compatible backend by setting the optional hostname property.
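For example, pointing at a hypothetical self-hosted Terraform Enterprise instance would look like this (the hostname is illustrative):

```yaml
values:
  terraform:
    fn::open::terraform-state:
      backend:
        remote:
          hostname: tfe.example.com   # any Terraform Cloud-compatible backend
          organization: my-terraform-org
          workspace: my-workspace
          token:
            fn::secret: tfc-token-value
```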
Check out the full terraform-state provider documentation for the complete reference.
Note: You can also consume Terraform outputs directly in a Pulumi program with the Pulumi Terraform provider.
Managing database credentials is one of the persistent challenges in cloud infrastructure. Passwords need to be rotated, secrets need to be stored securely, and access needs to be carefully controlled. AWS IAM authentication for RDS offers a better way: instead of managing long-lived passwords, your applications authenticate using short-lived tokens generated from IAM credentials. This approach is more secure, eliminates password rotation overhead, and integrates seamlessly with your existing IAM policies. With Pulumi, you can set up this entire system using reusable components that make IAM authentication a standard part of your infrastructure.
Traditional database authentication relies on usernames and passwords. These credentials need to be stored somewhere secure, rotated regularly, and distributed to applications that need them. Each of these steps introduces complexity and potential security risks.
IAM authentication changes this model fundamentally. Instead of passwords, your applications generate authentication tokens on demand using their IAM credentials. These tokens are valid for only 15 minutes, eliminating the need for password rotation. Access control happens through IAM policies, which you’re already using to manage other AWS resources. The result is a more secure system that’s easier to maintain and audit.
The benefits become even more pronounced when you componentize this setup with Pulumi. Rather than repeating the same configuration steps for each database, you can build reusable components that handle IAM authentication setup automatically. This turns a complex, multi-step process into a simple, repeatable pattern that your entire team can use. In this post, we’ll cover an example that sets up the complete flow: an RDS cluster with IAM authentication, the necessary IAM roles and policies, and a Kubernetes application that connects using IAM tokens.
This example sets up a complete environment with IAM-authenticated database access from a Kubernetes application. The architecture includes:
VPC with public and private subnets across multiple availability zones
EKS cluster for running containerized applications
Aurora PostgreSQL cluster with IAM authentication enabled
IAM roles and policies that connect Kubernetes service accounts to database access
A demo application that authenticates to the database using IAM tokens
The key integration point is IAM Roles for Service Accounts (IRSA), which allows Kubernetes pods to assume IAM roles. This means your application doesn’t need any AWS credentials stored in environment variables or mounted secrets. The pod’s service account automatically has the permissions it needs to generate database authentication tokens.
The foundation of this setup is an RDS cluster configured to accept IAM authentication. With Pulumi’s componentized approach, you can encapsulate all the RDS configuration in a reusable component:
```typescript
const rdsCluster = new RdsCluster("iam-postgres", {
    vpcId: vpc.vpc.id,
    vpcCidrBlock: vpc.vpc.cidrBlock,
    subnetIds: vpc.publicSubnets.map((s) => s.id),
    databaseName: DATABASE_NAME,
    masterUsername: MASTER_USERNAME,
    masterPassword: dbMasterPassword,
    instanceClass: "db.t4g.medium",
    engineVersion: "17.4",
    iamDatabaseUser: IAM_DB_USERNAME,
});
```
Inside the component, the critical configuration is iamDatabaseAuthenticationEnabled:
```typescript
this.cluster = new aws.rds.Cluster(
    `${name}-cluster`,
    {
        engine: "aurora-postgresql",
        engineVersion: engineVersion,
        databaseName: args.databaseName,
        masterUsername: args.masterUsername,
        masterPassword: args.masterPassword,
        dbSubnetGroupName: subnetGroup.name,
        vpcSecurityGroupIds: [this.securityGroup.id],
        iamDatabaseAuthenticationEnabled: true,
        skipFinalSnapshot: true,
        tags: { Name: `${name}-cluster` },
    },
    { parent: this }
);
```
This single flag tells RDS to accept authentication tokens in addition to traditional passwords. The cluster still needs a master password for initial setup and administrative tasks, but your applications can use IAM authentication instead.
Enabling IAM authentication on the cluster isn’t enough. You also need to create a database user and grant it the special rds_iam role that allows IAM authentication. This is where Pulumi’s PostgreSQL provider comes in:
```typescript
const dbSetup = new DbSetup(
    "iam-postgres",
    {
        dbEndpoint: rdsCluster.cluster.endpoint,
        dbName: DATABASE_NAME,
        masterUsername: MASTER_USERNAME,
        masterPassword: dbMasterPassword,
        iamUsername: IAM_DB_USERNAME,
    },
    { dependsOn: [rdsCluster.instance] }
);
```
The DbSetup component handles all the database-level configuration. It creates the IAM user and grants the necessary permissions:
```typescript
// Create IAM-enabled database user
this.iamRole = new postgresql.Role(
    `${name}-iam-user`,
    {
        name: args.iamUsername,
        login: true,
    },
    { parent: this, provider: this.provider }
);

// Grant rds_iam role to enable IAM authentication
new postgresql.GrantRole(
    `${name}-grant-rds-iam`,
    {
        role: this.iamRole.name,
        grantRole: "rds_iam",
    },
    { parent: this, provider: this.provider }
);
```
The component also grants the necessary database privileges (CONNECT, USAGE, CREATE, and table operations). By encapsulating this in a component, you ensure that every IAM-authenticated database user is set up consistently with the correct permissions.
The next piece is the IAM role and policy that allows applications to generate authentication tokens. This involves two parts: the role that the application assumes, and the policy that grants database access.
For Kubernetes applications, this uses IRSA (IAM Roles for Service Accounts). The RdsCluster component includes a method that creates the appropriate IAM role:
```typescript
const rdsIamRole = rdsCluster.createIamRole(
    eksCluster.oidcProvider,
    NAMESPACE,
    SERVICE_ACCOUNT_NAME
);
```
This creates an IAM role with a trust policy that allows the specified Kubernetes service account to assume it:
```typescript
assumeRolePolicy: pulumi
    .all([oidcProvider.arn, oidcProvider.url])
    .apply(([arn, url]) =>
        JSON.stringify({
            Version: "2012-10-17",
            Statement: [
                {
                    Effect: "Allow",
                    Principal: { Federated: arn },
                    Action: "sts:AssumeRoleWithWebIdentity",
                    Condition: {
                        StringEquals: {
                            [`${url}:sub`]: `system:serviceaccount:${namespace}:${serviceAccountName}`,
                        },
                    },
                },
            ],
        })
    ),
```
The role is then granted the rds-db:connect permission for the specific database user:
```typescript
{
    Version: "2012-10-17",
    Statement: [
        {
            Effect: "Allow",
            Action: ["rds-db:connect"],
            Resource: `arn:aws:rds-db:${region}:${accountId}:dbuser:${resourceId}/${iamDatabaseUser}`,
        },
    ],
}
```
This policy is scoped to exactly one database user on one cluster. This level of granularity is one of the security advantages of IAM authentication: you can control database access with the same precision you use for other AWS resources.
The application running in Kubernetes needs a service account annotated with the IAM role ARN:
```typescript
this.serviceAccount = new k8s.core.v1.ServiceAccount(
    `${name}-sa`,
    {
        metadata: {
            name: args.serviceAccountName,
            namespace: args.namespace,
            annotations: {
                "eks.amazonaws.com/role-arn": args.iamRoleArn,
            },
        },
    },
    { parent: this, provider: args.provider }
);
```
When a pod uses this service account, EKS automatically configures the pod with temporary AWS credentials for the associated IAM role. The application can then use these credentials to generate database authentication tokens.
The authentication flow happens at connection time:
```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#FF9900','primaryTextColor':'#232F3E','primaryBorderColor':'#232F3E','lineColor':'#666','secondaryColor':'#527FFF','tertiaryColor':'#fff'}}}%%
sequenceDiagram
    autonumber
    participant App as Application Pod
    participant SDK as AWS SDK
    participant STS as AWS STS
    participant IAM as AWS IAM
    participant RDS as RDS Aurora
    rect rgb(255, 249, 230)
        Note over App,SDK: Phase 1: Token Generation
        activate App
        App->>+SDK: generate_db_auth_token()
        SDK->>+STS: Get temporary credentials (IRSA)
        STS-->>-SDK: Return AWS credentials
        SDK->>SDK: Sign auth request
        SDK-->>-App: Return IAM token (15 min TTL)
        deactivate App
    end
    rect rgb(235, 245, 255)
        Note over App,RDS: Phase 2: Database Connection
        activate App
        App->>+RDS: Connect (user + IAM token)
        activate RDS
        RDS->>+IAM: Validate token & permissions
        IAM-->>-RDS: Token valid + rds-db:connect allowed
        RDS->>RDS: Verify rds_iam role
        RDS-->>-App: Connection established
        deactivate App
    end
    Note over App,RDS: 💡 Token expires after 15 minutes - generate new token per connection
```
1. The application uses the AWS SDK to call `generate_db_auth_token()`, passing the database endpoint, port, and username.
2. The SDK uses the pod’s IAM credentials (provided automatically by IRSA) to sign the request and generate a token.
3. The application connects to PostgreSQL using the IAM username and the token as the password.
4. RDS validates the token against IAM, verifying that the caller has `rds-db:connect` permission for that database user.
5. If the token is valid, the connection is established.
Here’s what this looks like in Python:
```python
import boto3
import psycopg2

def get_iam_token():
    """Generate an IAM authentication token for RDS"""
    client = boto3.client('rds', region_name=AWS_REGION)
    token = client.generate_db_auth_token(
        DBHostname=DB_ENDPOINT,
        Port=DB_PORT,
        DBUsername=DB_USER,
        Region=AWS_REGION
    )
    return token

def get_db_connection():
    """Create a database connection with IAM auth"""
    token = get_iam_token()
    return psycopg2.connect(
        host=DB_ENDPOINT,
        port=DB_PORT,
        database=DB_NAME,
        user=DB_USER,
        password=token,
        sslmode='require',
        sslrootcert=RDS_CA_CERT
    )
```
The token is valid for 15 minutes. Applications should generate a new token for each connection or implement token caching with refresh logic.
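A minimal sketch of that caching logic, assuming a `generate` callable that wraps `generate_db_auth_token()` (the class and refresh margin are illustrative, not part of the AWS SDK):

```python
import time

TOKEN_TTL_SECONDS = 15 * 60  # RDS IAM auth tokens are valid for 15 minutes

class TokenCache:
    """Caches a token and regenerates it shortly before expiry."""

    def __init__(self, generate, refresh_margin=60):
        self._generate = generate      # callable returning a fresh token
        self._margin = refresh_margin  # regenerate this many seconds early
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = time.time()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._generate()
            self._expires_at = now + TOKEN_TTL_SECONDS
        return self._token
```

A connection helper like `get_db_connection()` above would then call `cache.get()` rather than generating a fresh token on every connection.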
The example includes an interactive demo application that lets you test the IAM authentication setup. Once deployed, the application provides a web interface for creating tables, adding messages, and querying the database:
Deploy the infrastructure and access the demo app:
```bash
pulumi up
export APP_URL=$(pulumi stack output appUrl)
echo "Demo app: http://$APP_URL"
```
The application shows the database endpoint and IAM username it’s using, and provides buttons to create tables and add data. Behind the scenes, every database operation authenticates using IAM tokens, demonstrating that the entire authentication flow is working correctly.
The full code for this example is available at https://github.com/pulumi-demos/examples/tree/main/typescript/aws-iam-for-postgres. The example includes all the component code referenced in this post.
This example demonstrates IAM authentication in a working environment, but you’ll want to make several adjustments for production use:
Network security: The example places RDS in public subnets to allow the PostgreSQL provider to connect during deployment. In production, place RDS in private subnets and ensure your deployment environment (like GitHub Actions or Pulumi Cloud) can reach the database for initial setup, either through a bastion host or VPN.
Connection pooling and performance: IAM tokens expire after 15 minutes. If you’re using connection pooling, you’ll need logic to refresh tokens before they expire. AWS recommends using IAM authentication only when your application creates fewer than 200 new connections per second. For higher connection rates, consider using Amazon RDS Proxy, which manages connection pooling and can reduce the overhead of IAM token generation.
Master password management: You still need a master password for database administration and for the initial user setup. Store this in AWS Secrets Manager or Pulumi ESC, and restrict access to it carefully.
Monitoring and auditing: Token generation via the AWS SDK is logged in CloudTrail. Note that CloudWatch and CloudTrail do not log database authentication attempts themselves, so you’ll need to rely on PostgreSQL’s native logging for connection monitoring.
Multi-region considerations: If your application runs in multiple regions, ensure that IAM policies and database access work correctly across regions. Token generation and validation must happen in the same region as the RDS cluster.
Cost: IAM authentication itself is free, but be aware that cross-AZ data transfer costs still apply for database connections.
IAM authentication for PostgreSQL transforms database credential management from a security burden into a seamless part of your infrastructure. By eliminating long-lived passwords, you reduce your attack surface and remove the operational overhead of password rotation. By integrating with IAM, you gain fine-grained access control and comprehensive audit logs.
Pulumi makes this setup practical and repeatable through componentization. The components in this example encapsulate the complexity of IAM authentication setup, making it easy to apply the same pattern across multiple databases and applications. This turns security best practices into your team’s default approach.
Whether you’re building a new application or improving the security posture of existing infrastructure, IAM authentication for RDS deserves consideration. The investment in setup pays dividends in security, maintainability, and operational simplicity.
Pulumi ESC environments can now validate configuration values against JSON Schema with the new fn::validate built-in function. Invalid configurations are caught immediately when you save, preventing misconfigurations from reaching your deployments.
Configuration errors are often discovered too late during deployment or, worse, in production. With fn::validate, you define validation rules directly in your environment, and ESC enforces them at save time. If a value doesn’t match its schema, the environment cannot be saved until the issue is resolved.
The fn::validate function takes a JSON Schema and a value. If the value conforms to the schema, it passes through unchanged. If not, ESC raises a validation error.
```yaml
values:
  port:
    fn::validate:
      schema: { type: number, minimum: 1, maximum: 65535 }
      value: 8080
```
This validates that port is a number between 1 and 65535. The evaluated result is simply 8080.
For complex configurations, you can enforce structure and required fields:
```yaml
values:
  database:
    fn::validate:
      schema:
        type: object
        properties:
          host: { type: string }
          port: { type: number }
          name: { type: string }
        required: [host, port, name]
      value:
        host: "db.example.com"
        port: 5432
        name: "myapp"
```
If any required field is missing or has the wrong type, the environment cannot be saved.
Define schemas once and reference them across multiple environments. Using the environments built-in property keeps the schema out of your environment’s output:
Schema environment (`my-project/schemas`):

```yaml
values:
  database-schema:
    type: object
    properties:
      host: { type: string }
      port: { type: number }
    required: [host, port]
```

Environment using the schema:

```yaml
values:
  database:
    fn::validate:
      schema: ${environments.my-project.schemas.database-schema}
      value:
        host: "prod-db.example.com"
        port: 5432
```
This pattern ensures consistent validation rules across teams and projects.
When a value doesn’t conform to its schema, ESC returns a clear error message:
```yaml
values:
  port:
    fn::validate:
      schema: { type: string }
      value: 8080
```
This raises: expected string, got number. The environment cannot be saved until you fix the value or update the schema.
Enable fn::validate for:
Values with specific type requirements (numbers, strings, arrays)
Objects that must have certain fields present
Numbers that must fall within a valid range
Configurations shared across multiple environments
Any value where catching errors early prevents downstream issues
The fn::validate function is available now in all Pulumi ESC environments. Add schema validation to your existing environments or use it when creating new ones.
For more information, see the fn::validate documentation.
Running multiple providers with different credentials in the same Pulumi program has always been tricky.
Providers expect fixed environment variable names like AWS_ACCESS_KEY_ID or ARM_CLIENT_SECRET, so if you need two AWS providers targeting different accounts, you can’t configure both via environment variables.
Pulumi v3.220.0 introduces envVarMappings, a new resource option that solves this problem by letting you remap provider environment variables to custom keys.
When configuring a Pulumi provider to authenticate against a cloud provider, there are two main options. You can set authentication values as secrets in your stack configuration:

```bash
$ pulumi config set azure-native:clientSecret --secret
```
Alternatively, you can set environment variables in the shell where you run your Pulumi commands:

```bash
$ export ARM_CLIENT_SECRET=1234567
```
Using environment variables in this manner is especially useful in CI environments, or when you’d rather not write that auth token to state, even encrypted. But there are currently several use cases where this breaks down, due to the hard-coded nature of the environment variables that a given provider expects.
For example, if you use multiple explicit providers targeting different Azure accounts, you could not set their separate configurations via environment variables. Instead, you had to set these values in the provider config, which may not be desirable for all use cases. Not only does provider config write secrets to state (albeit encrypted), but it can also produce a noisy diff on an otherwise no-op update when token rotation is used.
For this and similar scenarios, we have a new solution for you: setting mappings of environment variable keys on your provider. The concept is as follows:
“For any environment variable that my Pulumi provider expects, I want to be able to tell the provider to use the value of a custom-defined environment variable instead.”
Let’s say your provider expects ARM_CLIENT_SECRET, but you want it to use a different value than the one set in your shell. First, define a custom environment variable with your desired value:
```bash
$ export CUSTOM_ARM_CLIENT_SECRET=7654321
```
Then, use envVarMappings to tell the provider: “When you look for ARM_CLIENT_SECRET, read from CUSTOM_ARM_CLIENT_SECRET instead.” The mapping format is { "SOURCE_VAR": "TARGET_VAR" }:
```typescript
import * as command from "@pulumi/command";

const provider = new command.Provider("command-provider", {}, {
    envVarMappings: {
        // The provider reads ARM_CLIENT_SECRET from CUSTOM_ARM_CLIENT_SECRET
        "CUSTOM_ARM_CLIENT_SECRET": "ARM_CLIENT_SECRET",
    },
});
```
```python
import pulumi
import pulumi_command as command

provider = command.Provider("command-provider",
    opts=pulumi.ResourceOptions(
        env_var_mappings={
            # The provider reads ARM_CLIENT_SECRET from CUSTOM_ARM_CLIENT_SECRET
            "CUSTOM_ARM_CLIENT_SECRET": "ARM_CLIENT_SECRET",
        }
    )
)
```
```go
provider, err := command.NewProvider(
    ctx,
    "command-provider",
    &command.ProviderArgs{},
    pulumi.EnvVarMappings(map[string]string{
        // The provider reads ARM_CLIENT_SECRET from CUSTOM_ARM_CLIENT_SECRET
        "CUSTOM_ARM_CLIENT_SECRET": "ARM_CLIENT_SECRET",
    }),
)
```
```csharp
var provider = new Command.Provider("command-provider", new Command.ProviderArgs(), new CustomResourceOptions
{
    EnvVarMappings = new Dictionary<string, string>
    {
        // The provider reads ARM_CLIENT_SECRET from CUSTOM_ARM_CLIENT_SECRET
        { "CUSTOM_ARM_CLIENT_SECRET", "ARM_CLIENT_SECRET" }
    }
});
```
```java
var provider = new Provider("command-provider", ProviderArgs.Empty, CustomResourceOptions.builder()
    .envVarMappings(Map.of(
        // The provider reads ARM_CLIENT_SECRET from CUSTOM_ARM_CLIENT_SECRET
        "CUSTOM_ARM_CLIENT_SECRET", "ARM_CLIENT_SECRET"
    ))
    .build());
```
```yaml
resources:
  command-provider:
    type: pulumi:providers:command
    options:
      envVarMappings:
        # The provider reads ARM_CLIENT_SECRET from CUSTOM_ARM_CLIENT_SECRET
        CUSTOM_ARM_CLIENT_SECRET: ARM_CLIENT_SECRET
```
You can now control every environment variable value your provider sees: define a custom variable with the value you want, then map it to the name the provider expects. Try it out with Pulumi v3.220.0 today!
For full details, see the envVarMappings documentation.
Happy coding!
Platform teams need visibility into package adoption at scale. Responding to security advisories, planning deprecations, and tracking version sprawl all require knowing which stacks run which package versions across your organization.
Previously, we introduced the “Used by” tab on individual package pages, giving you visibility into which stacks use a specific package. However, navigating package by package doesn’t scale when you’re managing dozens of packages across hundreds of stacks.
Today, we’re extending that visibility to the organization level. You can now see adoption data for all packages at a glance, filter by usage status, and share specific views with your team.
The package list now displays three usage columns for each package:
Stacks on latest: the number of stacks running the latest version
Not on latest: the number of stacks running older versions
Total: all stacks using any version of the package
These numbers update as stacks are deployed, giving you a real-time view of adoption across your organization.
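The arithmetic behind the three columns is straightforward. Here is a sketch in TypeScript, assuming a simple list of stack-to-version records; the field names and data shape are illustrative, not Pulumi Cloud's actual API:

```typescript
// Illustrative only: the data shape is an assumption, not Pulumi Cloud's API.
interface StackUsage {
  stack: string;
  version: string; // package version this stack currently runs
}

function usageColumns(stacks: StackUsage[], latest: string) {
  const onLatest = stacks.filter((s) => s.version === latest).length;
  return {
    stacksOnLatest: onLatest,              // "Stacks on latest"
    notOnLatest: stacks.length - onLatest, // "Not on latest"
    total: stacks.length,                  // "Total"
  };
}

const cols = usageColumns(
  [
    { stack: "dev", version: "6.2.0" },
    { stack: "staging", version: "6.2.0" },
    { stack: "prod", version: "5.9.1" },
  ],
  "6.2.0"
);
// cols → { stacksOnLatest: 2, notOnLatest: 1, total: 3 }
```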
Three filters help you find packages that need attention:
Used: packages with at least one stack
Unused: packages with zero usage
Not on latest: packages where stacks are running older versions
Combine filters with search to find specific packages.
The new Registry tab under Platform shows all packages available to your organization, including public providers and components from pulumi.com/registry alongside your organization’s private packages. The Private Components tab (previously called Components) now includes the same usage columns and filters.
Search queries, filters, and pagination sync to the URL. Copy the URL to share a specific view with your team, or bookmark it for quick access to your regular monitoring workflow.
These features are designed for the scenarios platform teams face regularly:
Security response: filter to “Not on latest” to identify stacks running vulnerable versions
Deprecation planning: before retiring a package, check its usage to understand the migration scope
Version sprawl: identify packages where teams are running many different versions and prioritize standardization efforts
Adoption tracking: see which packages are gaining traction and which aren’t being adopted
Navigate to Platform > Registry in Pulumi Cloud to explore your organization’s packages with the new usage columns and filters. For more details on the private registry features, see the Private Registry documentation.
Before Platybot, our #analytics Slack channel was a support queue. Every day, people from every team would ask questions: “Which customers use feature X?”, “What’s our ARR by plan type?”, “Do we have a report for template usage?” Our two-person data team was a bottleneck.
Our #analytics channel, before Platybot (dramatized).
We didn’t want to just throw an LLM at our Snowflake warehouse either. Without guardrails, large language models generate SQL that may run but silently gets the answer wrong: different join logic, wrong filters, missing snapshot handling, incorrect summarization. We needed something that could answer reliably for most queries; otherwise we’d just trade writing SQL ourselves for debugging LLM-generated SQL.
So we built Platybot (platypus + bot, named after our mascot), an AI-powered analytics assistant that any Pulumi employee can use to query our Data Warehouse in natural language. It’s available as a Web App, a Slack bot, and a Model Context Protocol (MCP) server. The infrastructure is deployed with Pulumi IaC (Infrastructure as Code). But the most important thing we learned building it is that the AI was the easy part. The semantic layer is what makes it work.
The naive solution is obvious: connect an LLM to your database and let it write SQL. But this fails in practice, and the failure mode is insidious. Consider a few examples from our warehouse:
ARR is a snapshot metric. If you query ARR (Annual Recurring Revenue) without filtering by end-of-period dates (last day of month or quarter), you get duplicate rows and wildly inflated numbers. An LLM doesn’t automatically know this.
Account queries need exclusions. Most queries should exclude Pulumi’s own internal accounts and deleted ones. Without these filters, you’re counting test data alongside real customers.
User queries need employee filters. Querying active users without excluding Pulumi employees inflates adoption metrics.
The danger shows up when results feel decision-ready before anyone has validated how the numbers were derived. A confidently wrong ARR figure presented to leadership is worse than no answer at all.
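The snapshot pitfall is easy to see with made-up numbers. A sketch in TypeScript: summing every daily snapshot row inflates ARR, so you must keep only end-of-period rows before aggregating:

```typescript
// Made-up data to illustrate the snapshot pitfall, not real figures.
interface ArrSnapshot {
  date: string;    // ISO date of the daily snapshot
  account: string;
  arr: number;
}

// A date is end-of-month if the following day is the 1st.
function isEndOfMonth(iso: string): boolean {
  const d = new Date(iso + "T00:00:00Z");
  const next = new Date(d);
  next.setUTCDate(d.getUTCDate() + 1);
  return next.getUTCDate() === 1;
}

const snapshots: ArrSnapshot[] = [
  { date: "2026-01-30", account: "acme", arr: 100_000 },
  { date: "2026-01-31", account: "acme", arr: 100_000 }, // same account, next day
  { date: "2026-01-31", account: "globex", arr: 50_000 },
];

// Naive sum double-counts acme across snapshot days: 250,000 (wrong).
const naive = snapshots.reduce((sum, s) => sum + s.arr, 0);

// Filtering to end-of-period rows first gives 150,000 (right).
const correct = snapshots
  .filter((s) => isEndOfMonth(s.date))
  .reduce((sum, s) => sum + s.arr, 0);
```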
Many organizations run into the same constraint once data usage spreads. Dashboards answer yesterday’s questions, not today’s. Ad-hoc LLM queries answer today’s questions, but incorrectly. The gap between “I have a question” and “I have a trustworthy answer” is where data teams can get stuck.
Before writing a single line of AI code, we built a semantic layer using Cube (open source). This was the hardest, least glamorous, and most important part of the entire project.
A semantic layer is a shared, versioned definition of what your business metrics mean.
“Monthly active users” starts with COUNT(DISTINCT user_id), but the aggregation is only the outer layer. It has to be applied to the right table (fct_pulumi_operations), with the right filters (exclude Pulumi employees, exclude deleted organizations), scoped to a calendar month, and counting only users who performed real operations — not just previews. The semantic layer encodes all of this once, and the AI can use it to build queries without needing to guess which tables are related to each other, whether it’s one-to-one or many-to-many, etc.
We organized our data into seven domains: Revenue, Cloud, Core, Clickstream, Community, Support, and People. Each domain contains cubes (think of them as well-defined, composable views) with explicit measures, dimensions, and joins. Here’s a real example from our Cloud domain (trimmed for readability):
cubes:
- name: fct_pulumi_operations
sql_table: CLOUD.FCT_PULUMI_OPERATIONS
description: >
Pulumi CLI operations (update, preview, destroy, refresh).
Each row is one operation with resource changes, duration,
and CLI environment.
joins:
- name: dim_organization
sql: '{CUBE}.ORGANIZATION_HK = {dim_organization}.ORGANIZATION_HK'
relationship: many_to_one
- name: dim_stacks
sql: '{CUBE}.STACK_PROGRAM_HK = {dim_stacks}.STACK_PROGRAM_HK'
relationship: many_to_one
- name: dim_user
sql: '{CUBE}.USER_HK = {dim_user}.USER_HK'
relationship: many_to_one
measures:
- name: count
type: count
- name: resource_count
sql: '{CUBE}.RESOURCE_COUNT'
type: sum
description: Number of resources active when the operation finished
- name: operations_succeeded
sql: "CASE WHEN {CUBE}.OPERATION_STATE = 'succeeded' THEN 1 ELSE 0 END"
type: sum
description: Count of operations that succeeded
# ... plus other measures
dimensions:
- name: operation_type
sql: '{CUBE}.OPERATION_TYPE'
type: string
description: Type of operation (Refresh, Update, Destroy, etc)
- name: operation_state
sql: '{CUBE}.OPERATION_STATE'
type: string
description: Last known state of this operation
# ... 20+ more dimensions
The key insight: the semantic layer makes the AI’s job tractable. Instead of generating arbitrary SQL from scratch, where the search space is “any possible SQL query against hundreds of tables,” the AI picks from a defined set of measures, dimensions, and joins. The search space shrinks from almost infinite to a well-bounded set of valid combinations.
A semantic layer is to an AI data assistant what a type system is to a programming language. It doesn’t eliminate errors, but it makes entire categories of mistakes structurally impossible. The AI can’t calculate ARR wrong because it doesn’t calculate ARR at all. It references a pre-defined measure that already encodes the correct logic.
Building the semantic layer was mainly data engineering work. We already had agreed-upon metric definitions across the company. The challenge was encoding those definitions into Cube: specifying the correct joins between tables, wiring up the right filters, and making sure every measure matched the logic our dashboards already used. Tedious, but essential.
With the semantic layer in place, the AI becomes a translation problem: convert natural language into a Cube query.
Platybot supports multiple models: Claude Opus 4.6, Claude Sonnet 4.5, and Gemini 3 Pro. Users can choose which model to use. We found that different models have different strengths. Claude excels at structured data queries, while Gemini performs very well on text-heavy tasks like analyzing call transcriptions.
The system prompt gives the model awareness of available cubes, their measures, dimensions, and joins. When a user asks “What’s the ARR breakdown by plan type?”, the model doesn’t write SQL. Instead, it constructs a Cube query, selecting the total_arr measure from the ARR table and grouping by the sku from the subscriptions dimension. Cube handles the SQL generation, the joins, and the filters. For edge cases the semantic layer doesn’t cover, the model can fall back to direct (read-only) SQL against Snowflake, usually starting from a semantic-layer query that is already close to what it needs.
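For illustration, the Cube query for that ARR question might look like the following. The member names (arr.total_arr, subscriptions.sku, dim_organization.is_deleted) are assumptions based on the examples in this post, not our exact schema:

```typescript
// Hypothetical Cube-style query: measures, dimensions, and filters
// instead of hand-written SQL. Member names are illustrative.
const cubeQuery = {
  measures: ["arr.total_arr"],
  dimensions: ["subscriptions.sku"],
  filters: [
    // Standard exclusions (internal accounts, snapshot dates) already live
    // in the semantic layer; ad-hoc filters stay declarative:
    {
      member: "dim_organization.is_deleted",
      operator: "equals",
      values: ["false"],
    },
  ],
};
// Cube compiles this into SQL with the correct joins and filters applied.
```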
Query generation follows a workflow that maps to the same tools a human analyst would use:
flowchart TB
  subgraph Entry["Entry points"]
    Web["Web UI"]
    Slack["Slack (@platybot)"]
    MCP["MCP Server"]
  end
  subgraph Core["Platybot core"]
    BE["Backend (Express + LLM)"]
  end
  subgraph Data["Data layer"]
    Cube["Cube semantic layer"]
    SF["Snowflake"]
  end
  Web --> BE
  Slack --> BE
  MCP --> BE
  BE -->|"Structured queries"| Cube
  BE -->|"Fallback SQL (read-only)"| SF
  Cube --> SF
The workflow is: discover domains, explore cubes, understand the schema, construct and execute the query. This mirrors how a data analyst would work. You don’t jump straight to SQL; you first understand what data is available and what the metrics mean.
The system’s role is narrower than people expect. It’s a translator between human intent and a well-defined data model, not a general-purpose data scientist. This is a feature, not a limitation. By constraining Platybot to operate within the semantic layer, we get reliability. The creativity goes into understanding the question, not into inventing SQL.
The backend solved query generation. It did not solve adoption, and if people don’t use it, it doesn’t matter. We launched Platybot across three interfaces, each designed for a different workflow.
The web app is the primary interface: a React 19 + TypeScript + Vite + Tailwind conversational UI where employees can explore data through multi-turn conversations. It supports table visualization, data export, and conversation history so you can pick up where you left off.
Every analysis produces a shareable report with a permanent link, protected behind company authentication. The report shows the agent’s reasoning so you can verify its approach, and lists every query and table used along with a sample of the data. Each query includes a direct link to run it in Metabase, our reporting tool, so any useful query can be saved, scheduled, or extended without starting from scratch.
The web UI sees the highest adoption of all three access points. It’s where people go for deeper data exploration: multi-step analyses, follow-up questions, and comparing metrics across different dimensions.
One fun detail we added during our last hackathon: while queries iterate through analysis steps (discovering domains, exploring cubes, executing queries), users are entertained by a platypus-themed runner game. Think Chrome’s dinosaur game, but with our mascot on a skateboard. Sometimes Platybot can take a long time iterating, so you can try to beat your high score in the meantime!
Platybot is also available as a Slack bot. Users @mention it in any channel, and it replies in a thread with the answer, keeping the channel clean while making results visible to the whole team. It’s best suited for quick lookups — “what’s the ARR for account X?” — where the answer fits in a message rather than a deep analysis. Most usage stayed in the web UI, but the Slack bot fills a different niche: answers that benefit from being shared in context.
The Model Context Protocol (MCP) server is the newest addition, launched February 4, 2026. It exposes six tools that any MCP-compatible client can use:
list_domains - discover available data domains
list_cubes - explore cubes within a domain
get_cube_details - get measures, dimensions, and joins for a cube
execute_cube_query - run a structured query
execute_sql_query - raw SQL fallback (read-only)
get_generated_sql - preview SQL without executing
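The analyst-style workflow these tools enable can be sketched as follows. Here `callTool` is a stand-in for whatever MCP client you use, and the canned return values are invented purely for illustration:

```typescript
// Stand-in for an MCP client; returns canned data for illustration only.
async function callTool(
  name: string,
  args: Record<string, unknown>
): Promise<any> {
  const canned: Record<string, unknown> = {
    list_domains: ["Revenue", "Cloud", "Core"],
    list_cubes: ["fct_pulumi_operations"],
    get_cube_details: { measures: ["count"], dimensions: ["operation_type"] },
    execute_cube_query: [{ operation_type: "update", count: 42 }],
  };
  return canned[name];
}

// Mirrors how an analyst works: understand what data exists before querying.
async function answer(question: string) {
  const domains = await callTool("list_domains", {});                    // 1. discover domains
  const cubes = await callTool("list_cubes", { domain: domains[1] });    // 2. explore cubes
  const schema = await callTool("get_cube_details", { cube: cubes[0] }); // 3. understand the schema
  return callTool("execute_cube_query", {                                // 4. construct and execute
    measures: schema.measures,
    dimensions: schema.dimensions,
  });
}
```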
Setup is a single command:
claude mcp add --transport http platybot
This is a paradigm shift. Platybot goes from a destination (open the app, ask a question) to a capability that any AI tool can use. An engineer writing a postmortem can pull live metrics without leaving their terminal. An analyst building a report in Claude Code can query data inline. The Data Warehouse becomes ambient, always available, never in the way.
Security uses OAuth 2.0 Device Authorization Flow with PKCE for CLI-friendly authentication, with @pulumi.com domain restriction, read-only enforcement, rate limiting, and full audit logging.
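The PKCE half of that flow is simple to illustrate: the client generates a random verifier, sends only its hash as the challenge, and later proves possession of the verifier during the token exchange. A minimal sketch with Node's crypto module (endpoint wiring omitted; this is not Platybot's actual code):

```typescript
import { createHash, randomBytes } from "crypto";

// Generate a PKCE verifier and its S256 challenge (per RFC 7636).
function pkcePair() {
  const verifier = randomBytes(32).toString("base64url"); // 43-char verifier
  const challenge = createHash("sha256").update(verifier).digest("base64url");
  return { verifier, challenge };
}

const { verifier, challenge } = pkcePair();
// The device authorization request carries `challenge` (code_challenge,
// code_challenge_method=S256); the token request later carries `verifier`,
// and the server recomputes the hash to confirm they match.
```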
A meta moment: we used the Platybot MCP server to query our Slack messages table to find the real usage data cited above.
Platybot’s infrastructure runs on AWS and is managed entirely with Pulumi IaC. The stack includes ECS Fargate for the application, an Application Load Balancer for traffic routing, EFS for persistent storage, ECR for container images, and Route 53 for DNS. The setup is intentionally simple, and we use a small instance size since it’s only for internal use.
Here’s a representative snippet of the Fargate service definition:
const service = new awsx.ecs.FargateService("platybot", {
cluster: cluster.arn,
assignPublicIp: false,
taskDefinitionArgs: {
container: {
name: "platybot",
image: image.imageUri,
essential: true,
portMappings: [{
containerPort: 3000,
targetGroup: targetGroup,
}],
environment: [
{ name: "CUBE_API_URL", value: cubeApiUrl },
{ name: "NODE_ENV", value: "production" },
],
secrets: [
{ name: "ANTHROPIC_API_KEY", valueFrom: anthropicKeySecret.arn },
{ name: "SNOWFLAKE_PASSWORD", valueFrom: snowflakePasswordSecret.arn },
],
logConfiguration: {
logDriver: "awslogs",
options: {
"awslogs-group": logGroup.name,
"awslogs-region": aws.config.region!,
"awslogs-stream-prefix": "platybot",
},
},
},
},
});
Dogfooding Pulumi for our internal tools has real benefits beyond the obvious. Staging environments are trivial: spin up a full copy of the stack with different parameters. Secrets management is built in, so API keys and database credentials never touch a config file. And when something needs to change, the diff-and-preview workflow (pulumi preview) catches mistakes before they hit production.
Since launch in September 2025: over 1,700 questions from 83 employees across every team. Usage grew steadily — from around 8 questions per day in the first month to 18 per day by January 2026, with 51 unique users that month alone. This was real production work, not experimentation. Customer analysis for sales calls, resource breakdowns for account managers, blog performance metrics for marketing, policy adoption research for product, ARR deep-dives for leadership.
The impact on the data team was immediate. Questions that used to land in the #analytics channel and wait for a human now get answered in seconds. The data team shifted from answering routine queries to building better models and improving data quality. We went from being a help desk to being a platform team.
Rumours of our replacement have been greatly exaggerated. Platybot just freed us up for the work that doesn’t fit in a Slack reply.
Accuracy is harder to quantify, but the semantic layer gives us confidence. Because Platybot uses pre-defined measures instead of generating arbitrary SQL, entire categories of errors (wrong joins, missing filters, incorrect aggregations) are structurally less likely. When the model does get something wrong, it’s usually in interpreting the question, not in the data — and users can verify that through the report’s reasoning trace.
The semantic layer matters more than the model. We spent more time defining metrics in Cube than we did on any AI work. That investment pays for itself: swap the model, and the answers stay correct. Swap the semantic layer, and nothing works.
Meet users where they already are. Different UIs handle different cognitive loads. Slack catches the quick questions, the web UI handles deep exploration, and MCP makes data ambient for AI-native workflows. Each interface serves a different mode of thinking.
Transparency builds trust. Showing the AI’s reasoning, the queries it ran, and linking to Metabase for verification turned skeptics into regular users. People don’t trust a black box, but they’ll trust a tool that shows its work.
When Claude Code first released skills, I ignored them. They looked like fancy prompts, another feature to add to the pile of things I would get around to learning eventually. Then I watched a few engineers demonstrate what skills actually do, and something clicked. By default, language models do not write good code. They write plausible code based on what they have read. Plausible code turns into bugs, horrible UX, and infrastructure that breaks at 3am.
Skills fill that gap. They package engineering expertise into something Claude can use. The workflows and judgment matter more than the raw information. Without skills, every conversation starts from zero. You explain the same conventions and correct the same mistakes. Every morning, back to zero.
Think about what separates a junior engineer from a senior one. Both can write code that compiles. Both can deploy infrastructure that runs. The difference is that the senior engineer knows the patterns that prevent problems before they happen. They know when to use component resources instead of plain resources. They know that creating infrastructure inside an apply() callback (Pulumi’s way of transforming outputs that are not known until deployment) breaks preview. They know that hardcoded credentials will eventually end up in a git log somewhere embarrassing.
This knowledge takes years to accumulate through painful experience. Skills let you transfer that knowledge to Claude in minutes. And here is the thing that makes them practical: if you find yourself doing the same type of task with different content each time, that is a skill waiting to be built. You encode the process once, then feed it new inputs forever.
I heard an analogy recently that made skills click for me. Imagine Claude as a mechanic. A capable mechanic who knows engines, can diagnose problems, and fix most cars that come through the shop.
MCP servers are like giving that mechanic a set of tools. Wrenches, diagnostic equipment, lift systems. Without tools, the mechanic cannot do much. With tools, the mechanic can work on whatever comes through the door.
But what happens when someone brings in a Formula 1 race car? Or a 1967 Ford Mustang that has been modified beyond recognition? The mechanic knows engines in general, but these specific vehicles require specific knowledge. The F1 car has procedures that must be followed in exact order. The vintage Mustang has quirks that only someone who has worked on that model would know.
Skills are the user manuals and standard operating procedures for these specific vehicles. They tell the mechanic what needs to happen and when. They encode the expertise of someone who has done this work a thousand times.
Or think about it from a carpenter’s perspective. Skills are the process to make the table: the measurements, the design, the exact steps. MCPs are the tools: the saw, the hammer, the drill. You need both. The process alone is theoretical, and tools without a process just sit in the garage.
For DevOps engineers working with Pulumi, this matters because infrastructure as code has its own quirks and patterns. Generic AI assistance produces code that looks reasonable but breaks conventions the community learned the hard way. Skills teach Claude those conventions.
Before skills clicked for me, I tried solving the expertise problem with MCPs. I kept adding servers until I noticed Claude getting slower and making worse decisions. Turns out the GitHub MCP alone eats 46,000 tokens across 91 tools before you type anything. Cursor eventually capped MCPs at 40 tools because too many options made everything worse.
Slash commands were another option, but you had to remember to invoke them. Anthropic apparently agreed, because in January 2026 they merged slash commands into skills. One unified system instead of two.
Skills avoid this through progressive disclosure. Claude reads just the description at startup, maybe a hundred tokens. The full procedures only load when Claude decides they are relevant. Unlike those massive system prompts that used to eat through your context window, skills stay out of the way until they are needed. For DevOps engineers running long infrastructure sessions with dozens of resources, this matters. You keep your context budget for the actual work instead of burning it on instructions. Skills can also fork context, spinning up isolated subagents that do work without polluting your main conversation. Think of it like handing a colleague a written brief. They go work on it, hand back a summary, and never sit in on your conversation.
I still use MCPs for connecting Claude to external systems. The Pulumi MCP server lets Claude query the registry and validate code. But MCPs give Claude access to things. Skills teach Claude how to think about things. Different jobs. They get more useful when you combine them. One engineer built a financial reporting skill that connects to his Mercury bank account via MCP, pulls every transaction for a given month, classifies the expenses into categories, and generates a styled HTML report with totals and breakdowns. A skill that knows your deployment process connecting to MCPs that talk to your actual infrastructure is the same idea, just pointed at ops instead of accounting.
Skills are portable. They follow an open standard, so a skill you write for Claude Code works in Cursor, GitHub Copilot, or anywhere else that supports agent skills. You can even copy the skill content into ChatGPT as a starting prompt. No vendor lock-in.
The first time you ask Claude to help with a Pulumi project, the process is painful. You have to explain the patterns you want. You correct mistakes. You explain why creating resources inside apply() breaks things. By the third or fourth project, you start copying your corrections from previous conversations.
I built the dirien/claude-skills pulumi-typescript skill after going through this painful process too many times. It knows the patterns that prevent common mistakes: Pulumi ESC (Environments, Secrets, and Configuration) integration, OIDC (OpenID Connect) instead of hardcoded access keys, ComponentResource abstractions (reusable groups of related resources), and proper output structuring so dependent stacks can consume them cleanly.
| Skill | What it teaches Claude |
| --- | --- |
| pulumi-typescript | Pulumi with TypeScript, ESC secrets management, component patterns, and multi-cloud deployment |
npx skills add https://github.com/dirien/claude-skills --skill pulumi-typescript
Skills install as markdown files in your project’s .claude/skills/ directory, so they travel with your repo and are easy to review.
The next time you ask Claude to create infrastructure, it applies these patterns automatically. You do not have to remember to invoke the skill or correct the same mistakes repeatedly.
Pulumi maintains its own skills repository at pulumi/agent-skills, which they announced recently. The repo includes skills for ComponentResource patterns, Automation API, and migration from Terraform, CDK, CloudFormation, and ARM. The two I use daily are pulumi-esc and pulumi-best-practices.
The pulumi-esc skill teaches Claude how to work with Pulumi ESC (Environments, Secrets, and Configuration). It knows the difference between pulumi env get, pulumi env open, and pulumi env run. It sets up OIDC for dynamic credentials, integrates with external secret stores like AWS Secrets Manager and Vault, and structures layered environment composition so your dev, staging, and production configs inherit from a shared base.
The pulumi-best-practices skill catches the mistakes that burn you in production. It stops Claude from creating resources inside apply() callbacks, enforces proper parent relationships in ComponentResources, encrypts secrets from day one, and makes sure pulumi preview runs before any deployment. These are the patterns that took me years to internalize, and now Claude follows them by default.
| Skill | What it teaches Claude |
| --- | --- |
| pulumi-esc | Environment, secrets, and configuration management with OIDC, dynamic credentials, and secret store integration |
| pulumi-best-practices | Resource dependencies, ComponentResource patterns, secret encryption, and safe refactoring |
npx skills add https://github.com/pulumi/agent-skills --skill pulumi-esc
npx skills add https://github.com/pulumi/agent-skills --skill pulumi-best-practices
You deploy something, it works, and six months later something breaks and you realize you never added monitoring. We have all been there. The monitoring skills from the community teach Claude to add observability from the start.
The jeffallan/claude-skills repository contains a monitoring-expert skill that knows Prometheus, Grafana, and DataDog.
| Skill | What it teaches Claude |
| --- | --- |
| monitoring-expert | Structured logging, metrics, distributed tracing, alerting, and performance testing for production systems |
npx skills add https://github.com/jeffallan/claude-skills --skill monitoring-expert
In my testing, deploying a static website with these skills installed looks different from vanilla Claude. Instead of just creating the S3 bucket and CloudFront distribution, Claude asked about error rate thresholds before writing any code. It suggested CloudWatch alarms and created an SNS topic for alerts. The results are not always this clean. Sometimes the monitoring suggestions are generic or miss your specific SLO requirements. But the baseline shifted from “no monitoring at all” to “monitoring that needs tuning,” and that is a better starting point.
Kubernetes has hundreds of configuration options. Most deployments use a handful of them. The problem is that the important options like security contexts, resource limits, and pod disruption budgets are easy to forget when you are focused on getting something to run.
The jeffallan/claude-skills kubernetes-specialist skill focuses on configurations that production deployments actually need. Without it, ask Claude for a deployment and you get something that runs: the right image, the right ports, maybe a service. With the skill, the same request comes back with runAsNonRoot: true in the security context, resource requests and limits that reflect actual usage patterns, liveness and readiness probes with sensible intervals, and a pod disruption budget. These are the things that make the difference between “it works in staging” and “it survives a node failure in production.” The skill also understands when RollingUpdate makes sense versus Recreate, which is the kind of judgment call that usually requires context a generic model does not have.
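The difference is easy to show as data. Here is a sketch of the hardened settings described above, written as plain TypeScript objects mirroring the relevant manifest fields; all values are illustrative examples, not recommendations for your workload:

```typescript
// Illustrative hardened pod spec, as plain data mirroring a Deployment
// manifest. Values are examples only.
const hardenedPodSpec = {
  securityContext: { runAsNonRoot: true, runAsUser: 10001 },
  containers: [
    {
      name: "web",
      image: "registry.example.com/web:1.4.2", // pinned tag, never :latest
      resources: {
        requests: { cpu: "100m", memory: "128Mi" },
        limits: { cpu: "500m", memory: "256Mi" },
      },
      livenessProbe: {
        httpGet: { path: "/healthz", port: 8080 },
        periodSeconds: 10,
      },
      readinessProbe: {
        httpGet: { path: "/ready", port: 8080 },
        periodSeconds: 5,
      },
    },
  ],
};

// A PodDisruptionBudget keeps a quorum of pods alive during node drains.
const pdb = {
  apiVersion: "policy/v1",
  kind: "PodDisruptionBudget",
  spec: { minAvailable: 1, selector: { matchLabels: { app: "web" } } },
};
```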
The wshobson/agents repository fills in the gaps around Kubernetes with CI/CD, cost management, and deployment workflows:
| Skill | What it teaches Claude |
| --- | --- |
| kubernetes-specialist | Production cluster management, security hardening, and cloud-native architectures |
| cost-optimization | Cloud cost reduction across AWS, Azure, and GCP with right-sizing and reserved instances |
| github-actions-templates | CI/CD workflows, Docker builds, Kubernetes deployments, security scanning, and matrix builds |
| gitops-workflow | ArgoCD and Flux CD for automated Kubernetes deployments |
npx skills add https://github.com/jeffallan/claude-skills --skill kubernetes-specialist
npx skills add https://github.com/wshobson/agents --skill gitops-workflow
npx skills add https://github.com/wshobson/agents --skill github-actions-templates
npx skills add https://github.com/wshobson/agents --skill cost-optimization
The obra/superpowers repository contains a skill that changed how I debug with Claude. The systematic-debugging skill implements a four phase framework: root cause investigation, pattern analysis, hypothesis testing, and implementation.
Without this skill, Claude tends to suggest solutions immediately. Something is broken, here are five things that might fix it. This feels helpful but often wastes time because none of the suggestions address the actual problem.
With the systematic debugging skill, Claude approaches problems differently. It asks clarifying questions. It wants to see logs. It builds a model of what is happening before suggesting changes. When it proposes a fix, it explains why that fix addresses the root cause. Sometimes skills find problems you did not know about. One engineer pointed a skills-equipped Claude at a set of SEO pages and discovered they had been decaying for months with nobody watching. The infrastructure parallel is obvious: configuration drift, unused resources, permissions that expanded over time. A debugging skill that investigates before prescribing will find these things.
| Skill | What it teaches Claude |
| --- | --- |
| systematic-debugging | Root cause investigation, pattern analysis, hypothesis testing, and verified implementation |
npx skills add https://github.com/obra/superpowers --skill systematic-debugging
Two skills cover different sides of security review. The wshobson/agents k8s-security-policies skill handles Kubernetes-specific hardening: NetworkPolicies, Pod Security Standards, RBAC, OPA Gatekeeper constraints, and service mesh mTLS configuration. The sickn33/antigravity-awesome-skills security-review skill covers application-level concerns like secrets management, SQL injection, XSS prevention, and input validation.
I asked Claude to check a Pulumi program that created an S3 bucket. Without the security skills, Claude confirmed the code was correct and moved on. With the skills loaded, it flagged that the bucket had no server-side encryption configured, the bucket policy allowed s3:* from an overly broad principal, and there was no access logging enabled. On the Kubernetes side, the k8s-security-policies skill catches things like missing default-deny NetworkPolicies and containers running as root. These skills are not a replacement for deterministic tools like tfsec, checkov, or trivy. Those catch known issues every time. Skills are probabilistic and work best as an extra layer during development, not as your only security gate.
| Skill | What it teaches Claude |
| --- | --- |
| k8s-security-policies | Network policies, pod security standards, RBAC, and admission control for defense-in-depth |
| security-review | Secrets management, input validation, SQL injection, XSS/CSRF prevention, and dependency auditing |
npx skills add https://github.com/wshobson/agents --skill k8s-security-policies
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill security-review
At 3am when something breaks, you want runbooks. The incident-runbook-templates skill from wshobson/agents helps Claude create these before you need them. It includes a four-level severity model (SEV1 through SEV4) with response time expectations, escalation decision trees, and communication templates for status updates.
When you ask Claude to document your deployment process, it produces runbooks with diagnostic steps, rollback protocols, and verification checks. It knows kubectl commands for Kubernetes recovery and SQL procedures for PostgreSQL troubleshooting. The output needs editing. Generated runbooks tend to be thorough on the happy path but thin on the failure modes that matter most at 3am. I treat them as a first draft that gets me to 60% in minutes instead of hours, then fill in the gaps from experience.
| Skill | What it teaches Claude |
| --- | --- |
| incident-runbook-templates | Detection, triage, mitigation, resolution, and communication procedures for production incidents |
npx skills add https://github.com/wshobson/agents --skill incident-runbook-templates
The skills above target specific problems. The jeffallan/claude-skills repository also includes two broader skills that cover the day-to-day work that does not fit neatly into one category.
The devops-engineer skill gives Claude a senior DevOps engineer persona covering CI/CD pipelines, container management, deployment strategies like blue-green and canary, and infrastructure as code across AWS, GCP, and Azure. It enforces constraints I care about: no deploying to production without approval, no secrets in code, no unversioned container images.
The sre-engineer skill focuses on reliability: SLO/SLI definitions, error budget calculations, golden signal dashboards, and toil reduction through automation. It produces Prometheus/Grafana configs, remediation runbooks, and reliability assessments. If you run production systems and want Claude to think about error budgets instead of just uptime, this is the skill.
Skill What it teaches Claude
devops-engineer
CI/CD pipelines, container management, deployment strategies, and infrastructure as code across clouds
sre-engineer
SLI/SLO management, error budgets, monitoring, automation, and incident response
npx skills add https://github.com/jeffallan/claude-skills --skill devops-engineer
npx skills add https://github.com/jeffallan/claude-skills --skill sre-engineer
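The error-budget arithmetic the sre-engineer skill reasons about is simple to sketch. This is an illustrative helper (the function name and defaults are ours, not part of the skill):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given availability SLO
    over a rolling window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

# A 99.9% SLO over a 30-day window leaves roughly 43.2 minutes of budget;
# each extra nine cuts the budget by a factor of ten.
print(round(error_budget_minutes(0.999), 1))
```

Once you think in budget minutes rather than uptime percentages, questions like "can we afford this risky deploy this month?" become arithmetic.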
Before you install everything in sight, a warning. Skills run with the same permissions as your AI agent. A malicious skill can exfiltrate credentials, download backdoors, or disable safety mechanisms, and it will look like your agent doing it.
Snyk researchers published ToxicSkills in February 2026 after scanning 3,984 skills from public registries. 13.4% had critical-level vulnerabilities, and they found 76 confirmed malicious payloads. The attack techniques included base64-encoded commands that steal AWS credentials, skills that direct you to download password-protected executables from attacker infrastructure, and jailbreak attempts that try to disable safety mechanisms. 91% of malicious skills combine code-level malware with prompt injection, so they attack on two fronts simultaneously.
Treat skills like you treat any third-party dependency:
Read the source before installing. Skills are markdown and YAML files. If you cannot read the full skill in a few minutes, that is a red flag.
Check the repository. Look at stars, contributors, and commit history. A single-commit repository from an unknown account deserves scrutiny.
Run uvx mcp-scan@latest --skills to scan installed skills for known malicious patterns, prompt injection, and credential exposure.
Be cautious with skills that fetch external content at runtime. The Snyk research found 17.7% of skills on ClawHub pull from third-party URLs, which means the skill’s behavior can change after you install it.
Stick to known repositories. Every skill recommended in this post comes from a repository with visible maintainers and community activity.
Eight malicious skills were still publicly available on ClawHub when Snyk published their findings. The skills ecosystem is young, and the vetting infrastructure is still catching up.
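If you want a feel for what scanners look for, the two most common signals in the Snyk findings, long base64 blobs that decode to shell commands and external URLs fetched at runtime, are easy to check for crudely. This is a toy heuristic for illustration only, not a substitute for a real scanner like mcp-scan:

```python
import base64
import re
from typing import List

# Naive heuristics: a run of base64-looking characters long enough to hide
# a command, and any http(s) URL the skill might fetch at runtime.
B64_RE = re.compile(r"[A-Za-z0-9+/=]{40,}")
URL_RE = re.compile(r"https?://[^\s)\"']+")

def flag_suspicious(text: str) -> List[str]:
    """Return human-readable findings for one skill file's contents."""
    findings = []
    for blob in B64_RE.findall(text):
        try:
            decoded = base64.b64decode(blob, validate=True).decode("utf-8", "ignore")
        except Exception:
            continue  # not valid base64, ignore
        # Decoded content that looks like a shell pipeline is a red flag.
        if any(tok in decoded for tok in ("curl ", "wget ", "| sh", "aws ")):
            findings.append(f"base64 payload decodes to: {decoded[:60]!r}")
    for url in URL_RE.findall(text):
        findings.append(f"fetches external content: {url}")
    return findings
```

A real scanner does far more (AST analysis, prompt-injection detection, known-payload signatures), but even this level of inspection is more than most people do before installing.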
Stacking skills is where this pays off. Install the Pulumi skills and Claude writes better infrastructure code. Add monitoring and security on top and you start catching problems that used to slip through to production.
A note on stacking: I have not hit conflicts running all of these simultaneously, but more skills means more descriptions for Claude to evaluate at startup. If you notice Claude getting slower or making odd choices, pare back to the skills you actually use for that project. Start with the Pulumi and monitoring skills, add others as you need them.
Here is how to set up a new project with all of them:
mkdir -p pulumi-skills-demo && cd pulumi-skills-demo
pulumi new aws-typescript --name skills-demo --yes
npx skills add https://github.com/dirien/claude-skills --skill pulumi-typescript
npx skills add https://github.com/pulumi/agent-skills --skill pulumi-esc
npx skills add https://github.com/pulumi/agent-skills --skill pulumi-best-practices
npx skills add https://github.com/obra/superpowers --skill systematic-debugging
npx skills add https://github.com/jeffallan/claude-skills --skill monitoring-expert
npx skills add https://github.com/jeffallan/claude-skills --skill kubernetes-specialist
npx skills add https://github.com/wshobson/agents --skill gitops-workflow
npx skills add https://github.com/wshobson/agents --skill github-actions-templates
npx skills add https://github.com/wshobson/agents --skill cost-optimization
npx skills add https://github.com/wshobson/agents --skill incident-runbook-templates
npx skills add https://github.com/jeffallan/claude-skills --skill devops-engineer
npx skills add https://github.com/jeffallan/claude-skills --skill sre-engineer
npx skills add https://github.com/wshobson/agents --skill k8s-security-policies
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill security-review
Then try these two prompts to see how many skills activate at once:
# Static website — triggers Pulumi TypeScript, monitoring, and security skills
Create a Pulumi TypeScript program for a static website on AWS with S3, CloudFront, OIDC credentials via Pulumi ESC, CloudWatch monitoring, and /security-review the infrastructure before deploying
# EKS cluster — stacks Kubernetes, GitOps, incident response, cost, and SRE skills
Create a Pulumi TypeScript program for an EKS cluster with /kubernetes-specialist security hardening, /gitops-workflow for ArgoCD deployment, /incident-runbook-templates for the cluster, /cost-optimization recommendations, and /sre-engineer SLO definitions for the services
The first prompt triggers the Pulumi TypeScript, monitoring, and security review skills in a single conversation. The second stacks Kubernetes, GitOps, incident response, cost, and SRE skills on one cluster build. You get infrastructure code, operational runbooks, and security policies from a single request.
Fair warning: not every skill works perfectly on the first try. Some need iteration. Some produce output that you have to review and tweak before it matches your standards. Skills do not replace your judgment.
That said, after a few weeks with these skills installed, I stopped correcting the same mistakes. The code Claude writes now looks like code I would write, not code I would have to fix. That is the whole point. Skills just stop you from repeating the same corrections across every conversation.
Every skill in this post includes its install command. Pick the section that matches your biggest pain point, run the npx skills add command, and try it on your next task. Skills work in Claude Code, Cursor, GitHub Copilot, and anything else that supports the Agent Skills standard.
The Pulumi Agent Skills announcement has more details, and the GitHub repository has the source. If you want something that goes further, with organizational context and deployment governance, look at Pulumi Neo. Neo is grounded in your actual infrastructure, not internet patterns. The 10 things you can do with Neo post shows what that looks like in practice.
Give it one project. That is all it took for me.
Neo now reads AGENTS.md files, the open standard for giving AI coding tools context about your project. If you’re already using AGENTS.md, Neo will pick up those same instructions automatically.
Every codebase has conventions that aren’t captured in linters or formatters. Maybe your team uses a specific naming pattern for infrastructure resources. Maybe there’s a particular way you structure tests, or commands that need to run in a certain order. These are the things you’d explain to a new team member, and now you can explain them to AI tools too.
Without something like AGENTS.md, you end up repeating yourself. Every conversation starts with “remember to use TypeScript” or “make sure you add the environment tag.” It’s tedious, and things slip through.
AGENTS.md gives these instructions a home. You write them once, commit the file to your repo, and any tool that supports the format picks them up automatically.
Think about what you’d tell someone on their first day working in the codebase. How do you run tests? Are there naming conventions? Any gotchas they should know about?
Here’s an example for a Pulumi project:
# Infrastructure conventions
Run tests with `make test`. This spins up LocalStack, so Docker must be running.
Stacks are named `{service}-{region}-{env}` (e.g., `payments-us-west-2-prod`).
Only the platform team deploys to prod stacks.
All resources need these tags: `cost-center`, `team`, `environment`.
Reusable components live in `components/`. Check there before writing
something new.
There’s no required structure, just markdown. Some teams write a few lines, others write detailed guides. Start small and add things as you notice yourself repeating instructions.
When you point Neo at a repository, it reads any AGENTS.md file it finds and applies those instructions to its work. You don’t need to mention the file or remind Neo about your conventions.
If you have a monorepo, you can put AGENTS.md files in subdirectories too. Neo uses the nearest one to wherever it’s working, so you can have general instructions at the root and more specific ones in subpackages.
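That nearest-file behavior can be sketched as a walk up the directory tree. This is an illustrative helper, not Neo's actual implementation:

```python
from pathlib import Path
from typing import Optional

def nearest_agents_md(start: Path, repo_root: Path) -> Optional[Path]:
    """Walk from the working directory up to the repo root and return the
    first AGENTS.md found, mirroring nearest-file resolution in a monorepo."""
    current = start.resolve()
    root = repo_root.resolve()
    while True:
        candidate = current / "AGENTS.md"
        if candidate.is_file():
            return candidate
        if current == root or current == current.parent:
            return None  # reached the repo root (or filesystem root)
        current = current.parent
```

A file in `packages/payments/AGENTS.md` wins over the root `AGENTS.md` for work inside that package, while everything else falls back to the root file.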
Your instructions in conversation always take precedence, so you can override the file when you need to. If you’ve also set up Custom Instructions at the organization level, Neo applies those first, then AGENTS.md on top.
AGENTS.md is an open format supported by most AI coding tools: Cursor, Windsurf, GitHub Copilot, Zed, and now Neo. If your team uses different tools for different tasks, they’ll all follow the same project conventions without any extra configuration.
The format is managed by the Agentic AI Foundation under the Linux Foundation, and it’s already in use in over 60,000 open source projects. See agents.md for the full specification.
Add an AGENTS.md file to your repository and Neo will start using it on your next task. For more on configuring Neo, including organization-wide Custom Instructions and Slash Commands, see the Settings documentation.
We’re thrilled to announce that the Pulumi Cloud REST API is now described by an OpenAPI 3.0 specification, and we’re just getting started.
This feature has been a long time coming. We heard your requests for OpenAPI support loud and clear, and we’re excited to share that we not only have a published specification for consumption, but our API code is now built from that specification as well. Moving forward, this single source of truth unlocks better tooling, tighter integration, and a more predictable API experience for everyone.
You can fetch the spec directly from the API at runtime or use it for client generation, validation, and documentation, all from one machine-readable contract.
The Pulumi Cloud API powers the Pulumi CLI, the Pulumi Console, and third-party integrations. Until now, there was no single, published machine-readable description of that API. We’ve changed that. The API is now defined and served as a standard OpenAPI 3.0.3 document.
Runtime discovery: You can retrieve the spec from the API itself, so your tooling always sees the same surface the service implements.
Client generation: Use your favorite OpenAPI tooling (e.g. OpenAPI Generator, Swagger Codegen) to generate API clients in the language of your choice.
Validation and testing: Validate requests and responses, or build mocks and tests, from the same spec the service uses.
Documentation: The spec is the source of truth, not a separate, hand-maintained API doc that can drift from reality. Load the spec into Swagger UI, Redoc, or another viewer to browse the Pulumi Cloud API interactively.
Send a GET request to:
https://api.pulumi.com/api/openapi/pulumi-spec.json
No authentication is required. The response is the OpenAPI 3.0 document for the Pulumi Cloud API, describing the supported, documented API surface.
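As a quick taste of what you can do with the document, here is a small sketch that flattens an OpenAPI spec into a list of operations. The fetch URL is the real endpoint above; the helper function and the stand-in spec fragment are ours, for illustration:

```python
import json
import urllib.request

SPEC_URL = "https://api.pulumi.com/api/openapi/pulumi-spec.json"

def list_operations(spec: dict) -> list:
    """Flatten an OpenAPI document's paths into 'METHOD path' strings."""
    ops = []
    for path, methods in spec.get("paths", {}).items():
        for method in methods:
            if method.lower() in {"get", "put", "post", "delete", "patch"}:
                ops.append(f"{method.upper()} {path}")
    return sorted(ops)

# Fetch the live document (no auth required):
#   spec = json.load(urllib.request.urlopen(SPEC_URL))
# Illustrated here on a minimal stand-in fragment:
spec = {"openapi": "3.0.3", "paths": {"/api/user": {"get": {}}}}
print(list_operations(spec))  # ['GET /api/user']
```

The same document feeds straight into OpenAPI Generator or Swagger Codegen if you want a full typed client instead of hand-rolled requests.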
We do not hand-write the OpenAPI spec. We generate it from the same API definition that drives our backend and console code. When we add or change API routes or models, we regenerate the spec so the published document stays in sync with what the service actually implements. That gives you a clear, stable contract for the Pulumi Cloud API.
We are using this spec as the foundation for our own tooling and plan to keep building on it across our toolchain.
CLI: We plan to drive the Pulumi CLI’s API client from the OpenAPI spec so that CLI and API stay in lockstep.
Pulumi Service Provider: We are also building towards day 1 updates to the Pulumi Service Provider so that new and changed API resources are generated from the spec and ship in sync with the service.
Docs Enhancements: You can already load the spec into Swagger UI for your own browsing, but we also plan to ship enhancements to our public REST API docs that keep them in sync with the OpenAPI spec.
As we ship those updates, you will get a single source of truth from API to CLI to provider.
If you have questions or feedback about the OpenAPI spec or the Pulumi Cloud API, reach out in our Community Slack or open an issue in the Pulumi repository. We’re excited to see what you build with it.
Neo shows its work, but until now that context was only viewable by the user that initiated the conversation. When you wanted a teammate’s input on a decision Neo made, you had to describe it in Slack or screenshot fragments of the conversation. Today we’re introducing task sharing: share a read-only view of any Neo task with anyone in your organization, full context preserved.
To share a Neo task, click the share button to generate a read-only link, then send it to a teammate. They see the complete picture: the original prompt, Neo’s reasoning process, the actions it took, and the outcome. Instead of writing up what happened and losing detail in the retelling, you share the task itself.
We built this with security as a core constraint. The original task system enforced strict RBAC, ensuring users could only see and act on resources they had permission to access. Task sharing preserves these guarantees. Viewers can see the conversation with Neo, but they cannot trigger any actions, and links within the shared task to stacks or resources still enforce the viewer’s existing permissions.
The feature is available now. The next time you want a second opinion or need to show a colleague how you solved something, share the task. You’re no longer working alone.
AI coding assistants have transformed how developers write software, including infrastructure code. Tools like Claude Code, Cursor, and GitHub Copilot can generate code, explain complex systems, and automate tedious tasks. But when it comes to infrastructure, these tools often produce code that works but misses the mark on patterns that matter: proper secret handling, correct resource dependencies, idiomatic component structure, and the dozens of other details that separate working infrastructure from production-ready infrastructure.
We built Neo for teams that want deep Pulumi expertise combined with organizational context and deployment governance. But developers have preferred tools, and we want people to succeed with Pulumi wherever they work. Some teams live in Claude Code. Others use Cursor, Copilot, Codex, Gemini CLI, or other platforms. That is why we are releasing Pulumi Agent Skills, a collection of packaged expertise that teaches any AI coding assistant how to work with Pulumi the way an experienced practitioner would.
Skills are structured knowledge packages that follow the open Agent Skills specification. They work across multiple AI coding platforms including Claude Code, GitHub Copilot, Cursor, VS Code, Codex, and Gemini CLI. When you install Pulumi skills, your AI assistant gains access to detailed workflows, code patterns, and decision trees for common infrastructure tasks.
We are launching a set of skills organized into two plugin groups: authoring and migration. You can install all skills at once or choose specific plugin groups based on your needs.
The authoring plugin includes four skills focused on code quality, reusability, and configuration.
Pulumi best practices encodes the patterns that prevent common mistakes. It covers output handling, component structure, secrets management, safe refactoring with aliases, and deployment workflows. The skill flags anti-patterns that can cause issues with preview, dependencies, and production deployments.
Pulumi Component provides a complete guide for authoring ComponentResource classes. The skill covers designing component interfaces, multi-language support, and distribution. It teaches assistants how to build reusable infrastructure abstractions that work across TypeScript, Python, Go, C#, Java, and YAML.
Pulumi Automation API covers programmatic orchestration of Pulumi operations. The skill explains when to use Automation API versus the CLI, the tradeoffs between local source and inline programs, and patterns for multi-stack deployments.
Pulumi ESC covers centralized secrets and configuration management. The skill guides assistants through setting up dynamic OIDC credentials, composing environments, and integrating secrets into Pulumi programs and other applications.
The migration plugin converts and imports infrastructure from other tools to Pulumi. It includes four skills covering complete migration workflows, not just syntax translation.
Terraform to Pulumi walks through the full migration workflow. It handles state translation, provider version alignment, and the iterative process of achieving a clean pulumi preview with no unexpected changes.
CloudFormation to Pulumi covers the complete AWS CloudFormation migration workflow, from template conversion and stack import to handling CloudFormation-specific constructs.
CDK to Pulumi covers the complete AWS CDK migration workflow end to end, from conversion and import to handling CDK-specific constructs like Lambda-backed custom resources and cross-stack references.
Azure to Pulumi covers the complete Azure Resource Manager and Bicep migration workflow, handling template conversion and resource import with guidance on achieving zero-diff validation.
For Claude Code users, the plugin system provides the simplest installation experience:
claude plugin marketplace add pulumi/agent-skills
claude plugin install pulumi-authoring # Install authoring skills
claude plugin install pulumi-migration # Install migration skills
You can install both plugin groups or choose only the ones you need.
For Cursor, GitHub Copilot, VS Code, Codex, Gemini and other platforms, use the universal Agent Skills CLI:
npx skills add pulumi/agent-skills --skill '*'
This works across all platforms that support the Agent Skills specification.
Once installed, skills activate automatically based on context. When you ask your assistant to help migrate a Terraform project, it draws on the Terraform skill’s workflow. When you are debugging why resources are being recreated unexpectedly, the best practices skill helps the assistant check for missing aliases.
In Codex and Claude Code, you can invoke skills directly via slash commands.
/pulumi-terraform-to-pulumi
Or describe what you need in natural language:
“Help me migrate this CDK application to Pulumi”
“Review this Pulumi code for best practices issues”
“Create a reusable component for a web service with load balancer”
The assistant will follow the skill’s procedures, ask clarifying questions when needed, and produce output that reflects Pulumi best practices rather than generic code generation.
We expect this collection to grow. If you have Pulumi expertise worth packaging, whether provider-specific patterns, debugging workflows, or operational practices, we welcome contributions. See the contributing guide for details.
The skills are available now in the agent-skills repository. Install them in your preferred AI coding environment and let us know what you build.
Do you know what cloud resources are running in your environment right now? Many organizations struggle to maintain visibility across their cloud estate, especially for resources created outside of infrastructure as code. Without complete visibility, you can’t enforce compliance, optimize costs, or identify security risks.
Today, we’re excited to announce new resources in the Pulumi Service Provider that solve this problem by enabling you to discover all cloud resources and enforce governance policies programmatically using infrastructure as code.
With these new resources, you can:
Discover all cloud resources across AWS, Azure, GCP, Kubernetes, or OCI environments, including resources not managed by Pulumi
Import discovered resources into Pulumi management using Visual Import to bring unmanaged infrastructure under IaC control
Enforce compliance at scale by organizing resources into Policy Groups and applying policy packs
Automate governance workflows by managing everything through code, enabling GitOps and CI/CD integration
The Pulumi Service Provider now includes three new resources for managing cloud visibility and governance:
Resource Description
InsightsAccount
Configures cloud provider scanning for resource discovery
PolicyGroup
Organizes stacks or cloud accounts for policy enforcement
getPolicyPacks / getPolicyPack
Data sources for querying available policy packs
Let’s explore each of these in detail.
An InsightsAccount connects Pulumi Cloud to your cloud provider, enabling automated scanning and discovery of all resources in your environment. This gives you complete visibility into your cloud estate, including resources that aren’t managed by Pulumi.
Multi-cloud support: Scan AWS, Azure, GCP, Kubernetes, and OCI environments
Scheduled scanning: Configure daily automated scans or trigger them on-demand
Resource tagging: Organize your accounts with custom tags
name: insights-example
runtime: yaml
resources:
# ESC environment with AWS credentials
aws-credentials:
type: pulumiservice:Environment
properties:
organization: my-org
project: insights
name: aws-credentials
yaml:
fn::stringAsset: |
values:
aws:
login:
fn::open::aws-login:
oidc:
roleArn: arn:aws:iam::123456789012:role/PulumiInsightsRole
sessionName: pulumi-insights
environmentVariables:
AWS_REGION: us-west-2
# Insights account for AWS scanning
aws-insights:
type: pulumiservice:InsightsAccount
properties:
organizationName: my-org
accountName: production-aws
provider: aws
environment: insights/aws-credentials
scanSchedule: daily
providerConfig:
regions:
- us-west-2
- us-east-1
tags:
environment: production
team: platform
import * as pulumi from "@pulumi/pulumi";
import * as pulumiservice from "@pulumi/pulumiservice";
// ESC environment with AWS credentials
const awsCredentials = new pulumiservice.Environment("aws-credentials", {
organization: "my-org",
project: "insights",
name: "aws-credentials",
yaml: new pulumi.asset.StringAsset(`
values:
aws:
login:
fn::open::aws-login:
oidc:
roleArn: arn:aws:iam::123456789012:role/PulumiInsightsRole
sessionName: pulumi-insights
environmentVariables:
AWS_REGION: us-west-2
`),
});
// Insights account for AWS scanning
const awsInsights = new pulumiservice.InsightsAccount("aws-insights", {
organizationName: "my-org",
accountName: "production-aws",
provider: "aws",
environment: pulumi.interpolate`${awsCredentials.project}/${awsCredentials.name}`,
scanSchedule: "daily",
providerConfig: {
regions: ["us-west-2", "us-east-1"],
},
tags: {
environment: "production",
team: "platform",
},
});
import pulumi
import pulumi_pulumiservice as pulumiservice
# ESC environment with AWS credentials
aws_credentials = pulumiservice.Environment("aws-credentials",
organization="my-org",
project="insights",
name="aws-credentials",
yaml=pulumi.StringAsset("""
values:
aws:
login:
fn::open::aws-login:
oidc:
roleArn: arn:aws:iam::123456789012:role/PulumiInsightsRole
sessionName: pulumi-insights
environmentVariables:
AWS_REGION: us-west-2
"""))
# Insights account for AWS scanning
aws_insights = pulumiservice.InsightsAccount("aws-insights",
organization_name="my-org",
account_name="production-aws",
provider="aws",
environment=pulumi.Output.concat(aws_credentials.project, "/", aws_credentials.name),
scan_schedule="daily",
provider_config={
"regions": ["us-west-2", "us-east-1"],
},
tags={
"environment": "production",
"team": "platform",
})
package main
import (
"github.com/pulumi/pulumi-pulumiservice/sdk/go/pulumiservice"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
// ESC environment with AWS credentials
awsCredentials, err := pulumiservice.NewEnvironment(ctx, "aws-credentials", &pulumiservice.EnvironmentArgs{
Organization: pulumi.String("my-org"),
Project: pulumi.String("insights"),
Name: pulumi.String("aws-credentials"),
Yaml: pulumi.NewStringAsset(`
values:
aws:
login:
fn::open::aws-login:
oidc:
roleArn: arn:aws:iam::123456789012:role/PulumiInsightsRole
sessionName: pulumi-insights
environmentVariables:
AWS_REGION: us-west-2
`),
})
if err != nil {
return err
}
// Insights account for AWS scanning
_, err = pulumiservice.NewInsightsAccount(ctx, "aws-insights", &pulumiservice.InsightsAccountArgs{
OrganizationName: pulumi.String("my-org"),
AccountName: pulumi.String("production-aws"),
Provider: pulumi.String("aws"),
Environment: pulumi.Sprintf("%s/%s", awsCredentials.Project, awsCredentials.Name),
ScanSchedule: pulumi.String("daily"),
ProviderConfig: pulumi.Map{
"regions": pulumi.ToStringArray([]string{"us-west-2", "us-east-1"}),
},
Tags: pulumi.StringMap{
"environment": pulumi.String("production"),
"team": pulumi.String("platform"),
},
})
if err != nil {
return err
}
return nil
})
}
A PolicyGroup lets you organize resources and apply policy packs for compliance enforcement. Policy Groups support two entity types: Stacks and Accounts.
You can configure policy groups in two modes:
Audit mode: Reports policy violations without blocking operations. This supports both Stacks and Accounts.
Preventative mode: Blocks operations that violate policies. This mode is only available for Stacks.
Apply compliance policies to your Pulumi stacks:
name: policy-group-example
runtime: yaml
resources:
production-policies:
type: pulumiservice:PolicyGroup
properties:
name: production-compliance
organizationName: my-org
entityType: stacks
mode: preventative
stacks:
- name: production
routingProject: my-app
- name: staging
routingProject: my-app
policyPacks:
- name: cis-aws
displayName: CIS AWS Foundations Benchmark
version: 1
versionTag: "1.5.0"
import * as pulumiservice from "@pulumi/pulumiservice";
const productionPolicies = new pulumiservice.PolicyGroup("production-policies", {
name: "production-compliance",
organizationName: "my-org",
entityType: "stacks",
mode: "preventative",
stacks: [
{ name: "production", routingProject: "my-app" },
{ name: "staging", routingProject: "my-app" },
],
policyPacks: [
{
name: "cis-aws",
displayName: "CIS AWS Foundations Benchmark",
version: 1,
versionTag: "1.5.0",
},
],
});
Apply compliance policies to your cloud accounts for resource governance:
name: insights-policy-group
runtime: yaml
resources:
# Insights account
aws-insights:
type: pulumiservice:InsightsAccount
properties:
organizationName: my-org
accountName: production-aws
provider: aws
environment: insights/aws-credentials
# Policy group targeting the Insights account
cloud-compliance:
type: pulumiservice:PolicyGroup
properties:
name: cloud-resource-compliance
organizationName: my-org
entityType: accounts
mode: audit
accounts:
- ${aws-insights.accountName}
policyPacks:
- name: aws-security-best-practices
displayName: AWS Security Best Practices
version: 2
import * as pulumiservice from "@pulumi/pulumiservice";
// Insights account
const awsInsights = new pulumiservice.InsightsAccount("aws-insights", {
organizationName: "my-org",
accountName: "production-aws",
provider: "aws",
environment: "insights/aws-credentials",
});
// Policy group targeting the Insights account
const cloudCompliance = new pulumiservice.PolicyGroup("cloud-compliance", {
name: "cloud-resource-compliance",
organizationName: "my-org",
entityType: "accounts",
mode: "audit",
accounts: [awsInsights.accountName],
policyPacks: [
{
name: "aws-security-best-practices",
displayName: "AWS Security Best Practices",
version: 2,
},
],
});
Use the getPolicyPacks and getPolicyPack data sources to discover available policy packs in your organization:
name: policy-packs-query
runtime: yaml
variables:
# List all available policy packs
availablePacks:
fn::invoke:
function: pulumiservice:getPolicyPacks
arguments:
organizationName: my-org
return: policyPacks
outputs:
policyPacks: ${availablePacks}
import * as pulumiservice from "@pulumi/pulumiservice";
// List all available policy packs
const availablePacks = pulumiservice.getPolicyPacksOutput({
organizationName: "my-org",
});
export const policyPacks = availablePacks.policyPacks;
To start using these new resources:
Update the Pulumi Service Provider to the latest version:
npm install @pulumi/pulumiservice@latest
pip install --upgrade pulumi-pulumiservice
go get github.com/pulumi/pulumi-pulumiservice/sdk/go/pulumiservice@latest
dotnet add package Pulumi.PulumiService
No package installation is needed for YAML; just use the resources directly.
We’re excited to see how you use these new capabilities to improve visibility and governance across your cloud infrastructure. As always, we welcome your feedback in our Community Slack or on GitHub.
Update (January 2026): The lobster has molted into its final form! From Clawdbot to Moltbot to OpenClaw. With 100k+ GitHub stars and 2M visitors in a week, the project finally has a name that’ll stick. The CLI command is now openclaw and the new handle is @openclaw. Same mission: AI that actually does things. Your assistant. Your machine. Your rules. See the official getting started guide for updated installation instructions.
OpenClaw is everywhere right now. The open-source AI assistant gained 9,000 GitHub stars in a single day, received public praise from former Tesla AI head Andrej Karpathy, and has sparked a global run on Mac Minis as developers scramble to give this “lobster assistant” a home. Users are calling it “Jarvis living in a hard drive” and “Claude with hands”—the personal AI assistant that Siri promised but never delivered.
The Mac Mini craze is real: people are buying dedicated hardware just to run OpenClaw, with some enthusiasts purchasing 40 Mac Minis at once. Even Logan Kilpatrick from Google DeepMind couldn’t resist ordering one. But here’s the thing: you don’t actually need a Mac Mini. OpenClaw runs anywhere: on a VPS, in the cloud, or on that old laptop gathering dust.
With all this hype, I had to try it myself. But instead of clicking through the AWS console or running manual commands on a VPS, I wanted to do it right from the start: infrastructure as code with Pulumi. Why? Because when I inevitably want to tear it down, spin up a new instance, or deploy to a different region, I don’t want to remember which buttons I clicked three weeks ago. I want a single pulumi up command.
In this post, I’ll show you how to deploy OpenClaw to AWS or Hetzner Cloud (if you want European data residency or just want to spend less). We’ll use Pulumi to define the infrastructure and Tailscale to keep your AI assistant off the public internet.
OpenClaw is an open-source AI assistant created by Peter Steinberger that runs on your own infrastructure. It connects to WhatsApp, Slack, Discord, Google Chat, Signal, and iMessage. It can control browsers, generate videos and images, clone your voice for voice notes, and run scheduled tasks via cron. There’s a skills system for extending functionality, and you can run it on pretty much anything: Mac Mini, Raspberry Pi, VPS, laptop, or gaming PC.
The difference from cloud-hosted AI? OpenClaw runs on your server, not Anthropic’s. It’s available 24/7 across all your devices, can schedule automated tasks, and keeps your entire conversation history locally. Check the official OpenClaw documentation for the full feature list.
Before getting started, ensure you have:
Pulumi CLI installed and configured
AWS account (for AWS deployment)
Hetzner Cloud account (for European deployment)
Anthropic API key
Node.js 18+ installed
Tailscale account with HTTPS enabled (one-time setup in admin console)
This guide uses Anthropic’s API, but OpenClaw works with other providers too. Check the providers documentation if you’d rather use OpenAI, Google Gemini, or a local model via Ollama.
OpenClaw uses a gateway-centric architecture where a single daemon acts as the control plane for all messaging, tool execution, and client connections:
| Component | Port | Description |
| --- | --- | --- |
| Gateway | 18789 | WebSocket server handling channels, nodes, sessions, and hooks |
| Browser control | 18791 | Headless Chrome instance for web automation |
| Docker sandbox | n/a | Isolated container environment for running tools safely |
The Gateway connects to messaging platforms (WhatsApp, Slack, Discord, etc.), the CLI, the web UI, and mobile apps. The Browser component lets OpenClaw open web pages, fill forms, scrape data, and download files. Docker sandboxing runs bash commands in isolated containers so your bot can execute code without risking your host system.
Deploying OpenClaw means handling sensitive credentials: API keys, auth tokens, cloud provider secrets. You don’t want these hardcoded or scattered across environment variables. Pulumi ESC (Environments, Secrets, and Configuration) stores them securely and passes them directly to your Pulumi program.
Create a new ESC environment:
pulumi env init /openclaw-secrets
Add your secrets to the environment:
values:
  anthropicApiKey:
    fn::secret: "sk-ant-xxxxx"
  tailscaleAuthKey:
    fn::secret: "tskey-auth-xxxxx"
  tailnetDnsName: "tailxxxxx.ts.net"
  hcloudToken:
    fn::secret: "your-hetzner-api-token"
  pulumiConfig:
    anthropicApiKey: ${anthropicApiKey}
    tailscaleAuthKey: ${tailscaleAuthKey}
    tailnetDnsName: ${tailnetDnsName}
    hcloud:token: ${hcloudToken}
**Tip:** To find your Tailnet DNS name, go to the Tailscale admin console, look under the DNS section, and find your tailnet name (e.g., tailxxxxx.ts.net). This is the domain suffix used for all machines in your Tailscale network.
Then create a Pulumi.dev.yaml file in your project to reference the environment:
environment:
- /openclaw-secrets
This approach keeps your secrets out of your codebase and passes them directly to OpenClaw during automated onboarding.
By default, deploying OpenClaw exposes SSH (port 22), the gateway (port 18789), and browser control (port 18791) to the public internet. This is convenient for testing but not ideal for production use.
Tailscale creates a secure mesh VPN that lets you access your OpenClaw instance without exposing unnecessary ports publicly. When you provide a Tailscale auth key, the Pulumi program:
Removes gateway and browser ports from public access
Keeps SSH as fallback for debugging if Tailscale setup fails
Installs Tailscale on the instance during provisioning (after other dependencies)
Enables Tailscale SSH so you can SSH via Tailscale without managing keys
Joins your Tailnet automatically using the auth key
**Note:** The Pulumi program installs Docker, Node.js, and OpenClaw first, then configures Tailscale last. This ensures that even if the Tailscale auth key is invalid or expired, you can still SSH in via the public IP to troubleshoot.
To generate a Tailscale auth key:
Go to Tailscale Admin Console
Click “Generate auth key”
Enable “Reusable” if you plan to redeploy
Copy the key and add it to your ESC environment
Let’s walk through the complete AWS deployment. Create a new Pulumi project:
mkdir openclaw-aws && cd openclaw-aws
pulumi new typescript
Install the required dependencies:
npm install @pulumi/aws @pulumi/tls
**Warning:** Do not use t3.micro instances for OpenClaw. The 1 GB of memory is insufficient for installation. Use t3.medium (4 GB) or t3.large (8 GB) instead.
Running OpenClaw on AWS means setting up a VPC, subnets, security groups, an EC2 instance, SSH keys, and a cloud-init script that installs everything. That’s a lot of clicking in the AWS console. The Pulumi program below defines all of it in code.
Replace the contents of index.ts with the following:
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
import * as tls from "@pulumi/tls";
const config = new pulumi.Config();
const instanceType = config.get("instanceType") ?? "t3.medium";
const anthropicApiKey = config.requireSecret("anthropicApiKey");
const model = config.get("model") ?? "anthropic/claude-sonnet-4";
const enableSandbox = config.getBoolean("enableSandbox") ?? true;
const gatewayPort = config.getNumber("gatewayPort") ?? 18789;
const browserPort = config.getNumber("browserPort") ?? 18791;
const tailscaleAuthKey = config.requireSecret("tailscaleAuthKey");
const tailnetDnsName = config.require("tailnetDnsName");
// Generate a random token for gateway authentication
const gatewayToken = new tls.PrivateKey("openclaw-gateway-token", {
algorithm: "ED25519",
}).publicKeyOpenssh.apply(key => {
// Create a deterministic token from the public key (take first 48 hex chars)
const hash = require("crypto").createHash("sha256").update(key).digest("hex");
return hash.substring(0, 48);
});
const sshKey = new tls.PrivateKey("openclaw-ssh-key", {
algorithm: "ED25519",
});
const vpc = new aws.ec2.Vpc("openclaw-vpc", {
cidrBlock: "10.0.0.0/16",
enableDnsHostnames: true,
enableDnsSupport: true,
tags: { Name: "openclaw-vpc" },
});
const gateway = new aws.ec2.InternetGateway("openclaw-igw", {
vpcId: vpc.id,
tags: { Name: "openclaw-igw" },
});
const subnet = new aws.ec2.Subnet("openclaw-subnet", {
vpcId: vpc.id,
cidrBlock: "10.0.1.0/24",
mapPublicIpOnLaunch: true,
tags: { Name: "openclaw-subnet" },
});
const routeTable = new aws.ec2.RouteTable("openclaw-rt", {
vpcId: vpc.id,
routes: [
{
cidrBlock: "0.0.0.0/0",
gatewayId: gateway.id,
},
],
tags: { Name: "openclaw-rt" },
});
new aws.ec2.RouteTableAssociation("openclaw-rta", {
subnetId: subnet.id,
routeTableId: routeTable.id,
});
const securityGroup = new aws.ec2.SecurityGroup("openclaw-sg", {
vpcId: vpc.id,
description: "Security group for OpenClaw instance",
ingress: [
{
description: "SSH access (fallback)",
fromPort: 22,
toPort: 22,
protocol: "tcp",
cidrBlocks: ["0.0.0.0/0"],
},
],
egress: [
{
fromPort: 0,
toPort: 0,
protocol: "-1",
cidrBlocks: ["0.0.0.0/0"],
},
],
tags: { Name: "openclaw-sg" },
});
const keyPair = new aws.ec2.KeyPair("openclaw-keypair", {
publicKey: sshKey.publicKeyOpenssh,
});
const ami = aws.ec2.getAmiOutput({
owners: ["099720109477"],
mostRecent: true,
filters: [
{ name: "name", values: ["ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-*"] },
{ name: "virtualization-type", values: ["hvm"] },
],
});
const userData = pulumi
.all([tailscaleAuthKey, anthropicApiKey, gatewayToken])
.apply(([tsAuthKey, apiKey, gwToken]) => {
return `#!/bin/bash
set -e
export DEBIAN_FRONTEND=noninteractive
# System updates
apt-get update
apt-get upgrade -y
# Install Docker
curl -fsSL https://get.docker.com | sh
systemctl enable docker
systemctl start docker
usermod -aG docker ubuntu
# Install NVM and Node.js for ubuntu user
sudo -u ubuntu bash <<'UBUNTU_SCRIPT'
set -e
cd ~
# Install NVM
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
# Load NVM
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh"
# Install Node.js 22
nvm install 22
nvm use 22
nvm alias default 22
# Install OpenClaw
npm install -g openclaw@latest
# Add NVM to bashrc if not already there
if ! grep -q 'NVM_DIR' ~/.bashrc; then
echo 'export NVM_DIR="$HOME/.nvm"' >> ~/.bashrc
echo '[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh"' >> ~/.bashrc
fi
UBUNTU_SCRIPT
# Set environment variables for ubuntu user
echo 'export ANTHROPIC_API_KEY="${apiKey}"' >> /home/ubuntu/.bashrc
# Install and configure Tailscale
echo "Installing Tailscale..."
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up --authkey="${tsAuthKey}" --ssh || echo "WARNING: Tailscale setup failed. Run 'sudo tailscale up' manually."
# Enable systemd linger for ubuntu user (required for user services to run at boot)
loginctl enable-linger ubuntu
# Start user's systemd instance (required for user services during cloud-init)
systemctl start user@1000.service
# Run OpenClaw onboarding as ubuntu user (skip daemon install, do it separately)
echo "Running OpenClaw onboarding..."
sudo -H -u ubuntu ANTHROPIC_API_KEY="${apiKey}" GATEWAY_PORT="${gatewayPort}" bash -c '
export HOME=/home/ubuntu
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh"
openclaw onboard --non-interactive --accept-risk \
--mode local \
--auth-choice apiKey \
--gateway-port $GATEWAY_PORT \
--gateway-bind loopback \
--skip-daemon \
--skip-skills || echo "WARNING: OpenClaw onboarding failed. Run openclaw onboard manually."
'
# Install daemon service with XDG_RUNTIME_DIR set
echo "Installing OpenClaw daemon..."
sudo -H -u ubuntu XDG_RUNTIME_DIR=/run/user/1000 bash -c '
export HOME=/home/ubuntu
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh"
openclaw daemon install || echo "WARNING: Daemon install failed. Run openclaw daemon install manually."
'
# Configure gateway for Tailscale Serve (trustedProxies + skip device pairing + set token)
echo "Configuring gateway for Tailscale Serve..."
sudo -H -u ubuntu GATEWAY_TOKEN="${gwToken}" python3 <<'PYTHON_SCRIPT'
import json
import os
config_path = "/home/ubuntu/.openclaw/openclaw.json"
with open(config_path) as f:
    config = json.load(f)
config["gateway"]["trustedProxies"] = ["127.0.0.1"]
config["gateway"]["controlUi"] = {
    "enabled": True,
    "allowInsecureAuth": True
}
config["gateway"]["auth"] = {
    "mode": "token",
    "token": os.environ["GATEWAY_TOKEN"]
}
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
print("Configured gateway with trustedProxies, controlUi, and token")
PYTHON_SCRIPT
# Enable Tailscale HTTPS proxy (requires HTTPS to be enabled in Tailscale admin console)
echo "Enabling Tailscale HTTPS proxy..."
tailscale serve --bg ${gatewayPort} || echo "WARNING: tailscale serve failed. Enable HTTPS in your Tailscale admin console first."
echo "OpenClaw setup complete!"
`;
});
const instance = new aws.ec2.Instance("openclaw-instance", {
ami: ami.id,
instanceType: instanceType,
subnetId: subnet.id,
vpcSecurityGroupIds: [securityGroup.id],
keyName: keyPair.keyName,
userData: userData,
userDataReplaceOnChange: true,
rootBlockDevice: {
volumeSize: 30,
volumeType: "gp3",
},
tags: { Name: "openclaw" },
});
export const publicIp = instance.publicIp;
export const publicDns = instance.publicDns;
export const privateKey = sshKey.privateKeyOpenssh;
// Construct the Tailscale MagicDNS hostname from the private IP
// AWS private IPs like 10.0.1.15 become hostnames like ip-10-0-1-15
const tailscaleHostname = instance.privateIp.apply(ip =>
`ip-${ip.replace(/\./g, "-")}`
);
export const tailscaleUrlWithToken = pulumi.interpolate`https://${tailscaleHostname}.${tailnetDnsName}/?token=${gatewayToken}`;
export const gatewayTokenOutput = gatewayToken;
Hetzner Cloud is a solid choice if you need European data residency or want to spend less money. Spoiler: it’s a lot less money.
Hetzner has similar concepts to AWS but different names. EC2 instances become Servers. Security groups become Firewalls. Same idea, different provider. The resource types come from @pulumi/hcloud.
Create a new project for Hetzner:
mkdir openclaw-hetzner && cd openclaw-hetzner
pulumi new typescript
Install the Hetzner provider:
npm install @pulumi/hcloud @pulumi/tls
**Note:** The default server type cax21 is an ARM-based (Ampere) instance with 4 vCPUs and 8 GB RAM. ARM instances cost less for the same compute. If you need x86 architecture, use ccx13 or a similar CCX-series type instead.
Replace index.ts with the following:
import * as pulumi from "@pulumi/pulumi";
import * as hcloud from "@pulumi/hcloud";
import * as tls from "@pulumi/tls";
const config = new pulumi.Config();
const serverType = config.get("serverType") ?? "cax21";
const location = config.get("location") ?? "fsn1";
const anthropicApiKey = config.requireSecret("anthropicApiKey");
const model = config.get("model") ?? "anthropic/claude-sonnet-4";
const enableSandbox = config.getBoolean("enableSandbox") ?? true;
const gatewayPort = config.getNumber("gatewayPort") ?? 18789;
const browserPort = config.getNumber("browserPort") ?? 18791;
const tailscaleAuthKey = config.requireSecret("tailscaleAuthKey");
const tailnetDnsName = config.require("tailnetDnsName");
// Generate a random token for gateway authentication
const gatewayToken = new tls.PrivateKey("openclaw-gateway-token", {
algorithm: "ED25519",
}).publicKeyOpenssh.apply(key => {
const hash = require("crypto").createHash("sha256").update(key).digest("hex");
return hash.substring(0, 48);
});
const sshKey = new tls.PrivateKey("openclaw-ssh-key", {
algorithm: "ED25519",
});
const hcloudSshKey = new hcloud.SshKey("openclaw-sshkey", {
publicKey: sshKey.publicKeyOpenssh,
});
const firewallRules: hcloud.types.input.FirewallRule[] = [
{
direction: "out",
protocol: "tcp",
port: "any",
destinationIps: ["0.0.0.0/0", "::/0"],
description: "Allow all outbound TCP",
},
{
direction: "out",
protocol: "udp",
port: "any",
destinationIps: ["0.0.0.0/0", "::/0"],
description: "Allow all outbound UDP",
},
{
direction: "out",
protocol: "icmp",
destinationIps: ["0.0.0.0/0", "::/0"],
description: "Allow all outbound ICMP",
},
{
direction: "in",
protocol: "tcp",
port: "22",
sourceIps: ["0.0.0.0/0", "::/0"],
description: "SSH access (fallback)",
},
];
const firewall = new hcloud.Firewall("openclaw-firewall", {
rules: firewallRules,
});
const userData = pulumi
.all([tailscaleAuthKey, anthropicApiKey, gatewayToken])
.apply(([tsAuthKey, apiKey, gwToken]) => {
return `#!/bin/bash
set -e
export DEBIAN_FRONTEND=noninteractive
# System updates
apt-get update
apt-get upgrade -y
# Install Docker
curl -fsSL https://get.docker.com | sh
systemctl enable docker
systemctl start docker
# Create ubuntu user (Hetzner uses root by default)
useradd -m -s /bin/bash -G docker ubuntu || true
# Install NVM and Node.js for ubuntu user
sudo -u ubuntu bash <<'UBUNTU_SCRIPT'
set -e
cd ~
# Install NVM
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
# Load NVM
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh"
# Install Node.js 22
nvm install 22
nvm use 22
nvm alias default 22
# Install OpenClaw
npm install -g openclaw@latest
# Add NVM to bashrc if not already there
if ! grep -q 'NVM_DIR' ~/.bashrc; then
echo 'export NVM_DIR="$HOME/.nvm"' >> ~/.bashrc
echo '[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh"' >> ~/.bashrc
fi
UBUNTU_SCRIPT
# Set environment variables for ubuntu user
echo 'export ANTHROPIC_API_KEY="${apiKey}"' >> /home/ubuntu/.bashrc
# Install and configure Tailscale
echo "Installing Tailscale..."
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up --authkey="${tsAuthKey}" --ssh || echo "WARNING: Tailscale setup failed. Run 'sudo tailscale up' manually."
# Enable systemd linger for ubuntu user (required for user services to run at boot)
loginctl enable-linger ubuntu
# Start user's systemd instance (required for user services during cloud-init)
systemctl start user@1000.service
# Run OpenClaw onboarding as ubuntu user (skip daemon install, do it separately)
echo "Running OpenClaw onboarding..."
sudo -H -u ubuntu ANTHROPIC_API_KEY="${apiKey}" GATEWAY_PORT="${gatewayPort}" bash -c '
export HOME=/home/ubuntu
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh"
openclaw onboard --non-interactive --accept-risk \
--mode local \
--auth-choice apiKey \
--gateway-port $GATEWAY_PORT \
--gateway-bind loopback \
--skip-daemon \
--skip-skills || echo "WARNING: OpenClaw onboarding failed. Run openclaw onboard manually."
'
# Install daemon service with XDG_RUNTIME_DIR set
echo "Installing OpenClaw daemon..."
sudo -H -u ubuntu XDG_RUNTIME_DIR=/run/user/1000 bash -c '
export HOME=/home/ubuntu
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh"
openclaw daemon install || echo "WARNING: Daemon install failed. Run openclaw daemon install manually."
'
# Configure gateway for Tailscale Serve (trustedProxies + skip device pairing + set token)
echo "Configuring gateway for Tailscale Serve..."
sudo -H -u ubuntu GATEWAY_TOKEN="${gwToken}" python3 <<'PYTHON_SCRIPT'
import json
import os
config_path = "/home/ubuntu/.openclaw/openclaw.json"
with open(config_path) as f:
    config = json.load(f)
config["gateway"]["trustedProxies"] = ["127.0.0.1"]
config["gateway"]["controlUi"] = {
    "enabled": True,
    "allowInsecureAuth": True
}
config["gateway"]["auth"] = {
    "mode": "token",
    "token": os.environ["GATEWAY_TOKEN"]
}
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
print("Configured gateway with trustedProxies, controlUi, and token")
PYTHON_SCRIPT
# Enable Tailscale HTTPS proxy (requires HTTPS to be enabled in Tailscale admin console)
echo "Enabling Tailscale HTTPS proxy..."
tailscale serve --bg ${gatewayPort} || echo "WARNING: tailscale serve failed. Enable HTTPS in your Tailscale admin console first."
echo "OpenClaw setup complete!"
`;
});
const server = new hcloud.Server("openclaw-server", {
serverType: serverType,
location: location,
image: "ubuntu-24.04",
sshKeys: [hcloudSshKey.id],
firewallIds: [firewall.id.apply(id => Number(id))],
userData: userData,
labels: {
purpose: "openclaw",
},
});
export const ipv4Address = server.ipv4Address;
export const privateKey = sshKey.privateKeyOpenssh;
// Construct the Tailscale MagicDNS hostname from the server name
// Hetzner servers use their name as the hostname
const tailscaleHostname = server.name;
export const tailscaleUrlWithToken = pulumi.interpolate`https://${tailscaleHostname}.${tailnetDnsName}/?token=${gatewayToken}`;
export const gatewayTokenOutput = gatewayToken;
You can find both programs in the Pulumi examples repo under openclaw/:
[github.com/pulumi/examples](https://github.com/pulumi/examples)
Before deploying, let’s compare the costs between AWS and Hetzner for running OpenClaw 24/7:
| | AWS (t3.medium) | Hetzner (cax21) |
| --- | --- | --- |
| vCPUs | 2 | 4 |
| Memory | 4 GB | 8 GB |
| Storage | 30 GB gp3 (+$2.40/mo) | 80 GB NVMe (included) |
| Traffic | Pay per GB | 20 TB included |
| Architecture | x86 (Intel/AMD) | ARM (Ampere) |
| Hourly price | $0.0416 | €0.0104 (~$0.011) |
| Monthly price | $33 (with storage) | €6.49 (~$7) |
| Annual cost | ~$396 | ~$84 |
Hetzner gives you double the vCPUs and double the RAM at less than a quarter of the price. The trade-off? ARM architecture instead of x86. But OpenClaw doesn’t care - it’s just Node.js and Docker.
**Note:** Prices are for on-demand instances as of January 2026. AWS prices are for us-east-1; Hetzner prices exclude VAT. Both include standard networking and storage. Check AWS EC2 pricing and Hetzner Cloud pricing for current rates.
With your ESC environment configured in Pulumi.dev.yaml, deploy with:
pulumi up
After deployment completes, you’ll see outputs similar to:
Outputs:
gatewayTokenOutput : "786c099cc8f8bf20dbebf40b8b51b75cf5cdab25..."
privateKey : [secret]
publicDns : "ec2-x-x-x-x.compute-1.amazonaws.com"
publicIp : "x.x.x.x"
tailscaleUrlWithToken: "https://ip-10-0-1-x.tailxxxxx.ts.net/?token=786c099..."
The tailscaleUrlWithToken output provides the complete URL with authentication token. Copy and paste it into your browser to access the OpenClaw web UI.
**Note:** Output names vary slightly between providers: AWS uses publicIp and publicDns, while Hetzner uses ipv4Address. The Tailscale hostname is derived from the instance’s private IP (AWS) or server name (Hetzner).
The Pulumi program runs OpenClaw’s non-interactive onboarding during instance provisioning. It uses your Anthropic API key from ESC, binds the gateway to loopback with Tailscale Serve as the HTTPS proxy, generates a secure gateway token (exported in Pulumi outputs), installs the daemon as a systemd user service, and configures trustedProxies and controlUi.allowInsecureAuth to skip device pairing when accessed via Tailscale.
The cloud-init script runs openclaw onboard --non-interactive with all necessary flags, then configures the gateway for secure Tailscale access. Your instance is ready as soon as provisioning finishes.
**Note:** Wait 3-5 minutes after pulumi up completes. The cloud-init script runs after the instance launches and needs time to install Docker, Node.js, OpenClaw, and Tailscale, then run the onboarding process and start the daemon. Periodically refresh the page until it loads. If the page still doesn’t load after 5 minutes, see Verify the deployment to troubleshoot.
The easiest way to access the OpenClaw web UI is to use the tailscaleUrlWithToken output from Pulumi:
# Get the full URL with token
pulumi stack output tailscaleUrlWithToken
Copy and paste this URL into your browser. The URL includes both the Tailscale MagicDNS hostname and the authentication token, so you can access the web UI directly.
**Tip:** To find your Tailnet DNS name, go to the Tailscale admin console and look under the DNS section for your tailnet name (e.g., tailxxxxx.ts.net). You can also find your machines and their MagicDNS hostnames in the Machines tab.
**Note:** Token-based authentication provides an additional layer of security on top of Tailscale’s network-level authentication. Only devices on your Tailnet can reach the URL, and the token prevents unauthorized access if someone gains access to your Tailnet.
From the web UI, you can connect messaging channels (WhatsApp, Discord, Slack), configure skills and integrations, and manage settings.
If you encounter issues accessing the web UI, you can SSH into your instance to troubleshoot:
# Check your Tailscale admin console for the new machine
ssh ubuntu@<machine-name>
# Check OpenClaw gateway status
systemctl --user status openclaw-gateway
Open the gateway dashboard using the tailscaleUrlWithToken output and use the built-in chat to test your assistant:
Your personal AI assistant is now running 24/7 on your own infrastructure, accessible securely through Tailscale.
When self-hosting an AI assistant, security matters. OpenClaw’s rapid adoption meant thousands of instances spun up in days, and not everyone locked them down. The community noticed:
The tweet isn’t exaggerating. A quick Shodan search shows exposed gateways on port 18789 with shell access, browser automation, and API keys up for grabs:
Don’t let your instance be one of them.
| Concern | Without Tailscale | With Tailscale |
| --- | --- | --- |
| SSH access | Public (port 22 open) | Public fallback + Tailscale SSH |
| Gateway access | Public (port 18789 open) | Private (Tailscale only) |
| Browser control | Public (port 18791 open) | Private (Tailscale only) |
| API keys in transit | Exposed if gateway accessed over HTTP | Protected by Tailscale encryption |
| Attack surface | 3 open ports | 1 open port (SSH fallback) |
**Note:** SSH remains accessible as a fallback even with Tailscale enabled. This allows you to troubleshoot if Tailscale fails to connect. Once you’ve confirmed Tailscale is working, you can manually remove the SSH ingress rule from your security group for maximum security.
My recommendations:
Always use Tailscale for production
Rotate your auth keys periodically
Use Pulumi ESC for secrets instead of hardcoding
Enable Tailscale SSH to avoid managing keys manually
Monitor your Tailscale admin console for unauthorized devices
Remove the SSH fallback after confirming Tailscale works if you want zero public ports
Now that OpenClaw is running, you can install skills (voice generation, video creation, browser automation), set up scheduled tasks with cron, invite colleagues to your Tailnet for shared access, or connect additional channels like WhatsApp and Discord.
Deploying OpenClaw with infrastructure as code means you can reproduce your setup anytime, version control it, and tear it down with a single command. Adding Tailscale keeps it private - no exposed ports, no hoping you configured your firewall correctly at 2am.
If you run into issues or have questions, drop by the Pulumi Community Slack or GitHub Discussions.
New to Pulumi? Get started here.
Pulumi IaC gives us a declarative interface to updates. When we perform an update, Pulumi calculates the difference between your currently deployed infrastructure and what is being proposed, then deploys only what is required to migrate from the old state to the new state. Normally, this is exactly what we want: we minimize the amount of work required to perform the update, and don’t recreate anything unnecessarily. However, every now and then, we want to override this behavior.
Recently, we talked about the new replaceWith option, which allows us to tell Pulumi that any replacement of one resource should always trigger a replacement of another (for example, if the database is replaced, we should replace our application server). Today, we’re going to take this idea one step further and talk about another new feature that gives us even more control over this process: replacement triggers.
Let’s imagine we have a resource that, upon creation, generates some private keys. If that’s all the resource does, it probably isn’t ever going to change as far as Pulumi is concerned: we wanted the resource before, and we want it after, so there is no change and no need to replace the resource that is already live. Now, let’s imagine that we want to cycle these private keys every month. How do we tell Pulumi that, if this update happens in a different month than the one in which this resource was last created, we want to recreate it?
As another example, perhaps we want to replace a resource every time some external version number is bumped. Let’s imagine a documentation server may need to be replaced every time a new version of an API is made available. We could use the --replace flag each time, but this process is error-prone to do manually, and would incur a maintenance burden to automate.
In essence, a replacement trigger is just a value attached to the resource as metadata. However, whenever this value changes between updates, it will trigger a replace operation on the given resource, regardless of whether anything else has changed.
Let’s take our previous example: we want something to be replaced every month. With replacement triggers, we can solve this by representing the current year and month as a string:
TypeScript:
...
const today = new Date()
const keyManager = new KeyManagerResource("key-manager", {}, {
replacementTrigger: today.getMonth() + '-' + today.getFullYear()
});
...
Python:
...
today = datetime.now()
trigger = f"{today.month}-{today.year}"
key_manager = KeyManagerResource("key-manager", {},
opts=pulumi.ResourceOptions(replacement_trigger=trigger))
...
Go:
...
today := time.Now()
trigger := fmt.Sprintf("%d-%d", int(today.Month()), today.Year())
keyManager, err := NewKeyManagerResource(ctx, "key-manager", &KeyManagerResourceArgs{},
pulumi.ReplacementTrigger(pulumi.Any(trigger)))
...
Java:
...
var today = LocalDate.now();
var trigger = String.format("%d-%d", today.getMonthValue(), today.getYear());
var keyManager = new KeyManagerResource("key-manager",
KeyManagerResourceArgs.Empty,
CustomResourceOptions.builder()
.replacementTrigger(Output.of(trigger))
.build());
...
C#:
...
var today = DateTime.Now;
var trigger = $"{today.Month}-{today.Year}";
var keyManager = new KeyManagerResource("key-manager", new KeyManagerResourceArgs(),
new CustomResourceOptions
{
ReplacementTrigger = Output.Create(trigger)
});
...
When we run this update for the first time, the replacement trigger is persisted to the Pulumi state. On each subsequent update, we re-calculate the date string and compare it against the one stored in the state. When we run the update again next month and the date string no longer matches, our key-manager will be replaced and new keys will be generated!
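The trigger value doesn’t have to be a month string; any value that changes deterministically when a replacement is due will work. As a sketch, here’s a hypothetical `quarterTrigger` helper (not part of any Pulumi SDK) that would rotate a resource once per calendar quarter instead:

```typescript
// Hypothetical helper: returns a string such as "2026-Q1" that changes
// exactly once per calendar quarter. Passed as a replacement trigger, it
// would cause four replacements per year instead of twelve.
function quarterTrigger(d: Date): string {
  const quarter = Math.floor(d.getMonth() / 3) + 1; // getMonth() is 0-based
  return `${d.getFullYear()}-Q${quarter}`;
}
```

Anything comparable works as a trigger: a version number read from a file, a hash of an external artifact, or a date bucket like this one.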
This feature is fully supported across all our SDKs as of v3.215.0. For more information about resource options, see the resource options documentation.
Thanks for reading, and feel free to reach out with any questions via GitHub, X, or our Community Slack.
The barrier to migrating to Pulumi has always been the infrastructure you already have. Your existing resources can’t be disrupted, and manually importing them into a new tool is risky and time-consuming. Today, we’re excited to share how Neo removes this barrier entirely with automated, zero-downtime migration to Pulumi from AWS CDK, AWS CloudFormation, Terraform, CDKTF, and Azure ARM templates.
The promise of Infrastructure as Code is that your code perfectly describes your running infrastructure. But switching IaC tools breaks this promise in dangerous ways.
When you rewrite your infrastructure code in a new tool, you have two choices, both problematic. You can destroy and recreate all your resources to match the new code, accepting downtime and risk. Or you can try to import existing resources, which requires perfect knowledge of how every resource maps between the old and new systems. Many teams get stuck here, wanting Pulumi’s modern platform but unable to safely make the switch.
The key to safe migration isn’t just converting code - it’s understanding the complete relationship between your existing IaC tool’s state and your actual cloud resources. Each IaC tool maintains this relationship differently: CDK through CloudFormation stacks, Terraform/CDKTF through state files, and ARM through Azure deployments. But they all have complete knowledge of what they manage.
Neo leverages this existing state knowledge to orchestrate perfect resource transitions. Instead of asking you to manually find resource IDs and construct migration commands, Neo reads your current tool’s state, discovers every resource’s physical identity, and brings them under Pulumi management. This isn’t just automation - it’s using the source tool’s own knowledge against the migration problem.
In practice, this means Neo can bridge the gap between how tools name resources and where they actually live. A Lambda function that CDK knows as OrderHandler9I0J1K2L actually exists in AWS as my-app-OrderHandler-9I0J1K2L, while a Terraform resource at address aws_instance.web[2] maps to EC2 instance i-0abc123def456. Neo understands these mappings and handles the complex cases like composite IDs (FunctionName|StatementId), resource references, and dependency chains that must be migrated in order.
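To make the composite-ID case concrete, here is a small illustrative sketch (the helper and the IDs are invented for this post, not Neo’s actual code) of splitting a `FunctionName|StatementId` composite ID:

```typescript
// Illustrative only: some physical IDs pack several identifiers into one
// string. A Lambda permission, for example, is addressed by both its
// function name and its statement ID, joined with "|".
function parseCompositeId(id: string): { functionName: string; statementId: string } {
  const parts = id.split("|");
  if (parts.length !== 2) {
    throw new Error(`expected FunctionName|StatementId, got: ${id}`);
  }
  return { functionName: parts[0], statementId: parts[1] };
}
```

Both halves have to resolve correctly before the resource can be brought under new management, which is part of why constructing these mappings by hand is so error-prone.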
Because Neo uses the source tool’s own state knowledge, your infrastructure doesn’t change at all during migration. Not a single resource is modified, recreated, or even touched. We’re simply transferring ownership from one IaC tool to another.
This approach delivers three critical guarantees:
Zero downtime: Resources are never deleted or recreated
Zero risk: Since nothing changes, you can abandon the migration at any point without consequence
Zero surprises: Preview confirms no infrastructure changes before you commit
Neo adapts its migration strategy to each tool’s unique characteristics while maintaining the same zero-downtime guarantee. Let’s explore how Neo handles migrations from each major IaC tool.
For teams using AWS CDK, Neo leverages the CloudFormation layer that underpins CDK deployments. CDK’s architecture actually makes migration straightforward: since CDK synthesizes to CloudFormation templates, Neo can read the deployed stacks directly to understand every resource and its configuration. For detailed migration steps, see our CDK migration guide.
The challenge with CDK migrations isn’t the CloudFormation layer - it’s the cryptic resource naming. CDK generates logical IDs like OrdersTableA7B2C3D4 that map to physical resources with completely different names. These mappings are buried in CloudFormation metadata, and getting them wrong means either orphaning resources or accidentally creating duplicates. Neo navigates this complexity by reading CloudFormation’s own stack outputs and resource metadata, discovering the exact physical ID for every logical resource.
CDK also introduces complexity through its construct hierarchy. A single high-level construct might expand into dozens of CloudFormation resources, each with dependencies and references to others. Neo preserves these relationships during migration, ensuring that IAM roles still reference the right Lambda functions, API Gateway deployments still point to the correct stages, and security groups maintain their exact rules. The migration completes with your infrastructure unchanged and Pulumi’s preview confirming zero modifications.
**Start a Neo task:** Migrate your CDK application
For teams using CloudFormation directly (rather than through CDK), Neo provides a streamlined migration path. CloudFormation stacks contain complete resource metadata - every resource’s logical ID maps to a physical resource in AWS, and CloudFormation tracks these relationships in its stack state. Neo reads this state directly to build a complete picture of your infrastructure. For detailed migration steps, see our CloudFormation migration guide.
The main challenge with CloudFormation migrations is the template language itself. CloudFormation templates use intrinsic functions like !Ref, !GetAtt, and !Sub that create implicit dependencies between resources. A security group might reference a VPC using !Ref MyVpc, while a Lambda function references its role's ARN with !GetAtt LambdaRole.Arn. Neo evaluates these expressions against the actual deployed stack to resolve every reference to its concrete value.
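A minimal sketch makes the resolution step concrete. The types and the stack snapshot below are invented sample data, not Neo's code; the point is that Ref and GetAtt expressions only gain meaning once evaluated against what is actually deployed:

```typescript
// Illustrative sketch (hypothetical, not Neo's code): resolve a template's
// Ref / GetAtt expressions against a snapshot of the deployed stack.
type Intrinsic =
  | { Ref: string }
  | { "Fn::GetAtt": [string, string] };

interface DeployedResource {
  physicalId: string;
  attributes: Record<string, string>;
}

function resolve(
  expr: Intrinsic,
  stack: Record<string, DeployedResource>
): string {
  if ("Ref" in expr) {
    return stack[expr.Ref].physicalId; // !Ref -> physical resource ID
  }
  const [logicalId, attr] = expr["Fn::GetAtt"];
  return stack[logicalId].attributes[attr]; // !GetAtt -> attribute value
}

// Invented stack snapshot matching the examples in the text:
const deployed: Record<string, DeployedResource> = {
  MyVpc: { physicalId: "vpc-0a1b2c3d", attributes: {} },
  LambdaRole: {
    physicalId: "my-lambda-role",
    attributes: { Arn: "arn:aws:iam::123456789012:role/my-lambda-role" },
  },
};

console.log(resolve({ Ref: "MyVpc" }, deployed)); // vpc-0a1b2c3d
console.log(resolve({ "Fn::GetAtt": ["LambdaRole", "Arn"] }, deployed));
```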
CloudFormation also supports features like conditionals, mappings, and nested stacks that add layers of indirection. Neo handles these by examining what actually got deployed rather than trying to interpret every possible template path. The result is Pulumi code that manages your exact infrastructure configuration - not a theoretical interpretation of your template. The migration completes with pulumi preview confirming zero changes to your running resources.
**Start a Neo task: Migrate your CloudFormation stack**
Terraform and CDKTF migrations require two transformations: converting state to establish Pulumi’s connection to your resources, and transforming HCL configuration into Pulumi code. The state conversion is direct - Neo reads your Terraform state and converts it into Pulumi state, preserving the mappings between resource names and cloud IDs. This ensures Pulumi knows that aws_instance.web[2] corresponds to EC2 instance i-0abc123def456.
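The shape of that state conversion can be sketched briefly. The interfaces below are a simplified, invented model of a Terraform state fragment, not Neo's converter; they show the resource-address-to-cloud-ID pairing that must survive the migration:

```typescript
// Illustrative sketch (invented shapes, not Neo's converter): extract the
// address -> cloud ID pairs from a minimal Terraform-state-like structure.
interface TfInstance {
  index_key?: number;            // present for count/for_each instances
  attributes: { id: string };    // the cloud provider's resource ID
}

interface TfResource {
  type: string;                  // e.g. "aws_instance"
  name: string;                  // e.g. "web"
  instances: TfInstance[];
}

function addressToId(resources: TfResource[]): Map<string, string> {
  const out = new Map<string, string>();
  for (const r of resources) {
    for (const inst of r.instances) {
      const addr =
        inst.index_key !== undefined
          ? `${r.type}.${r.name}[${inst.index_key}]`
          : `${r.type}.${r.name}`;
      out.set(addr, inst.attributes.id);
    }
  }
  return out;
}

// Sample data matching the example in the text:
const ids = addressToId([
  {
    type: "aws_instance",
    name: "web",
    instances: [{ index_key: 2, attributes: { id: "i-0abc123def456" } }],
  },
]);
console.log(ids.get("aws_instance.web[2]")); // i-0abc123def456
```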
The code transformation analyzes your HCL to generate equivalent Pulumi code. Neo handles Terraform patterns like count and for_each loops, module structures, and resource dependencies, recreating them idiomatically in Pulumi. The generated code manages your existing infrastructure without modifications - pulumi preview confirms zero changes. For complete migration instructions, see our Terraform migration guide.
**Start a Neo task: Migrate your Terraform application**
ARM templates present unique migration challenges. Unlike CDK and Terraform, which maintain clear separation between code and state, ARM templates blur this line. The template is both the definition and, through deployment history, part of the state tracking. ARM’s template expression language, with its concat functions and resource ID constructors, makes it difficult to determine what resources actually exist until deployment time. For step-by-step migration guidance, see our ARM migration guide.
Neo orchestrates ARM migrations through intelligent AI-driven conversion. When an ARM template uses functions like concat(parameters('appName'), '-plan'), the conversion process evaluates these expressions using the actual parameter values to generate the correct resource names. Azure resource IDs follow predictable patterns - subscription IDs, resource groups, providers, and resource names - and Neo ensures these are correctly brought under Pulumi management using inline resource specifications directly in the generated code.
The biggest challenge with ARM migrations is handling the implicit dependencies and resource provider quirks. An App Service might implicitly create a service plan, a SQL database requires a server that might be defined in a linked template, and child resources like application settings need separate migration steps. Neo understands these Azure-specific patterns and generates the appropriate Pulumi code to manage every resource. The migration completes with a zero-diff preview, confirming your exact Azure configuration is preserved while giving you a more maintainable, type-safe way to manage it going forward.
**Start a Neo task: Migrate your ARM template**
While each tool requires specific handling, Neo's core architecture remains consistent.
Neo acts as the intelligent migration coordinator, regardless of source tool:
Credential verification: Ensures proper cloud credentials are configured in Pulumi ESC
Resource inventory: Builds a complete catalog of existing resources
Conversion orchestration: Manages the code transformation
State migration: Brings existing resources under Pulumi management
Audit trail generation: Creates comprehensive migration reports
The state management engine works consistently across all tools:
Maps source resource IDs to Pulumi’s state management
Handles complex and composite resource identifiers
Provides fallback strategies for edge cases
Ensures idempotent operations
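Idempotency in particular can be pictured with a toy sketch (invented for illustration, not Neo's engine): applying the same source-ID mapping to the state twice must leave it unchanged, so a retried or re-run migration never duplicates work:

```typescript
// Toy sketch of idempotent state mapping (invented, not Neo's engine):
// re-applying an identical mapping is a no-op on the resulting state.
function applyMapping(
  state: Map<string, string>,
  sourceIds: Record<string, string>
): Map<string, string> {
  const next = new Map(state);
  for (const [addr, cloudId] of Object.entries(sourceIds)) {
    next.set(addr, cloudId); // overwriting with an identical value changes nothing
  }
  return next;
}

const once = applyMapping(new Map(), {
  "aws_instance.web[2]": "i-0abc123def456",
});
const twice = applyMapping(once, {
  "aws_instance.web[2]": "i-0abc123def456",
});
console.log(once.size === twice.size); // true
```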
While Neo automates the heavy lifting, we maintain human checkpoints:
Review generated code before migrating resources
Verify preview shows zero changes
Approve the resulting pull request
Migration friction no longer locks you into your current IaC tool. If you want Pulumi’s programming model, policy engine, and multi-cloud support, Neo gets you there without disrupting your infrastructure.
Ready to migrate? Check out our migration guides for CDK, CloudFormation, Terraform, or Azure ARM. Join us in the Pulumi Community Slack or reach out to your account team for a guided migration session.