In my previous post about the Ralph Wiggum technique, Claude Code built a complete serverless URL shortener on AWS. The setup included Playwright MCP for end-to-end testing. It worked. But I kept wondering if there was something better for AI-driven browser automation. Then Vercel released agent-browser, and I had to try it.
When an AI coding agent builds a frontend, someone has to verify it works. Without browser automation, that someone is you. The AI finishes and says “done,” but you can’t trust that claim until you open a browser and click around yourself.
Browser automation changes this. The AI verifies its own work. It builds a component, launches a browser, tests the interaction, confirms behavior matches expectations. If something breaks, it fixes the code and tests again. The validation loop runs without you.
This matters more as AI agents take on larger tasks. A quick component fix might not need automated testing. But when an agent builds an entire dashboard from scratch, you need proof that the deployed application actually works.
Playwright MCP gives you powerful browser control, but it comes with overhead. Every screenshot, every DOM snapshot, every accessibility tree adds tokens to your context window. GitHub issue #889 documents a 6x token increase between versions 0.0.30 and 0.0.32. Users report single screenshots consuming over 15,000 tokens. Some exhausted their entire five-hour token allocation in just a few automation steps.
The problem is verbose output. Full accessibility trees contain every element on the page with all their properties. Console message logging adds more. Each piece of information might be useful for debugging, but together they overwhelm the context window.
This creates a frustrating tradeoff. You want rich browser interaction for thorough testing, but that richness costs tokens. Fewer tokens means fewer iterations before hitting limits.
Vercel ran into something similar with their D0 text-to-SQL agent. Their research documents what happened when they reduced tool complexity.
The original architecture used 17 specialized tools. Each tool handled one specific operation: create tables, insert rows, run queries, validate schemas. The approach seemed logical. Specialized tools should produce better results than general ones.
The numbers told a different story. With 17 tools, they got 80% success (4 out of 5 queries), averaging 274.8 seconds and ~102,000 tokens. The worst case took 724 seconds, burned 145,463 tokens, and still failed.
Then they rebuilt with just two tools: ExecuteCommand and ExecuteSQL. Success jumped to 100%. Average execution dropped to 77.4 seconds (3.5x faster). Token usage fell to ~61,000 (37% reduction). The worst case took 141 seconds, used 67,483 tokens, and succeeded.
Vercel’s takeaway: “We were constraining reasoning because we didn’t trust the model to reason.” Fewer tools meant the model could think more freely about how to accomplish tasks. The simplicity reduced confusion and context waste.
agent-browser takes the same approach. Instead of separate tools for clicking, typing, scrolling, and navigating, it has a unified CLI with one clever idea: the snapshot + refs system.
When you request a page snapshot, agent-browser returns something like this:
```
- button "Sign In" [ref=e1]
- textbox "Email" [ref=e2]
- textbox "Password" [ref=e3]
- link "Documentation" [ref=e4]
```
Those `e1`, `e2` references are stable element identifiers; in commands you address them with an `@` prefix. To click the sign-in button, you run `agent-browser click @e1`. No CSS selectors. No XPath expressions. No waiting for the DOM to stabilize. The reference points to that exact element.
This cuts out the verbosity of full accessibility trees. You get just enough information to understand the page structure and interact with elements. The AI can reason about what it sees without drowning in metadata.
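To show how little structure an agent has to handle, here is a rough sketch of parsing such a snapshot into usable refs. The line format is taken from the example above; the parser itself is my own illustration, not part of agent-browser.

```typescript
// Hypothetical parser for snapshot lines like: - button "Sign In" [ref=e1]
interface SnapshotElement {
  role: string; // e.g. "button", "textbox", "link"
  name: string; // accessible name shown in quotes
  ref: string;  // stable identifier used in later commands (click @e1)
}

function parseSnapshot(snapshot: string): SnapshotElement[] {
  const pattern = /^-\s+(\w+)\s+"([^"]*)"\s+\[ref=(\w+)\]$/;
  return snapshot
    .split("\n")
    .map((line) => line.trim().match(pattern))
    .filter((m): m is RegExpMatchArray => m !== null)
    .map((m) => ({ role: m[1], name: m[2], ref: m[3] }));
}

const elements = parseSnapshot(
  `- button "Sign In" [ref=e1]\n- textbox "Email" [ref=e2]`
);
console.log(elements.length, elements[0].ref); // logs the parsed element count and first ref
```

A few lines of regex is the whole "protocol", which is exactly the point: the agent reasons over a short list of labeled refs instead of a full accessibility tree.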
Setup is simple:
```bash
npm install -g agent-browser && agent-browser install
```
This installs the CLI and downloads a bundled browser. You don’t need to install any additional browsers.
For Claude Code integration, install the official skill from the agent-browser package. You can either copy it from node_modules:
```bash
cp -r node_modules/agent-browser/skills/agent-browser .claude/skills/
```
Or download it directly:
```bash
mkdir -p .claude/skills/agent-browser
curl -o .claude/skills/agent-browser/SKILL.md \
  https://raw.githubusercontent.com/vercel-labs/agent-browser/main/skills/agent-browser/SKILL.md
```
The skill gives Claude Code better context than just running `agent-browser --help`.
Unlike Playwright MCP, there’s no server configuration. Agent-browser runs as a standalone CLI that Claude Code invokes directly through bash.
I wanted a realistic test scenario, so I added an analytics dashboard to the url-shortener-saas project from my previous post. The feature tracks click events with metadata (browser, device, country, referrer) and displays them through interactive charts.
This follows the same “Ralph Wiggum” workflow from my previous post: write a feature-prompt.md that describes what you want, then let Claude Code implement, deploy with Pulumi, and verify with browser automation. Here’s a snippet from the feature prompt:
```markdown
## Backend Requirements

### New DynamoDB Table: Analytics Events
Create a new DynamoDB table `url-shortener-analytics-{stack}` with:
- Partition Key: shortCode (String)
- Sort Key: timestamp (String, ISO 8601 format)
- Attributes: browser, deviceType, country, referrer, etc.

### New Lambda: analytics.ts
GET /api/analytics/{shortCode}
- Returns timeline, browsers, devices, countries, referrers
- Query params: from, to, granularity

## Success Criteria
1. `pulumi up` deploys successfully without errors
2. All existing E2E tests still pass
3. New analytics E2E tests pass
4. Charts render with real data after generating test clicks
```
Claude Code reads this, writes the Lambda handlers, updates the Pulumi infrastructure code, and deploys. The infrastructure changes are straightforward:
```typescript
// Analytics events table
const analyticsTable = new aws.dynamodb.Table("analytics-table", {
  name: `${projectName}-analytics-${stack}`,
  billingMode: "PAY_PER_REQUEST",
  hashKey: "shortCode",
  rangeKey: "timestamp",
  attributes: [
    { name: "shortCode", type: "S" },
    { name: "timestamp", type: "S" },
  ],
  globalSecondaryIndexes: [{
    name: "by-date",
    hashKey: "shortCode",
    rangeKey: "timestamp",
    projectionType: "ALL",
  }],
});
```
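Given that schema, the analytics Lambda's read path is essentially a DynamoDB Query on the partition key, with an optional range condition on the sort key for the `from`/`to` parameters. The helper below is my own sketch of building that query input; the function name and parameter handling are assumptions, not code from the project.

```typescript
// Hypothetical helper: build a DynamoDB Query input for one short code over an
// optional ISO-8601 time window, matching the table keys above.
// "timestamp" is a DynamoDB reserved word, hence the #ts name placeholder.
interface AnalyticsQueryInput {
  TableName: string;
  KeyConditionExpression: string;
  ExpressionAttributeValues: Record<string, string>;
  ExpressionAttributeNames?: Record<string, string>;
}

function buildAnalyticsQuery(
  table: string,
  shortCode: string,
  from?: string,
  to?: string
): AnalyticsQueryInput {
  if (from && to) {
    return {
      TableName: table,
      KeyConditionExpression: "shortCode = :code AND #ts BETWEEN :from AND :to",
      ExpressionAttributeNames: { "#ts": "timestamp" },
      ExpressionAttributeValues: { ":code": shortCode, ":from": from, ":to": to },
    };
  }
  return {
    TableName: table,
    KeyConditionExpression: "shortCode = :code",
    ExpressionAttributeValues: { ":code": shortCode },
  };
}
```

The returned object would be passed to a `QueryCommand` from the AWS SDK; the handler then buckets the items by browser, device, country, and referrer before responding.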
After `pulumi up` succeeds, the agent needs to verify the deployed feature works. That’s where browser automation comes in. The feature prompt specifies verification requirements:
```markdown
## End-to-End Verification
After deployment, use the `/agent-browser` skill to verify:
1. Analytics link appears in navigation
2. Analytics page loads without errors
3. Charts render (timeline, browser, device, geographic)
4. URL-specific analytics work from dashboard
5. Click data appears after generating test clicks
```
The test suite covered six scenarios: homepage load, URL shortening, dashboard view, analytics navigation, analytics overview, and date filter functionality. Nothing fancy, just the basics you’d want to verify after deploying.
I ran the exact same tests with both Playwright MCP and agent-browser to see how much context each consumed.
Here’s what the CLI interaction looked like during my test run.
Navigate to the deployed application:
```
$ agent-browser open https://d1232drths1aav.cloudfront.net
✓ Snip.ly - Modern URL Shortener
  https://d1232drths1aav.cloudfront.net/
```
Snapshot to see available elements (the -i flag filters to interactive elements only):
```
$ agent-browser snapshot -i
- link "S Snip.ly" [ref=e1]
- link "Home" [ref=e2]
- link "Dashboard" [ref=e3]
- link "Analytics" [ref=e4]
- link "Docs" [ref=e5]
- button "Switch to dark mode" [ref=e6]
- link "Get Started" [ref=e7]
- textbox "Paste your long URL here..." [ref=e8]
- button "Shorten URL" [ref=e9]
```
That entire homepage snapshot is 280 characters. Playwright MCP returned 8,247 characters for the same page.
Fill the URL input and click the shorten button:
```
$ agent-browser fill @e8 "https://example.com/agent-browser-e2e-test"
✓ Done
$ agent-browser click @e9
✓ Done
```
Each action confirmation is 6 characters. Playwright MCP returns the full page state after every click - 12,891 characters when I clicked the shorten button.
Navigate to analytics:
```
$ agent-browser click @e4
✓ Done
$ agent-browser wait --load networkidle && agent-browser snapshot -i
✓ Done
- link "S Snip.ly" [ref=e1]
- link "Home" [ref=e2]
- link "Dashboard" [ref=e3]
- link "Analytics" [ref=e4]
- button "Last 7 days" [ref=e8]
- button "Last 30 days" [ref=e9]
- button "Last 90 days" [ref=e10]
- link "1 analytics-e2e-test https://example.com/analytics-test-verification 5" [ref=e11]
- link "2 test-url https://www.example.com/this-is-a-very-long-url... 4" [ref=e12]
```
The analytics snapshot shows date filter buttons and top URLs with click counts. 385 characters versus Playwright MCP’s 4,127.
Test the date filter:
```
$ agent-browser click @e9
✓ Done
$ agent-browser screenshot analytics-30-day-filter.png
✓ Screenshot saved to analytics-30-day-filter.png
```
The ref-based workflow is deterministic. Each command operates on a specific element from the snapshot. No guessing about selectors.
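Because every action is a plain CLI call on a ref, an agent's action layer can be a thin command formatter. A toy sketch (the commands and quoting mirror the transcript above; the helper itself is hypothetical, not agent-browser code):

```typescript
// Hypothetical: turn high-level actions into agent-browser CLI argument lists.
type Action =
  | { kind: "click"; ref: string }
  | { kind: "fill"; ref: string; value: string }
  | { kind: "snapshot"; interactiveOnly: boolean };

function toArgs(action: Action): string[] {
  switch (action.kind) {
    case "click":
      return ["agent-browser", "click", `@${action.ref}`];
    case "fill":
      return ["agent-browser", "fill", `@${action.ref}`, action.value];
    case "snapshot":
      return action.interactiveOnly
        ? ["agent-browser", "snapshot", "-i"]
        : ["agent-browser", "snapshot"];
  }
}

console.log(toArgs({ kind: "click", ref: "e9" }).join(" ")); // agent-browser click @e9
```

There is no session object, no page handle, no driver lifecycle to manage: each command is stateless from the caller's perspective, which is what makes it trivial to invoke from an agent's bash tool.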
Same six tests, both tools:
| Metric | Playwright MCP | agent-browser | Reduction |
|---|---|---|---|
| Total response characters | 31,117 | 5,455 | 82.5% |
| Largest single response | 12,891 | 2,847 | 77.9% |
| Average response size | 3,112 | 328 | 89.5% |
| Homepage snapshot | 8,247 | 280 | 96.6% |
| Dashboard snapshot | 12,891 | 2,847 | 77.9% |
The difference is what each tool returns after every action.
Playwright MCP returns the full accessibility tree after every click:
```
### Page state
- Page URL: https://d1232drths1aav.cloudfront.net/dashboard
- Page Title: Snip.ly - Modern URL Shortener
- Page Snapshot:
  - generic [ref=e2]:
    - banner [ref=e3]:
      - generic [ref=e4]:
        - link "S Snip.ly" [ref=e5] [cursor=pointer]:
          - /url: /
          - generic [ref=e6]: S
          - generic [ref=e7]: Snip.ly
        - navigation [ref=e8]:
          - link "Home" [ref=e9] [cursor=pointer]:
            - /url: /
            - text: Home
...
[12,891 characters continues...]
```
Agent-browser returns:
```
✓ Done
```
6 characters versus 12,891 for the same button click.
Six tests consumed ~31K characters with Playwright MCP versus ~5.5K with agent-browser. At roughly 4 characters per token, that’s ~7,800 tokens versus ~1,400. An AI agent could run 5.7x more tests in the same context budget.
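The back-of-envelope conversion, using the common (and admittedly rough) four-characters-per-token heuristic:

```typescript
// Rough token estimate: ~4 characters per token. This is a heuristic,
// not a real tokenizer, so treat the outputs as orders of magnitude.
const charsToTokens = (chars: number): number => Math.round(chars / 4);

const playwrightTokens = charsToTokens(31_117);   // total for six tests
const agentBrowserTokens = charsToTokens(5_455);  // same six tests
const headroom = playwrightTokens / agentBrowserTokens;

console.log(playwrightTokens, agentBrowserTokens, headroom.toFixed(1));
// 7779 1364 5.7
```

That 5.7x figure is the practical difference: the same context budget buys roughly five to six times as many verification cycles.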
Past the numbers, working with agent-browser felt different in ways that are hard to quantify.
The compact snapshots changed the rhythm of iteration. A typical page fit in a few hundred tokens instead of thousands, so Claude Code could run more test cycles before hitting limits. I spent less time worrying about context and more time actually testing.
Element refs removed a frustration I’d had with Playwright. CSS selectors break when someone changes a class name or restructures the DOM. With agent-browser, a button is just @e9. It doesn’t care about styling. That predictability was a relief.
What worked well:
- Snapshots stay small enough that you don’t think about them
- Refs mean no selector debugging
- The CLI just works with Claude Code’s bash tool
- No MCP server to configure
Agent-browser is early. Documentation is thin, and I read the source more than once to figure out edge cases.
Where I hit friction:
- Modals that appear after API calls needed manual wait logic
- Playwright’s waiting mechanisms are more mature
- Hidden or dynamically loaded elements sometimes didn’t show up without explicit waits
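A generic polling helper papers over most of that friction: retry the snapshot until the element you need shows up. This is a sketch of the pattern, not agent-browser functionality; `check` would wrap a real snapshot call in practice.

```typescript
// Hypothetical polling helper: retry an async check until it yields a value
// (e.g. "does the snapshot contain a ref for the modal's close button yet?").
async function waitFor<T>(
  check: () => Promise<T | undefined>,
  { timeoutMs = 5_000, intervalMs = 250 } = {}
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const result = await check();
    if (result !== undefined) return result; // element appeared
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`waitFor: condition not met within ${timeoutMs}ms`);
}
```

In an automation script you might call something like `await waitFor(() => findRef("Close"))` after triggering a modal, where `findRef` re-snapshots and searches for the label; both names are placeholders of mine.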
For my URL shortener tests, agent-browser used fewer tokens per cycle, and the ref-based approach was more predictable than selector-driven automation.
Playwright MCP still wins on depth. Network interception, multi-tab handling, PDF generation, better waiting logic—if you need those, you need Playwright. For complex browser automation, it’s the more capable tool.
Agent-browser fits long autonomous sessions where context budget matters, basic navigation and verification tasks, and setups where you want to skip MCP configuration. CLI-based skills also avoid the schema overhead—tool definitions for multi-tool MCPs eat tokens even when idle.
Playwright MCP fits when you need the advanced stuff: network interception, PDF generation, multi-tab workflows, or sophisticated synchronization. It’s also the right choice if you have existing Playwright test suites. If the full MCP schema feels heavy, Playwright skills offer a lighter wrapper.
Start with agent-browser for AI validation loops. Move to Playwright when you outgrow it.
Pulumi ESC is Pulumi Cloud’s centralized solution for managing secrets and configuration across every vault and cloud provider you use. It helps teams secure their configuration while adopting modern best practices like short-lived credentials with OIDC and automated secret rotation.
Whether you’re configuring Pulumi programs, powering applications and services, or managing credentials for tools like the AWS CLI, ESC provides a single, consistent way to do it safely and at scale.
Behind the scenes, ESC integrates with multiple cloud providers and secret managers, supports composable environments, and offers rich built-in functions, from simple value transformations to encoding files as Base64.
With this level of power, usability matters more than ever. That’s why today we’re introducing the new and improved Pulumi ESC Web Editor, designed to make managing secrets and configuration easier, faster, and more intuitive.
Today, you can create and manage your Pulumi ESC configuration in multiple ways, such as using the CLI’s `set` and `edit` commands, or through our VS Code extension. For many users, however, their first experience with ESC happens in the Pulumi Cloud Console.
Based on feedback from users of both our YAML Document view and Table view in the Console, we’ve been working hard on a new, unified editor experience that makes ESC even easier to work with. One of the most notable improvements is a brand-new Inspect tab that lets you edit secrets and gain deeper insights into your configuration. With the new UI, you can freely switch between writing YAML and using rich UI elements to manipulate your environment. The editor keeps everything in sync, with clear, in-context information about what you’re doing and what’s possible at every step.
Let’s explore some of these use cases!
Adding secrets is now as simple as selecting Secret from the Add new menu.
The Inspect tab lets you view and edit your secret securely, automatically encrypting it as ciphertext in your environment definition. No more worrying about accidentally exposing sensitive values!
ESC offers a large library of providers and built-in functions to use in your environment. The new editor makes discovering and using them effortless.
When you add a provider or function, the editor inserts it with example values to get you started quickly. The Inspect tab provides instant access to documentation, so you can more easily configure the integrations.
Consuming your configuration where you need it is now easier than ever. The Export menu in the Inspect sidebar lets you quickly expose values as Pulumi config for your stacks, or as environment variables in your shell.
The new Pulumi ESC Editor brings together the best of both worlds: the power of the YAML editor with the ease of UI controls. Try it out today in the Pulumi Cloud Console and let us know what you think!
We’re excited to announce that Pulumi Identity and Access Management (IAM) is now available for self-hosted instances of Pulumi Cloud. This foundational security capability brings the same enterprise-grade access management we launched for Pulumi Cloud SaaS to organizations running Pulumi on their own infrastructure.
Self-hosted Pulumi Cloud customers can now leverage the full power of Custom Roles and Granular Access Tokens to implement Zero Trust security principles and least privilege access controls within their own environments. This means you can:
- Define Custom Permissions with fine-grained scopes (e.g., `stack:delete`, `environment:read`) tailored to your organization’s security requirements.
- Create Custom Roles by combining these permissions with specific Pulumi entities (Stacks, Environments, etc.).
- Generate Scoped Organization Access Tokens that are precisely limited to the permissions defined in their associated roles, dramatically reducing the blast radius if credentials are compromised.
This release is powerful for self-hosted Pulumi Cloud deployments where security and compliance requirements are often even more stringent. You can now:
- Implement Least Privilege CI/CD: Scope pipeline tokens to only the actions and resources they absolutely need, ensuring your automation follows the same security standards as your infrastructure.
- Enhance Compliance Posture: Demonstrate precise, auditable control over programmatic access to auditors and security teams.
- Reduce Operational Risk: Limit the potential impact of compromised tokens by restricting them to specific roles and permissions.
Self-hosted customers can access IAM features through the same intuitive interface available in Pulumi Cloud SaaS. Navigate to Settings -> Access Management -> Roles to begin creating Custom Permissions and Custom Roles, then generate scoped Organization Access Tokens from Settings -> Access Management -> Access Tokens.
For detailed information about Pulumi IAM capabilities, including step-by-step guides and best practices, see our comprehensive announcement blog post.
Explore the IAM & RBAC documentation to get started:
We’re committed to bringing enterprise-grade security features to all Pulumi deployments, whether in the cloud or on-premises. If you have questions or feedback, please reach out through your account representative or our GitHub repository.
I was about to do something that felt either genius or completely reckless: hand over my AWS credentials to an AI and step away from my computer. The technique is called “Ralph Wiggum,” named after the Simpsons character who eats glue and says “I’m in danger” while everything burns around him. And honestly, that felt about right for what I was attempting.
If you have spent any time with AI coding assistants, you know the frustration. You are in the middle of a task. It is going great. Claude is writing beautiful infrastructure code, understanding your architecture, making progress… and then it says “I have completed your request.”
You stare at the screen. “No. No, you have not. You maybe did a third of what I asked.”
So you reprompt it. It does a little more. Then it stops again. You are babysitting. The whole point of using an AI assistant was to save time, but if you have to check on it every five minutes, you are not saving anything. You are just a very expensive supervisor.
This is the problem Geoffrey Huntley decided to solve.
Geoffrey Huntley, a developer based in rural Australia (who, according to internet lore, lives on a property with goats), had a deceptively simple idea: what if every time Claude Code finishes and tries to exit, we just feed it the prompt again?
He named the technique “Ralph Wiggum” because the character embodies a kind of childlike persistence. Ralph repeatedly fails, makes silly mistakes, yet stubbornly continues in an endless loop until he eventually succeeds. Sound familiar? That is exactly how AI coding agents work. They make mistakes. They try again. They iterate.
The philosophy behind Ralph is beautifully stated: “Better to fail predictably than succeed unpredictably.” Every failure is just data for the next iteration.
At its core, the Ralph Wiggum technique is almost embarrassingly simple. It is a bash loop:
```bash
while true; do
  cat PROMPT.md | claude --print --dangerously-skip-permissions
done
```
That is it. That is the whole thing. You write your instructions in a PROMPT.md file, and the loop keeps feeding it to Claude Code over and over.
But here is why it works: each time Claude starts up, it looks at the project. It sees the files that exist, the code that is already written, the git history. It is not starting from zero. It picks up where it left off, sees what still needs to be done, and keeps going.
The official Claude Code plugin formalizes this with some important additions:
- Completion promises: You tell Claude to output a specific string like COMPLETE when it has genuinely finished
- Max iterations: A safety limit so your API bill does not reach infinity
- Stop hooks: The mechanism that intercepts exit attempts and re-injects the prompt
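The control flow these three ideas produce can be sketched as a simple loop. Everything here (names, shapes, the injected `runOnce` function) is my own illustration of the mechanics, not the plugin's actual implementation:

```typescript
// Hypothetical model of the Ralph loop: re-run the agent until it emits the
// completion promise or the iteration cap is hit.
async function ralphLoop(
  runOnce: (prompt: string) => Promise<string>, // one full Claude Code run
  prompt: string,
  { maxIterations = 25, completionPromise = "COMPLETE" } = {}
): Promise<{ completed: boolean; iterations: number }> {
  for (let i = 1; i <= maxIterations; i++) {
    const output = await runOnce(prompt); // the agent sees current project state
    if (output.includes(completionPromise)) {
      return { completed: true, iterations: i }; // genuine finish
    }
    // Otherwise the stop hook fires: feed the same prompt again next iteration.
  }
  return { completed: false, iterations: maxIterations }; // safety limit hit
}
```

The important property is that state lives in the project (files, git history), not in the loop: each iteration starts fresh but inherits everything the previous one committed.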
To install it:
```
/plugin install ralph-wiggum@ghuntley
```
For end-to-end testing of deployed infrastructure, you will also want the Playwright MCP server. This lets Claude interact with your deployed application in a real browser:
```bash
claude mcp add playwright npx @playwright/mcp@latest
```
Infrastructure as code has something most application code does not: objective success criteria. Your Lambda either deploys or it does not. Your DynamoDB table either exists or it does not. Tests either pass or they fail.
This makes Pulumi an ideal candidate for autonomous AI development:
Clear completion signals: pulumi preview and pulumi up provide unambiguous feedback
Iterative feedback: Failed deployments tell Claude exactly what went wrong
Testable outputs: You can write integration tests that verify your infrastructure actually works
Registry access: With the Pulumi MCP server, Claude has real-time access to documentation and examples
I decided to build a serverless URL shortener on AWS. Not because the world needs another URL shortener, but because it is the perfect test case: multiple AWS services, clear requirements, and objective success criteria. But I did not want just a basic demo. I wanted a full SaaS-like experience with a polished frontend.
Here is the PROMPT.md file I wrote:
```markdown
# Build a Production-Ready URL Shortener SaaS on AWS

Using Pulumi TypeScript, create a complete URL shortener SaaS with a polished frontend experience.

## Infrastructure Requirements
- DynamoDB table for storing URL mappings (shortCode -> originalUrl, clickCount, createdAt)
- Lambda functions for: creating short URLs, redirecting to original URLs, getting stats
- API Gateway REST API exposing the Lambda functions
- S3 bucket for the React frontend with static website hosting
- CloudFront distribution with HTTPS for the frontend and API
- Use the Pulumi ESC environment pulumi-idp/auth for the AWS credentials

## Frontend Requirements (Use /frontend-design skill)
Create a polished, production-ready SaaS website with:

### Landing Page
- Hero section with catchy headline and URL input form
- Feature highlights (fast redirects, analytics, custom aliases)
- Pricing section (Free tier, Pro tier, Enterprise tier)
- Testimonials section with 3 fake but realistic customer reviews
- Footer with links to Docs, Privacy, Terms

### Dashboard Page
- URL shortening form with optional custom alias
- List of created short URLs with click counts
- Copy-to-clipboard functionality
- Delete URL option

### Documentation Page
- Getting started guide
- API reference (POST /shorten, GET /{code}, GET /stats/{code})
- Rate limits and usage policies
- FAQ section

### Design Requirements
- Modern, clean aesthetic (no generic AI look)
- Responsive design (mobile, tablet, desktop)
- Dark mode support
- Smooth animations and transitions
- Consistent color scheme and typography

## API Requirements
- POST /shorten: accepts { url: string, alias?: string }, returns { shortCode, shortUrl }
- GET /{shortCode}: redirects to original URL (301)
- GET /stats/{shortCode}: returns { originalUrl, clickCount, createdAt }

## Success Criteria
- All unit tests pass
- All integration tests pass
- pulumi preview shows no errors
- pulumi up deploys successfully

## End-to-End Verification (Required)
After deployment, use Playwright MCP to verify the live site:
1. Open the CloudFront URL and verify the landing page loads with all sections
2. Navigate to the docs page and verify content renders
3. Create a short URL using the dashboard
4. Verify the short URL redirects correctly (301 status)
5. Check the stats show the click was recorded
6. Test responsive design by resizing the viewport
7. Take screenshots of landing page, dashboard, and docs as proof

Only output COMPLETE after ALL of the above including E2E tests pass.
```
The key addition here is the end-to-end verification section. Without it, Claude reports success after pulumi up finishes, but you have no guarantee the deployed infrastructure actually works. By requiring Playwright to test the live URLs, you catch issues like the CloudFront 403 error I mentioned earlier.
Before running the loop, I started Claude Code with permission bypass mode to prevent it from stopping to ask for approval on every file write or command:
```bash
claude --permission-mode bypassPermissions
```
There are other ways to handle permissions, but this is the simplest for autonomous execution. Then I ran the Ralph loop:
```
/ralph-wiggum:ralph-loop PROMPT.md --max-iterations 25 --completion-promise "COMPLETE"
```
Then I stepped away and let it run unsupervised. For complex projects, you could even let it run overnight while you sleep.
Here is Claude Code in action during the Ralph loop, fixing CloudFront configuration issues and writing unit tests:
The complete source code for this project is available on GitHub: url-shortener-saas
When Claude finished, it had:
Successfully built:
- A complete Pulumi program with proper resource organization
- DynamoDB table with a GSI for querying by creation date
- Three Lambda functions with proper error handling
- API Gateway with CORS configuration
- S3 bucket with static website hosting
- CloudFront distribution with proper cache behaviors
The frontend was genuinely impressive:
- A polished landing page with hero section, feature cards, and pricing tiers
- Working testimonials section with three fictional but believable customer reviews
- A functional dashboard where you could create and manage short URLs
- A documentation page with API reference and getting started guide
- Dark mode toggle that actually worked
- Responsive design that looked good on mobile
Here is a snippet of what it generated:
```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

const stack = pulumi.getStack();
const projectName = "url-shortener";

// DynamoDB Table
const urlTable = new aws.dynamodb.Table("url-table", {
  name: `${projectName}-urls-${stack}`,
  billingMode: "PAY_PER_REQUEST",
  hashKey: "shortCode",
  attributes: [{ name: "shortCode", type: "S" }],
  tags: {
    Environment: stack,
    Project: projectName,
    ManagedBy: "Pulumi",
  },
});

// Lambda function for URL shortening
const shortenFunction = new aws.lambda.Function("shorten-function", {
  name: `${projectName}-shorten-${stack}`,
  runtime: aws.lambda.Runtime.NodeJS20dX,
  handler: "shorten.handler",
  role: lambdaRole.arn,
  timeout: 30,
  memorySize: 256,
  code: new pulumi.asset.AssetArchive({
    ".": new pulumi.asset.FileArchive("./lambda"),
  }),
  environment: {
    variables: {
      TABLE_NAME: urlTable.name,
    },
  },
});

// CloudFront Distribution with S3 and API Gateway origins
const distribution = new aws.cloudfront.Distribution("frontend-distribution", {
  enabled: true,
  defaultRootObject: "index.html",
  origins: [
    {
      domainName: frontendBucketWebsite.websiteEndpoint,
      originId: "S3Origin",
      customOriginConfig: {
        httpPort: 80,
        httpsPort: 443,
        originProtocolPolicy: "http-only",
        originSslProtocols: ["TLSv1.2"],
      },
    },
    {
      domainName: pulumi.interpolate`${api.id}.execute-api.${aws.config.region}.amazonaws.com`,
      originId: "APIGatewayOrigin",
      customOriginConfig: {
        httpPort: 80,
        httpsPort: 443,
        originProtocolPolicy: "https-only",
        originSslProtocols: ["TLSv1.2"],
      },
    },
  ],
  defaultCacheBehavior: {
    targetOriginId: "S3Origin",
    viewerProtocolPolicy: "redirect-to-https",
    allowedMethods: ["GET", "HEAD", "OPTIONS"],
    cachedMethods: ["GET", "HEAD"],
    forwardedValues: {
      queryString: false,
      cookies: { forward: "none" },
    },
  },
  orderedCacheBehaviors: [{
    pathPattern: "/api/*",
    targetOriginId: "APIGatewayOrigin",
    viewerProtocolPolicy: "https-only",
    allowedMethods: ["DELETE", "GET", "HEAD", "OPTIONS", "PATCH", "POST", "PUT"],
    cachedMethods: ["GET", "HEAD"],
    forwardedValues: {
      queryString: true,
      headers: ["Authorization", "Content-Type"],
      cookies: { forward: "none" },
    },
    minTtl: 0,
    defaultTtl: 0,
    maxTtl: 0,
  }],
  viewerCertificate: {
    cloudfrontDefaultCertificate: true,
  },
});
```
Where it struggled:
- The first few iterations had the Lambda code inline instead of in separate files
- It initially forgot to set up IAM permissions for the Lambda to access DynamoDB
- The CloudFront configuration took three attempts to get the cache behaviors right
The funny parts:
- At iteration 12, it decided to completely refactor the project structure. Then at iteration 13, it refactored it back
- One commit message simply read: “fix the fix that fixed the previous fix”
- It wrote a test, ran the test, fixed the code to pass the test, then realized the test was wrong
After running several experiments, here is what I have learned:
Write extremely specific prompts. Vague instructions lead to vague implementations. Be explicit about every requirement, every edge case, every success criterion.
Use pulumi preview as your feedback loop. Add this to your success criteria:
```
Success Criteria:
- pulumi preview completes with no errors
- pulumi preview shows expected resource count (approximately X resources)
```
Set realistic iteration limits. For infrastructure projects, I have found 20-30 iterations is usually sufficient. More than that and you are probably missing something in your prompt.
Include test requirements. Claude is surprisingly good at test-driven development when you tell it to write tests first:
```markdown
## Development Approach
1. Write unit tests for each Lambda function first
2. Implement Lambda functions to pass the tests
3. Write integration tests that verify the deployed infrastructure
4. Only output COMPLETE when all tests pass
```
Mention cost awareness. Add a note like “Use PAY_PER_REQUEST billing for DynamoDB” or “Use the smallest Lambda memory allocation that works.” Claude will optimize for what you tell it to optimize for.
Is this the future of infrastructure development? Probably not entirely. But it is a genuinely useful technique for specific scenarios:
Great for:
- Greenfield projects where you have clear requirements
- Large migrations or refactors
- Building proof-of-concept infrastructure while you focus on other work
- Tasks where iteration is more valuable than perfection on the first try
Not great for:
- Production infrastructure changes (please do not let an AI modify your production AWS account unsupervised)
- Complex architecture decisions that require human judgment
- Anything involving sensitive data or security configurations
- Small tweaks or quick fixes (overkill)
The API costs were reasonable for what I got. For a complete infrastructure project built autonomously, a few dollars in Claude API calls is a bargain compared to my hourly rate.
If you want to experiment with Ralph Wiggum and Pulumi:
1. Install Claude Code and the Ralph Wiggum plugin
2. Set up your Pulumi account and AWS credentials
3. Write a detailed PROMPT.md with clear success criteria
4. Start with a small project and low iteration limits
5. Review the git history when it finishes
The combination of autonomous AI loops and infrastructure as code feels like a glimpse of where development is heading. Not replacing developers, but changing how we use our time. Instead of babysitting an AI through each step, you hand it a well-defined problem and come back to working infrastructure.
Just make sure you have billing alerts set up. Ralph Wiggum does not know when to stop spending your money.
If setting up the Ralph Wiggum plugin feels like too much ceremony, Pulumi has something similar built right into Pulumi Neo.
Neo Tasks work on the same principle: you describe what you want, and Neo plans and executes the infrastructure changes. The key is auto mode. When you set a task to auto mode, Neo runs without requesting approvals. It plans, executes, validates with pulumi preview, and iterates until the task is complete.
The experience is remarkably similar to Ralph:
Persistent execution: Tasks continue running even if you close your browser. Come back later and Neo shows you what it accomplished while you were away.
Iterative refinement: Failed deployments inform the next attempt, just like Ralph re-reading the project state.
Clear completion: Neo knows when the infrastructure matches your requirements.
The main difference is that Neo runs in Pulumi Cloud rather than on your local machine, and it is purpose-built for infrastructure tasks. You get the autonomous loop experience without needing to configure plugins or bash loops.
For teams already using Pulumi Cloud, Neo with auto mode might be the fastest path to autonomous infrastructure development.
ConfigMaps in Kubernetes don’t have built-in revision support, which can create challenges when deploying applications with canary strategies.
When using Argo Rollouts with AWS Spot instances, ConfigMap deletions during canary deployments can cause older pods to fail when they try to reload configuration. We solved this by implementing a custom ConfigMap revision system using Pulumi’s ConfigMapPatch and Kubernetes owner references.
When deploying applications to Kubernetes using canary strategies with Argo Rollouts, we encountered a specific challenge:
Pulumi ConfigMap replacement behavior: By default, when a ConfigMap’s data changes, Pulumi may replace it rather than update it in place, which for auto-named ConfigMaps results in a new generated name (suffix).
Canary deployment issues: During canary deployments, the old ConfigMap gets deleted, but older pods (especially on AWS Spot instances that can be replaced during canary) may fail to reload configuration.
No native revision support: Neither Kubernetes nor Pulumi natively supports ConfigMap revisions like they do for deployments
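To illustrate the first point: if you let Pulumi auto-name a ConfigMap, a change to its data triggers a replacement with a freshly generated suffix. A minimal sketch (names are illustrative):

```typescript
import * as k8s from "@pulumi/kubernetes";

// No explicit metadata.name, so Pulumi auto-names the object as
// "app-config-<random suffix>". Changing `data` replaces the ConfigMap,
// producing a new suffix instead of an in-place update.
const config = new k8s.core.v1.ConfigMap("app-config", {
    metadata: { namespace: "default" },
    data: { "app.properties": "key=value" },
});
export const generatedName = config.metadata.name;
```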
Going to KubeCon Europe 2026? Tame Kubernetes complexity with code.
Pulumi demos, platform engineering talks, AI-powered Neo in action, and yes, free plushies. Booth 784.
Our solution leverages Kubernetes’ garbage collection mechanism by using owner references to tie ConfigMaps to ReplicaSets created during canary deployments.
Pulumi makes this approach practical by letting us express patching logic, owner references, and rollout coordination directly in code, instead of encoding complex behavior in static YAML.
Pulumi’s ConfigMapPatch: Patches existing ConfigMaps with owner references
ReplicaSet Owner References: Links ConfigMaps to ReplicaSets for automatic cleanup
Kubernetes Garbage Collection: Automatically cleans up ConfigMaps when ReplicaSets are deleted
Retain on Delete: Protects ConfigMaps from immediate deletion during Pulumi updates
Here’s how we implemented this solution in our rollout component:
import * as pulumi from "@pulumi/pulumi";
import * as k8s from "@pulumi/kubernetes";
import * as k8sClient from "@kubernetes/client-node";

interface RolloutComponentArgs {
    namespace: string;
    configMapPatch?: boolean;
    kubeconfig: pulumi.Output<any>;
    configMapName: pulumi.Output<string>;
    rolloutSpec: k8s.types.input.apiextensions.CustomResourceArgs["spec"];
}

export class ConfigMapRevisionRollout extends pulumi.ComponentResource {
    public readonly rollout: k8s.apiextensions.CustomResource;

    constructor(
        name: string,
        args: RolloutComponentArgs,
        opts?: pulumi.ComponentResourceOptions
    ) {
        super("pulumi:component:ConfigMapRevisionRollout", name, {}, opts);

        // Create the Argo Rollout using CustomResource
        this.rollout = new k8s.apiextensions.CustomResource(
            `${name}-rollout`,
            {
                apiVersion: "argoproj.io/v1alpha1",
                kind: "Rollout",
                metadata: {
                    name: name,
                    namespace: args.namespace,
                },
                spec: args.rolloutSpec,
            },
            { parent: this, ...opts }
        );

        // Apply ConfigMap revision patching if enabled
        if (args.configMapPatch) {
            this.setupConfigMapRevisions(name, args);
        }

        this.registerOutputs({
            rollout: this.rollout,
        });
    }

    private setupConfigMapRevisions(name: string, args: RolloutComponentArgs): void {
        pulumi
            .all([args.kubeconfig, args.configMapName])
            .apply(async ([kubeconfig, configMapName]) => {
                try {
                    // Create Server-Side Apply enabled provider
                    const ssaProvider = new k8s.Provider(`${name}-ssa-provider`, {
                        kubeconfig: JSON.stringify(kubeconfig),
                        enableServerSideApply: true,
                    });

                    // Wait for rollout to stabilize and create ReplicaSets
                    await this.waitForRolloutStabilization();

                    // Get ReplicaSets associated with this rollout
                    const replicaSets = await this.getAssociatedReplicaSets(
                        args.namespace,
                        configMapName,
                        kubeconfig
                    );

                    if (replicaSets.length === 0) {
                        pulumi.log.warn("No ReplicaSets found for ConfigMap patching");
                        return;
                    }

                    // Create owner references for the ConfigMap
                    const ownerReferences = replicaSets.map(rs => ({
                        apiVersion: "apps/v1",
                        kind: "ReplicaSet",
                        name: rs.metadata?.name!,
                        uid: rs.metadata?.uid!,
                        controller: false,
                        blockOwnerDeletion: false,
                    }));

                    // Patch the ConfigMap with owner references
                    new k8s.core.v1.ConfigMapPatch(
                        `${configMapName}-revision-patch`,
                        {
                            metadata: {
                                name: configMapName,
                                namespace: args.namespace,
                                ownerReferences: ownerReferences,
                                annotations: {
                                    "pulumi.com/patchForce": "true",
                                    "configmap.kubernetes.io/revision-managed": "true",
                                },
                            },
                        },
                        {
                            provider: ssaProvider,
                            retainOnDelete: true,
                            parent: this,
                        }
                    );

                    pulumi.log.info(
                        `Successfully patched ConfigMap ${configMapName} with ${ownerReferences.length} owner references`
                    );
                } catch (error) {
                    pulumi.log.error(`Failed to setup ConfigMap revisions: ${error}`);
                    throw error;
                }
            });
    }

    private async waitForRolloutStabilization(): Promise<void> {
        // Wait for rollout to create and stabilize ReplicaSets
        // In production, consider using a more sophisticated polling mechanism
        await new Promise(resolve => setTimeout(resolve, 10000));
    }

    private async getAssociatedReplicaSets(
        namespace: string,
        configMapName: string,
        kubeconfig: any
    ): Promise<k8sClient.V1ReplicaSet[]> {
        const kc = new k8sClient.KubeConfig();
        kc.loadFromString(JSON.stringify(kubeconfig));
        const appsV1Api = kc.makeApiClient(k8sClient.AppsV1Api);

        try {
            const response = await appsV1Api.listNamespacedReplicaSet(
                namespace,
                undefined, // pretty
                false, // allowWatchBookmarks
                undefined, // continue
                undefined, // fieldSelector
                `configMap=${configMapName}` // labelSelector
            );
            return response.body.items;
        } catch (error) {
            pulumi.log.error(`Failed to list ReplicaSets: ${error}`);
            return [];
        }
    }
}
Rollout Creation: When a new rollout is created, Argo Rollouts generates new ReplicaSets for the canary deployment
ConfigMap Patching: Our code waits for the ReplicaSet creation, then patches the ConfigMap with owner references pointing to these ReplicaSets
Garbage Collection: Kubernetes automatically tracks the relationship between ConfigMaps and ReplicaSets
Automatic Cleanup: When ReplicaSets are cleaned up (based on the default 10 revision history), their associated ConfigMaps are also garbage collected
Revision Control: ConfigMaps now have revision-like behavior tied to ReplicaSet history
Automatic Cleanup: No manual intervention needed for ConfigMap cleanup
Canary Safety: Old ConfigMaps remain available during canary deployments until ReplicaSets are cleaned up
Spot Instance Resilience: Pods that get replaced during canary deployments can still access their original ConfigMaps
interface RolloutComponentArgs {
    namespace: string;
    configMapPatch?: boolean;
    kubeconfig: pulumi.Output<any>;
    configMapName: pulumi.Output<string>;
    rolloutSpec: k8s.types.input.apiextensions.CustomResourceArgs["spec"];
}
To enable this feature in your rollout:
import * as pulumi from "@pulumi/pulumi";
import * as k8s from "@pulumi/kubernetes";

// Create a Kubernetes provider for your cluster.
// Assumes `clusterKubeconfig` comes from your cluster (e.g. an EKS cluster's kubeconfig output).
const k8sProvider = new k8s.Provider("k8s-provider", {
    kubeconfig: clusterKubeconfig,
});

// Create ConfigMap
const appConfig = new k8s.core.v1.ConfigMap("app-config", {
    metadata: {
        name: "my-app-config",
        namespace: "default",
        labels: {
            app: "my-app",
            configMap: "my-app-config", // Important for ReplicaSet selection
        },
    },
    data: {
        "app.properties": "key=value\nother=setting",
    },
}, { provider: k8sProvider });

// Create rollout with ConfigMap revision management
const rollout = new ConfigMapRevisionRollout("my-app", {
    namespace: "default",
    configMapPatch: true,
    kubeconfig: clusterKubeconfig,
    configMapName: appConfig.metadata.name,
    rolloutSpec: {
        replicas: 3,
        selector: {
            matchLabels: { app: "my-app" },
        },
        template: {
            metadata: {
                labels: { app: "my-app" },
            },
            spec: {
                containers: [{
                    name: "app",
                    image: "nginx:latest",
                    volumeMounts: [{
                        name: "config",
                        mountPath: "/etc/config",
                    }],
                }],
                volumes: [{
                    name: "config",
                    configMap: {
                        name: appConfig.metadata.name,
                    },
                }],
            },
        },
        strategy: {
            canary: {
                maxSurge: 1,
                maxUnavailable: 0,
                steps: [
                    { setWeight: 20 },
                    { pause: { duration: "1m" } },
                    { setWeight: 50 },
                    { pause: { duration: "2m" } },
                ],
            },
        },
    },
});
The solution uses several key packages:
@pulumi/kubernetes: For Kubernetes resources and ConfigMapPatch
@kubernetes/client-node: For direct Kubernetes API access
Argo Rollouts CRDs installed in your cluster
This pattern is a good fit if you:
Use Argo Rollouts with canary deployments
Rely on ConfigMaps that must remain available across rollout revisions
Run workloads on Spot or preemptible instances
Need automatic cleanup without custom controllers
This approach gives us ConfigMap revision functionality that doesn’t exist natively in Kubernetes or Pulumi. By leveraging Kubernetes’ garbage collection mechanism and Pulumi’s patching capabilities, we created a robust solution for managing ConfigMap lifecycles during canary deployments.
The solution is particularly valuable when:
Running canary deployments with Argo Rollouts
Using AWS Spot instances that can be replaced during deployments
Needing automatic cleanup of old ConfigMaps without manual intervention
Wanting to maintain configuration availability for older pods during deployment transitions
This pattern can be extended to other scenarios where you need revision control for Kubernetes resources that don’t natively support it.
Want to learn how to put these practices into action? Meet us at KubeCon Europe 2026 (Booth 784) or register for our upcoming Zero to Production in Kubernetes workshop.
Today we’re introducing an improvement that can speed up operations by up to 20x. At every operation, and at every step within an operation, pulumi saves a snapshot of your cloud infrastructure. This gives pulumi a current view of state even if something fails mid-operation, but it comes with a performance penalty for large stacks. Here’s how we fixed it.
Before getting into the more technical details, here are a number of benchmarks demonstrating what this new experience looks like. To run the benchmarks we picked a couple of Pulumi projects: one that can be set up massively parallel, which is the worst case scenario for the old snapshot system, and another that looks a little more like a real world example. Note that we conducted all of these benchmarks in Europe, connecting to Pulumi Cloud, which runs in AWS’s us-west-2 region, so exact numbers may vary based on your location and internet connection. This should however give a good indication of the performance improvements.
We’re benchmarking two somewhat large stacks, both of which are or were used at Pulumi. The first program sets up a website using AWS bucket objects. We’re using the aws-ts-static-website example here with a small fraction of the files from our docs site. This means we’re setting up more than 3000 bucket objects, with 3222 resources in total.
The benchmarks were measured using the time built-in command and using the best time in a best-of-three benchmark. The network traffic was measured using tcpdump, limiting the measured traffic to only the IP addresses for Pulumi Cloud. Finally, tshark was used to process the packet captures and count the bytes sent.
All the benchmarks are run with journaling off (the default experience) and with journaling on (the new experience). To begin with, let’s look at the results when creating our stack from scratch:
                     Time    Bytes sent
Without journaling   58m26s  16.5MB
With journaling      02m50s  2.3MB
Now let’s look at the results if we change half the resources and leave the rest unchanged:
                     Time    Bytes sent
Without journaling   34m49s  13.8MB
With journaling      01m45s  2.3MB
The second example sets up an instance of the Pulumi app and API. This one is more dominated by the cost of creating the actual infrastructure in the cloud, but we still see a very noticeable improvement in the time it takes to set up the stack.
                     Time    Bytes sent
Without journaling   17m52s  18.5MB
With journaling      9m12s   5.9MB
To use this feature, you need a pulumi version newer than v3.211.0, and set the PULUMI_ENABLE_JOURNALING environment variable to true.
If you are interested in the more technical details read on!
pulumi keeps track of all resources in a stack in a snapshot. This snapshot is stored in the stack’s configured backend, which is either the Pulumi Cloud or a DIY backend. Future operations on the stack then use this snapshot to figure out which resources need to be created, updated or deleted.
pulumi creates a new snapshot at the beginning and at the end of each resource operation to minimize the possibility of untracked changes even if a deployment is aborted unexpectedly (for example due to network issues, power outages, or bugs).
At the beginning of the operation, pulumi adds a new “pending operation” to the snapshot. Pending operations declare the intent to mutate a resource. If a pending operation is left in the snapshot (in other words the operation started, but pulumi couldn’t record the end of it), the next operation will try to resolve this. If we have an ID for the resource already, for example on partial updates/deletes, pulumi will try to read the resource state from the cloud and resolve it internally. If there is no ID yet, pulumi will ask the user to check the actual state of the resource. Depending on the user’s response, pulumi will either remove the operation from the snapshot or import the resource. This is because it is possible that the resource has been set up correctly or that the resource creation failed. If pulumi aborted midway through the operation, it’s impossible to know which state the resource is in.
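The decision logic boils down to a small branch. A simplified sketch (TypeScript for illustration; this is not the engine's actual code):

```typescript
// How a leftover pending operation might be resolved: with an ID we can
// read the real state back from the provider; without one, only the
// user can tell whether the create actually happened.
type Resolution = "read-state-from-cloud" | "ask-user";

function resolvePendingOperation(op: { resourceID?: string }): Resolution {
    return op.resourceID !== undefined ? "read-state-from-cloud" : "ask-user";
}

console.log(resolvePendingOperation({ resourceID: "i-12345" })); // prints "read-state-from-cloud"
console.log(resolvePendingOperation({})); // prints "ask-user"
```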
Once an operation finishes, the pending operation is removed and the resource’s final state is recorded in the snapshot.
There’s also some additional metadata that is stored in the snapshot that is only updated infrequently.
Here’s how the snapshot looks in code. This snapshot is serialized and sent to the backend. Resources holds the list of known resource states and is updated after each operation finishes, and PendingOperations is the list of pending operations described above.
type Snapshot struct {
    Manifest          Manifest             // a deployment manifest of versions, checksums, and so on.
    SecretsManager    secrets.Manager      // the secrets manager to use when serializing this snapshot.
    Resources         []*resource.State    // all resources and their associated states.
    PendingOperations []resource.Operation // all currently pending resource operations.
    Metadata          SnapshotMetadata     // metadata associated with the snapshot.
}
Before we dive in deeper, we also need to understand a little bit about how the pulumi engine works internally. Whenever a pulumi operation is run (e.g. pulumi up, pulumi destroy, pulumi refresh etc.), the engine internally generates and executes a series of steps, to create, update, delete etc. resources. To maintain correct relationships between resources, the steps need to be executed in a partial order such that no step is executed until all of the steps it depends on have executed successfully. Steps may otherwise execute concurrently.
As each step is responsible for updating a single resource, we can generate a snapshot of the state before each step starts, and after it completes. Before each step starts, we create a pending operation, and add it to the PendingOperations list. After that step completes, we remove the pending operation from that list, and update the Resources list, either adding a resource, removing it, or updating it, depending on the kind of operation we just executed.
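The partial-order execution described above can be sketched in a few lines (an illustration in TypeScript, not the engine's actual Go code; step names are invented):

```typescript
// Steps run concurrently, but each one waits for its dependencies to
// finish first, mirroring the engine's partial-order scheduler.
interface Step {
    name: string;
    deps: string[];
}

async function execute(steps: Step[]): Promise<string[]> {
    const finished = new Map<string, Promise<void>>();
    const order: string[] = [];

    const start = (step: Step): Promise<void> => {
        const existing = finished.get(step.name);
        if (existing) return existing;
        const p = (async () => {
            // Block until every dependency has completed.
            await Promise.all(step.deps.map((d) =>
                start(steps.find((s) => s.name === d)!)));
            order.push(step.name); // the step's resource operation runs here
        })();
        finished.set(step.name, p);
        return p;
    };

    await Promise.all(steps.map((s) => start(s)));
    return order;
}

execute([
    { name: "vpc", deps: [] },
    { name: "subnet", deps: ["vpc"] },
    { name: "bucket", deps: [] },
    { name: "instance", deps: ["subnet", "bucket"] },
]).then((order) => console.log(order.join(",")));
```

Independent steps like `vpc` and `bucket` run concurrently, while `instance` always completes last.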
After this introduction, we can dive into what’s slow, how we fixed it, and some benchmarks.
To make sure the state is always as up-to-date as possible, even if there are any network hiccups/power outages etc., a step won’t start until the snapshot that includes the pending operation is confirmed to be stored in the backend. Similarly an operation won’t be considered finished until the snapshot with an updated resources list is confirmed to be stored in the backend.
To send the current state to the backend, we serialize it as a JSON file and upload it. However, as mentioned above, steps can execute in parallel. If we uploaded the snapshot at the beginning and end of every step without any coordination, we would risk overwriting a newer snapshot with an older one, leading to incorrect data.
Our workaround for that is to serialize the snapshot uploads, uploading one snapshot at a time. This gives us the data integrity properties we want, however it can slow step execution down, especially on internet connections with lower bandwidth, and/or high latency.
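The serialized upload queue can be sketched as a promise chain (a minimal illustration, not the engine's actual code; the delay simulates a network write):

```typescript
// Uploads are chained on a single promise, so only one snapshot is in
// flight at a time and uploads are applied in submission order.
class SnapshotUploader {
    private queue: Promise<void> = Promise.resolve();
    readonly uploaded: string[] = [];

    // Each caller awaits its own upload, but uploads never overlap.
    save(snapshot: string): Promise<void> {
        this.queue = this.queue.then(async () => {
            // Simulated network write; the real engine sends an HTTP
            // request to the backend here.
            await new Promise<void>((resolve) => setTimeout(resolve, 1));
            this.uploaded.push(snapshot);
        });
        return this.queue;
    }
}

async function demo(): Promise<string[]> {
    const u = new SnapshotUploader();
    // Fired concurrently, but applied strictly in submission order.
    await Promise.all([u.save("v1"), u.save("v2"), u.save("v3")]);
    return u.uploaded;
}

demo().then((order) => console.log(order.join(","))); // prints "v1,v2,v3"
```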
This impacts performance especially for large stacks, as we upload the whole snapshot every time, which can take some time if the snapshot is getting big. For the Pulumi Cloud backend we improved on this a little at the end of 2022. We implemented a diff based protocol, which is especially helpful for large snapshots, as we only need to send the diff between the old and the new snapshot, and Pulumi Cloud can then reconstruct the full snapshot based on that. This reduces the amount of data that needs to be transferred, thus improving performance.
However, the snapshotting is still a major bottleneck for large pulumi operations. Having to serially upload the snapshot twice for each step does still have a big impact on performance, especially if many resources are modified in parallel. Furthermore, the time spent performing textual diffs between snapshots scales in proportion to the size of the data being processed, which adds additional execution time to each operation.
As long as pulumi can complete its operation, there is no need for the intermediate checkpoints. We could simply allow pulumi operations to skip uploading them to the backend. This avoids the single serialization point for sending snapshots to the backend, and thus makes the operation much more performant.
However, it also has the serious disadvantage of compromising some of the data integrity guarantees pulumi gives you. If anything goes wrong during the update, pulumi has no record of what happened up to that point, potentially leaving orphaned resources in the provider, or leaving resources in the state file that no longer exist in the cloud.
Neither of these solutions is very satisfying, as the tradeoff is either performance or data integrity. We would like to have our cake and eat it too, and that’s exactly what we’re doing with journaling.
To achieve this, we went back to the drawing board, and asked ourselves, “What would a solution look like that’s both performant and preserves data integrity throughout the update?”.
Making that happen is possible because of three facts:
We always start with the same snapshot on the backend and the CLI.
Every step the engine executes affects only one resource.
We have a service that can reconstruct a snapshot from what is given to it.
(The third point here already hints at it, but this feature is only available and made possible by Pulumi Cloud, not on the DIY backend).
What if, instead of sending the whole snapshot or a diff of it, we sent only the individual changes to the base snapshot, letting the service apply them and reconstruct a full snapshot? This is exactly what we do, in the form of what we call journal entries. Each journal entry has the following form:
const (
    JournalEntryKindBegin            JournalEntryKind = 0
    JournalEntryKindSuccess          JournalEntryKind = 1
    JournalEntryKindFailure          JournalEntryKind = 2
    JournalEntryKindRefreshSuccess   JournalEntryKind = 3
    JournalEntryKindOutputs          JournalEntryKind = 4
    JournalEntryKindWrite            JournalEntryKind = 5
    JournalEntryKindSecretsManager   JournalEntryKind = 6
    JournalEntryKindRebuiltBaseState JournalEntryKind = 7
)
type JournalEntry struct {
    // Version of the journal entry format.
    Version int `json:"version"`
    // Kind of journal entry.
    Kind JournalEntryKind `json:"kind"`
    // Sequence ID of the operation.
    SequenceID int64 `json:"sequenceID"`
    // ID of the operation this journal entry is associated with.
    OperationID int64 `json:"operationID"`
    // RemoveOld is the index of the resource in the base snapshot to remove.
    RemoveOld *int64 `json:"removeOld"`
    // RemoveNew is the ID of the operation whose resource is to be removed.
    RemoveNew *int64 `json:"removeNew"`
    // PendingReplacementOld is the index of the resource that's to be marked as pending replacement
    PendingReplacementOld *int64 `json:"pendingReplacementOld,omitempty"`
    // PendingReplacementNew is the operation ID of the new resource to be marked as pending replacement
    PendingReplacementNew *int64 `json:"pendingReplacementNew,omitempty"`
    // DeleteOld is the index of the resource that's to be marked as deleted.
    DeleteOld *int64 `json:"deleteOld,omitempty"`
    // DeleteNew is the operation ID of the new resource to be marked as deleted.
    DeleteNew *int64 `json:"deleteNew,omitempty"`
    // The resource state associated with this journal entry.
    State *ResourceV3 `json:"state,omitempty"`
    // The operation associated with this journal entry, if any.
    Operation *OperationV2 `json:"operation,omitempty"`
    // If true, this journal entry is part of a refresh operation.
    RebuildDependencies bool `json:"isRefresh,omitempty"`
    // The secrets manager associated with this journal entry, if any.
    SecretsProvider *SecretsProvidersV1 `json:"secretsProvider,omitempty"`
    // NewSnapshot is the new snapshot that this journal entry is associated with.
    NewSnapshot *DeploymentV3 `json:"newSnapshot,omitempty"`
}
These journal entries encode all the information needed to reconstruct the snapshot. Each journal entry can be sent in parallel from the engine, and the snapshot will still be fully valid. All journal entries have a sequence ID attached to them, and they need to be replayed in that order on the service side to produce a valid snapshot. It is, however, safe to replay even when some journal entries have not yet been received by the service and their sequence IDs are therefore missing: the engine only sends entries in parallel for resources whose parents and dependencies have been fully created and confirmed by the service.
This way we make sure that the resources list is always in the correct partial order that is required by the engine to function correctly, and for the snapshot to be considered valid.
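The ordering rule can be illustrated with a toy replay function (TypeScript for illustration; the real service-side code differs): entries are applied in sequence-ID order, and gaps from not-yet-received entries are tolerated.

```typescript
// A drastically simplified journal entry: just an ordering key and a
// stand-in payload for the state change it carries.
interface JournalEntryLite {
    sequenceID: number;
    payload: string;
}

// Replay applies entries in sequence-ID order, regardless of the order
// they arrived in; missing sequence IDs are simply skipped.
function replay(entries: JournalEntryLite[]): string[] {
    return [...entries]
        .sort((a, b) => a.sequenceID - b.sequenceID)
        .map((e) => e.payload);
}

console.log(replay([
    { sequenceID: 3, payload: "update-c" },
    { sequenceID: 1, payload: "create-a" },
    // sequence ID 2 not yet received; safe, because its resource has no
    // confirmed dependents yet
]).join(",")); // prints "create-a,update-c"
```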
The algorithm looks as follows:
# Apply snapshot writes. This replaces the full snapshot we have on the service.
# We do this if default providers change, because we don't emit steps for that, as
# we do for the rest of the operations.
snapshot = find_write_journal_entry_or_use_base(base, journal)

# Track changes
deletes, snapshot_deletes, mark_deleted, mark_pending = set(), set(), set(), set()
snapshot_replacements = {}
operation_id_to_resource_index = {}

# Process journal entries. This is the main algorithm, that adds new resources
# to the snapshot, removes existing ones, deals with refreshes, and operations
# that update outputs.
incomplete_ops = {}
has_refresh = false
index = 0
for entry in journal:
    match entry.type:
        case BEGIN:
            incomplete_ops[entry.op_id] = entry
        case SUCCESS:
            del incomplete_ops[entry.op_id]
            if entry.state and entry.op_id:
                resources.append(entry.state)
                operation_id_to_resource_index[entry.op_id] = index
                index += 1
            if entry.remove_old:
                snapshot_deletes.add(entry.remove_old)
            if entry.remove_new:
                deletes.add(entry.remove_new)
            if entry.pending_replacement:
                mark_pending.add(entry.pending_replacement)
            if entry.delete:
                mark_deleted.add(entry.delete)
            has_refresh |= entry.is_refresh
        case REFRESH_SUCCESS:
            del incomplete_ops[entry.op_id]
            has_refresh = true
            if entry.remove_old:
                if entry.state:
                    snapshot_replacements[entry.remove_old] = entry.state
                else:
                    snapshot_deletes.add(entry.remove_old)
            if entry.remove_new:
                if entry.state:
                    resources.replace(operation_id_to_resource_index[entry.remove_new], entry.state)
                else:
                    deletes.add(entry.remove_new)
        case FAILURE:
            del incomplete_ops[entry.op_id]
        case OUTPUTS:
            if entry.state and entry.remove_old:
                snapshot_replacements[entry.remove_old] = entry.state
            if entry.state and entry.remove_new:
                resources.replace(operation_id_to_resource_index[entry.remove_new], entry.state)

deletes = {operation_id_to_resource_index[i] for i in deletes}

# Now that we have marked all the operations, and created a new list of resources, we can
# go through them, and merge the list of new resources and old resources from the snapshot
# that remain together.
for i, res in enumerate(resources):
    if i in deletes:
        remove_from_resources(resources, i)

# Merge snapshot resources. These resources have not been touched by the update, and will
# thus be appended to the end of the resource list. We also need to mark existing resources as
# `Delete` and `PendingReplacement` here.
for i, res in enumerate(snapshot.resources):
    if i not in snapshot_deletes:
        if i in snapshot_replacements:
            resources.append(snapshot_replacements[i])
        else:
            if i in mark_deleted:
                res.delete = true
            if i in mark_pending:
                res.pending_replacement = true
            resources.append(res)

# Collect pending operations. These are stored separately from the resources list
# in the snapshot.
pending_ops = [op.operation for op in incomplete_ops.values() if op.operation]
pending_ops.extend([op for op in snapshot.pending_ops if op.type == CREATE])

# Rebuild dependencies if necessary. Refreshes can delete parents or dependencies
# of resources, without affecting the resource itself directly. We need to now remove
# these relationships to make sure the snapshot remains valid.
if has_refresh:
    rebuild_dependencies(resources)
The full documentation of the algorithm can be found in our developer docs.
pulumi state is a very central part of pulumi, so we wanted to be extra careful with the rollout to make sure we don’t break anything. We did this in a few stages:
We implemented the replay interface inside the pulumi CLI, and ran it in parallel with the current snapshotting implementation in our tests. The snapshots were then compared automatically, and tests made to fail when the result didn’t match.
Since tests can’t cover all possible edge cases, the next step was to run the journaler in parallel with the current snapshotting implementation internally. This was still without sending the results to the service. However we would compare the snapshot, and send an error event to the service if the snapshot didn’t match. In our data warehouse we could then inspect any mismatches, and fix them. Since this does involve the service in a minor way, we would only do this if the user is using the Cloud backend.
Next up was adding a feature flag for the service, so journaling could be turned on selectively for some orgs. At the same time we implemented an opt-in environment variable in the CLI (PULUMI_ENABLE_JOURNALING), so the feature could be selectively turned on by users, if both the feature flag is enabled and the user sets the environment variable. This way we could slowly start enabling this in our repos, e.g. first in the integration tests for pulumi/pulumi, then in the tests for pulumi/examples and pulumi/templates, etc.
Allow users to start opting in. If you want to opt-in with your org, please reach out to us, either on the Community Slack, or through our Support channels, and we’ll opt your org into the feature flag. Then you can begin seeing the performance improvements by setting the PULUMI_ENABLE_JOURNALING env variable to true.
Turn on the feature flag for everyone, but still require the PULUMI_ENABLE_JOURNALING env variable to be set to true. (We are here right now).
Flip the feature on by default, but still allow users to opt out using a PULUMI_DISABLE_JOURNALING env variable.
While these performance improvements hopefully make your day-to-day use of pulumi quicker and more enjoyable, we’re not quite done here. We’re looking at some other performance improvements that will hopefully speed up your workflows even more.
We’re excited to announce the Stash resource, a new built-in Pulumi resource that lets you save arbitrary values directly to your stack’s state. Whether you need to capture a computed result, record who first deployed your infrastructure, or persist configuration that should remain stable across updates, Stash provides a simpler and more ergonomic solution.
Infrastructure code often produces values that need to persist beyond a single deployment. Maybe you’re generating a random identifier that should stay consistent, tracking which team member initially set up a stack, or recording a timestamp from the first deployment. Previously, you’d need workarounds like external storage, custom resources, or careful state manipulation.
The Stash resource helps with that. It takes an input value, stores it in your stack’s state, and makes it available as an output property. The output property preserves the original value even when the input changes in subsequent deployments, making Stash perfect for “first-run” scenarios where you want to capture and preserve a value from the initial deployment.
Creating a Stash is straightforward. Here’s how you’d capture the username of whoever first deploys the stack (using Node.js):
import * as pulumi from "@pulumi/pulumi";
import * as os from "os"; // Node.js built-in module

const firstDeployer = new pulumi.Stash("firstDeployer", {
    input: os.userInfo().username,
});

export const originalDeployer = firstDeployer.output;
The first time this runs, both input and output will show the current user. On subsequent deployments by different users, input will update to show the new user, but output will continue returning the original deployer’s name.
Stash supports any value type—strings, numbers, objects, arrays, and nested structures. It also respects secret annotations, so if you stash a secret value, it stays encrypted in your state.
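Stashing a secret, for example, works the same way as stashing a plain string. A sketch (the generated token is illustrative):

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as crypto from "crypto";

// pulumi.secret marks the value, so the stashed copy stays encrypted
// in the stack's state like any other secret output.
const apiToken = new pulumi.Stash("apiToken", {
    input: pulumi.secret(crypto.randomBytes(16).toString("hex")),
});
export const token = apiToken.output;
```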
Since Stash preserves the original value by design, updating the stored value requires a replacement. You have several options:
Use --target-replace during pulumi up:
pulumi up --target-replace urn:pulumi:dev::my-project::pulumi:index:Stash::firstDeployer
Run pulumi state taint to mark the resource for replacement:
pulumi state taint urn:pulumi:dev::my-project::pulumi:index:Stash::firstDeployer
Use the replacementTrigger resource option to automate replacements based on value changes:
import * as pulumi from "@pulumi/pulumi";
import * as os from "os";

// Wrap the promise in an Output so its properties can be lifted below.
const remoteConfig = pulumi.output(fetch("https://example.com/my-service").then(response => response.json()));

const myStash = new pulumi.Stash("myStash", {
    input: os.userInfo().username,
}, {
    replacementTrigger: remoteConfig.someValue,
});

export const stashedValue = myStash.output;
With replacementTrigger, when remoteConfig.someValue changes, the Stash resource will be replaced and the new input value will be captured.
The Stash resource is available now in Pulumi v3.208.0 and later across all supported languages. Check out the Stash documentation for detailed examples in TypeScript, Python, Go, C#, and YAML.
We’d love to hear how you’re using Stash and any feedback you have on it! Share your use cases in our Community Slack or open an issue if you see any on pulumi/pulumi.
Happy hacking!
The upcoming retirement of ingress-nginx in early 2026 gives infrastructure teams both a deadline and an opportunity to rethink traffic management. Configuring the Ingress API often meant relying on controller-specific annotations that varied between implementations. The Gateway API offers a cleaner, standardized alternative. This post investigates the practical reality of this migration and explores why kgateway emerges as a robust solution for the future.
With ingress-nginx entering its sunset phase in early 2026, the Kubernetes community faces a decision point. While the controller served as the default standard for a decade, its architecture now struggles to meet modern requirements. The transition to the Gateway API offers a chance to adopt a standard designed for contemporary traffic patterns.
The Gateway API addresses the portability issues of its predecessor by establishing a standardized, expressive approach that behaves consistently across implementations. A technical evaluation of the available options points to kgateway as a particularly strong candidate for production workloads.
The Kubernetes SIG Network and Security Response Committee was direct in its announcement regarding the project’s future. After March 2026, ingress-nginx will cease to receive releases, bug fixes, or security patches. The repositories will become read-only, leaving existing deployments functional but unmaintained.
This decision stems from both resource constraints and technical debt. The project relied on a very small group of maintainers working primarily in their spare time. Furthermore, features that once provided flexibility, such as “snippets” for arbitrary NGINX configuration injection, now pose significant security liabilities. With no path to modernize the codebase, retirement was inevitable.
It is worth noting that the Ingress API itself remains supported. The retirement affects only the ingress-nginx controller. NGINX as a web server continues unchanged. However, clusters relying on this specific controller must prepare for a transition.
To identify affected resources, checking the cluster for specific pods can reveal the scope of the dependency:
```shell
kubectl get pods --all-namespaces --selector app.kubernetes.io/name=ingress-nginx
```
The Gateway API represents a fundamental shift in traffic management concepts. Some elements will look familiar to Ingress users, but the underlying philosophy addresses the “baggage” that the previous standard accumulated.
The core improvement lies in expressiveness. Advanced routing requirements (header matching, weighted traffic splitting, and traffic policies) are now native parts of the specification. This eliminates the need for the non-portable annotations that plagued the Ingress ecosystem.
The API also introduces a role-oriented structure that mirrors actual organizational workflows.
GatewayClass resources are managed by infrastructure teams to define the underlying controller.
Gateway resources are created by platform teams to specify entry points and listeners.
HTTPRoute resources are owned by application teams to define service traffic rules.
This separation allows for genuine self-service. Application teams can manage their routing logic without requiring broad cluster privileges.
The relationship between these resources defines the traffic flow:
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: kgateway
spec:
  controllerName: kgateway.dev/kgateway
```
A Gateway references this class to establish the entry point:
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-gateway
  namespace: default
spec:
  gatewayClassName: kgateway
  listeners:
    - name: http
      port: 8080
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: All
```
An HTTPRoute then attaches to the gateway:
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
  namespace: default
spec:
  parentRefs:
    - name: my-gateway
  hostnames:
    - "example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: api-service
          port: 8080
```
Beyond standard HTTP, the API supports GRPCRoute, TCPRoute, UDPRoute, and TLSRoute, offering a protocol diversity that the original Ingress spec lacked.
The Gateway API enables a security model where routes can cross namespace boundaries, but only with explicit permission. The ReferenceGrant resource controls these connections. Without a grant, an HTTPRoute in one namespace cannot reference a Service in another.
```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-routes-from-apps
  namespace: gateway-system
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: my-app
  to:
    - group: ""
      kind: Service
```
This mechanism allows platform teams to strictly control which namespaces can attach to shared gateways.
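The decision a Gateway API implementation makes here can be sketched in a few lines. This is an illustrative model only; the `Grant` shape and `referenceAllowed` function are invented for the example, not any controller's actual code:

```typescript
// Illustrative model of ReferenceGrant checking (invented types and names).
interface Grant {
    fromKind: string;      // e.g. "HTTPRoute"
    fromNamespace: string; // namespace the referencing route lives in
    toKind: string;        // e.g. "Service"
}

function referenceAllowed(routeNs: string, serviceNs: string, grantsInServiceNs: Grant[]): boolean {
    // Same-namespace references never need a grant.
    if (routeNs === serviceNs) return true;
    // Cross-namespace references require a matching grant in the target namespace.
    return grantsInServiceNs.some(g =>
        g.fromKind === "HTTPRoute" &&
        g.fromNamespace === routeNs &&
        g.toKind === "Service");
}

const grants: Grant[] = [
    { fromKind: "HTTPRoute", fromNamespace: "my-app", toKind: "Service" },
];

console.log(referenceAllowed("my-app", "gateway-system", grants));   // allowed by the grant
console.log(referenceAllowed("other-app", "gateway-system", grants)); // denied: no grant
```

The key detail the sketch captures is directionality: the grant lives in the target namespace and names the source, so the owner of the Service decides who may reference it.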
Several implementations have emerged to support the new standard, each with a distinct philosophy.
NGINX Gateway Fabric leverages NGINX as the data plane, offering continuity for teams in the F5 ecosystem. It provides impressive throughput and integrates with enterprise security tools, making it a logical choice for existing NGINX shops.
Traefik focuses on simplicity. As a single binary with no external dependencies, it aligns well with environments that prioritize developer experience and rapid iteration. Its declarative configuration is particularly friendly to GitOps workflows.
Envoy Gateway is built on the Envoy Proxy and operates under a community-driven governance model. It appeals to those seeking strong conformance to the Gateway API spec without vendor lock-in. Its companion project, Envoy AI Gateway, adds support for LLM routing.
kgateway, formerly Gloo Gateway, brings seven years of production history to the table. Donated to the CNCF in early 2025, it combines an Envoy data plane with a control plane optimized for scale. Its internal architecture uses the “krt” framework to handle massive route tables efficiently, avoiding the performance bottlenecks often seen in snapshot-based systems.
| Aspect | NGINX Gateway Fabric | Traefik | Envoy Gateway | kgateway |
|---|---|---|---|---|
| Base Technology | NGINX (C) | Custom Go | Envoy Proxy (C++) | Envoy Proxy (C++) |
| Maturity | Production-ready | Mature | GA (v1.0+) | Battle-tested (7+ years as Gloo) |
| Setup Complexity | Medium | Low | Low | Low |
| Commercial Support | F5 Enterprise | Traefik Labs | Multiple vendors | Solo.io |
| AI/LLM Support | No | No | Via Envoy AI Gateway | Native (Agentgateway) |
| Learning Curve | Familiar for NGINX users | Gentle | Moderate | Moderate |
| Community Size | Growing | Large | Large | Established |
| Best For | Teams with NGINX expertise, high-throughput requirements | Teams prioritizing simplicity and rapid deployment | Standards-focused teams, multi-vendor environments | Enterprise scale, AI workloads, Istio integration |
NGINX Gateway Fabric suits teams needing F5 support or deep NGINX tuning.
Traefik fits best where simplicity and operational ease are paramount.
Envoy Gateway works for those valuing community governance and strict standards adherence.
kgateway is the choice for enterprise scale, native AI gateway needs, and Istio integration.
This exploration uses kgateway for its proven track record and native handling of both traditional microservices and AI traffic.
The official migration guide outlines the transition process. It’s less translation, more restructuring.
The process typically involves:
Defining a Gateway resource: Unlike Ingress, listeners must be explicitly defined. This provides granular control over ports and protocols.
Creating HTTPRoute resources: These replace Ingress routing rules. You can split complex Ingress resources into multiple HTTPRoutes for better management.
Configuring filters: Annotations for headers or redirects are replaced by standardized filters within the HTTPRoute spec.
The ingress2gateway tool can automate much of the initial conversion, providing a baseline of resources to review and refine.
Scope the migration effort first:
1. Assessment: Identify the volume and complexity of existing resources.
```shell
# Count your Ingress resources across all namespaces (skip the header line)
kubectl get ingress -A --no-headers | wc -l

# List all Ingress resources with their hosts
kubectl get ingress -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {.spec.rules[*].host}{"\n"}{end}'
```
2. Parallel Operations: Running kgateway alongside ingress-nginx allows for incremental migration. Services can be moved one by one, validating the new configuration without risking the entire cluster’s traffic.
3. Rollback Strategy: Keeping the original Ingress manifests ensures that traffic can be quickly reverted to the old controller if issues arise during the transition.
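Incremental migration typically leans on weighted traffic splitting, which is native to HTTPRoute. As a sketch (resource names and weights here are illustrative), the manifest below is built as a plain object so the shape is easy to inspect; it keeps 90% of traffic on the proven backend while a canary receives the rest:

```typescript
// Illustrative HTTPRoute manifest with weighted backendRefs for a canary rollout.
const migrationRoute = {
    apiVersion: "gateway.networking.k8s.io/v1",
    kind: "HTTPRoute",
    metadata: { name: "checkout", namespace: "shop" },
    spec: {
        parentRefs: [{ name: "my-gateway" }],
        rules: [{
            backendRefs: [
                { name: "checkout-stable", port: 8080, weight: 90 }, // proven backend
                { name: "checkout-canary", port: 8080, weight: 10 }, // backend under validation
            ],
        }],
    },
};

console.log(JSON.stringify(migrationRoute.spec.rules[0].backendRefs, null, 2));
```

Shifting the weights over successive deployments moves traffic gradually, and setting the canary weight back to zero is the rollback.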
kgateway distinguishes itself through its maturity and architectural decisions. Having started as Gloo Gateway in 2018, it has hardened through years of production use.
Its control plane uses the krt framework, originally developed for Istio. This allows it to track dependencies precisely and recalculate only the parts of the configuration that have changed. In large clusters where pods churn constantly, this incremental approach prevents the control plane from becoming a bottleneck, enabling it to handle over 10,000 routes efficiently.
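A toy model conveys the incremental idea (invented names; this is not krt's actual code): an index from service to dependent routes means a pod or service change touches only its own routes rather than triggering a rebuild of the whole snapshot:

```typescript
// Toy model of incremental recomputation: look up only the routes that
// depend on the changed service instead of recomputing everything.
const routesByService = new Map<string, string[]>([
    ["svc-a", ["route-1", "route-2"]],
    ["svc-b", ["route-3"]],
]);

function affectedRoutes(changedService: string): string[] {
    return routesByService.get(changedService) ?? [];
}

console.log(affectedRoutes("svc-b")); // only route-3 needs recomputation
```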
For teams managing AI workloads, kgateway includes the Agentgateway component. This Rust-based data plane is built for LLM traffic, supporting the Model Context Protocol (MCP) and providing token-based rate limiting. It unifies the management of AI and traditional traffic under a single control plane.
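Token-based rate limiting differs from request counting in that each request draws down a budget of LLM tokens. A minimal illustrative sketch (not Agentgateway's implementation; the class and numbers are invented):

```typescript
// Illustrative token-based rate limiter: the budget is consumed by prompt
// token counts, not by number of requests.
class TokenBudget {
    constructor(private remaining: number) {}

    // Returns true and consumes the tokens if the request fits the budget.
    allow(promptTokens: number): boolean {
        if (promptTokens > this.remaining) {
            return false;
        }
        this.remaining -= promptTokens;
        return true;
    }
}

const budget = new TokenBudget(1000);
console.log(budget.allow(600)); // fits: 400 tokens remain
console.log(budget.allow(600)); // rejected: exceeds the remaining budget
```

One small request and one enormous prompt count very differently here, which is the point for LLM traffic.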
Additional features include:
Route Delegation: Allows large routing tables to be split across teams with clear inheritance.
Security Integration: Seamless mTLS with Istio ambient mesh and external authorization support.
Traffic Management: Advanced capabilities like traffic mirroring and session affinity are available without complex Envoy configuration.
Seeing these concepts in code clarifies the architecture. The following Pulumi program demonstrates a complete setup: a DigitalOcean Kubernetes cluster, the kgateway installation via Helm, Gateway API resources, and an httpbin application to validate the configuration.
The program provisions infrastructure in a specific sequence. The cluster comes first, followed by the Gateway API CRDs, then kgateway itself. Once the control plane is running, the program creates a GatewayClass, a Gateway listener, and an HTTPRoute that directs traffic to the sample application. A ReferenceGrant permits the cross-namespace reference between the HTTPRoute and the backend Service.
```typescript
import * as pulumi from "@pulumi/pulumi";
import * as digitalocean from "@pulumi/digitalocean";
import * as k8s from "@pulumi/kubernetes";

// Configuration
const config = new pulumi.Config();
const clusterName = config.get("clusterName") || "kgateway-demo-cluster";
const region = config.get("region") || "fra1";
const nodeSize = config.get("nodeSize") || "s-2vcpu-4gb";
const nodeCount = config.getNumber("nodeCount") || 2;
const k8sVersion = config.get("k8sVersion") || "1.32.10-do.2";

// Create a DigitalOcean Kubernetes cluster
const cluster = new digitalocean.KubernetesCluster(clusterName, {
    name: clusterName,
    region: region,
    version: k8sVersion,
    nodePool: {
        name: "default-pool",
        size: nodeSize,
        nodeCount: nodeCount,
    },
});

// Create a Kubernetes provider using the cluster's kubeconfig
const k8sProvider = new k8s.Provider("k8s-provider", {
    kubeconfig: cluster.kubeConfigs[0].rawConfig,
});

// Install Gateway API CRDs (standard channel)
const gatewayApiCrds = new k8s.yaml.ConfigFile("gateway-api-crds", {
    file: "https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml",
}, { provider: k8sProvider });

// Create kgateway-system namespace
const kgatewayNamespace = new k8s.core.v1.Namespace("kgateway-system", {
    metadata: {
        name: "kgateway-system",
    },
}, { provider: k8sProvider });

// Install kgateway CRDs via Helm
const kgatewayCrds = new k8s.helm.v3.Release("kgateway-crds", {
    chart: "oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds",
    version: "v2.1.2",
    namespace: kgatewayNamespace.metadata.name,
    createNamespace: false,
}, { provider: k8sProvider, dependsOn: [gatewayApiCrds, kgatewayNamespace] });

// Install kgateway control plane via Helm
const kgateway = new k8s.helm.v3.Release("kgateway", {
    chart: "oci://cr.kgateway.dev/kgateway-dev/charts/kgateway",
    version: "v2.1.2",
    namespace: kgatewayNamespace.metadata.name,
    createNamespace: false,
}, { provider: k8sProvider, dependsOn: [kgatewayCrds] });

// Create hello-world namespace
const helloWorldNamespace = new k8s.core.v1.Namespace("hello-world", {
    metadata: {
        name: "hello-world",
    },
}, { provider: k8sProvider });

// Deploy hello world application (using httpbin as shown in kgateway docs)
const helloWorldLabels = { app: "httpbin" };
const helloWorldDeployment = new k8s.apps.v1.Deployment("httpbin", {
    metadata: {
        name: "httpbin",
        namespace: helloWorldNamespace.metadata.name,
    },
    spec: {
        replicas: 1,
        selector: {
            matchLabels: helloWorldLabels,
        },
        template: {
            metadata: {
                labels: helloWorldLabels,
            },
            spec: {
                containers: [{
                    name: "httpbin",
                    image: "mccutchen/go-httpbin:v2.15.0",
                    ports: [{
                        containerPort: 8080,
                        name: "http",
                    }],
                }],
            },
        },
    },
}, { provider: k8sProvider, dependsOn: [helloWorldNamespace] });

const helloWorldService = new k8s.core.v1.Service("httpbin", {
    metadata: {
        name: "httpbin",
        namespace: helloWorldNamespace.metadata.name,
    },
    spec: {
        selector: helloWorldLabels,
        ports: [{
            port: 8000,
            targetPort: 8080,
            name: "http",
        }],
    },
}, { provider: k8sProvider, dependsOn: [helloWorldDeployment] });

// Create GatewayClass for kgateway
const gatewayClass = new k8s.apiextensions.CustomResource("kgateway-gateway-class", {
    apiVersion: "gateway.networking.k8s.io/v1",
    kind: "GatewayClass",
    metadata: {
        name: "kgateway",
    },
    spec: {
        controllerName: "kgateway.dev/kgateway",
    },
}, { provider: k8sProvider, dependsOn: [kgateway] });

// Create Gateway with LoadBalancer service (matching kgateway docs pattern)
const gateway = new k8s.apiextensions.CustomResource("http-gateway", {
    apiVersion: "gateway.networking.k8s.io/v1",
    kind: "Gateway",
    metadata: {
        name: "http",
        namespace: kgatewayNamespace.metadata.name,
    },
    spec: {
        gatewayClassName: "kgateway",
        listeners: [{
            name: "http",
            port: 8080,
            protocol: "HTTP",
            allowedRoutes: {
                namespaces: {
                    from: "All",
                },
            },
        }],
    },
}, { provider: k8sProvider, dependsOn: [gatewayClass] });

// Create HTTPRoute for httpbin application
const httpRoute = new k8s.apiextensions.CustomResource("httpbin-route", {
    apiVersion: "gateway.networking.k8s.io/v1",
    kind: "HTTPRoute",
    metadata: {
        name: "httpbin",
        namespace: helloWorldNamespace.metadata.name,
    },
    spec: {
        parentRefs: [{
            name: "http",
            namespace: kgatewayNamespace.metadata.name,
        }],
        hostnames: [
            "*.nip.io",
        ],
        rules: [{
            backendRefs: [{
                name: "httpbin",
                port: 8000,
            }],
        }],
    },
}, { provider: k8sProvider, dependsOn: [gateway, helloWorldService] });

// Create a reference grant to allow the HTTPRoute to reference the service cross-namespace
const referenceGrant = new k8s.apiextensions.CustomResource("httpbin-reference-grant", {
    apiVersion: "gateway.networking.k8s.io/v1beta1",
    kind: "ReferenceGrant",
    metadata: {
        name: "allow-gateway-to-httpbin",
        namespace: helloWorldNamespace.metadata.name,
    },
    spec: {
        from: [{
            group: "gateway.networking.k8s.io",
            kind: "HTTPRoute",
            namespace: helloWorldNamespace.metadata.name,
        }],
        to: [{
            group: "",
            kind: "Service",
        }],
    },
}, { provider: k8sProvider, dependsOn: [gatewayApiCrds] });

// Create a LoadBalancer Service for the Gateway proxy
const gatewayProxyService = new k8s.core.v1.Service("gateway-proxy", {
    metadata: {
        name: "gateway-proxy",
        namespace: kgatewayNamespace.metadata.name,
    },
    spec: {
        type: "LoadBalancer",
        selector: {
            "gateway.networking.k8s.io/gateway-name": "http",
        },
        ports: [{
            name: "http",
            port: 8080,
            targetPort: 8080,
            protocol: "TCP",
        }],
    },
}, { provider: k8sProvider, dependsOn: [gateway] });

// Extract the LoadBalancer IP from our managed Service
const gatewayIP = gatewayProxyService.status.apply(status => {
    const ingress = status?.loadBalancer?.ingress;
    if (ingress && ingress.length > 0) {
        return ingress[0].ip || ingress[0].hostname || "pending";
    }
    return "pending";
});

// Exports
export const clusterNameOutput = cluster.name;
export const clusterEndpoint = cluster.endpoint;
export const kubeconfig = pulumi.secret(cluster.kubeConfigs[0].rawConfig);

// Export the LoadBalancer IP and nip.io URL
export const gatewayLoadBalancerIP = gatewayIP;
export const httpbinUrl = gatewayIP.apply(ip =>
    ip !== "pending" ? `http://httpbin.${ip}.nip.io:8080/get` : "Waiting for LoadBalancer IP..."
);
```
The demo uses nip.io, a wildcard DNS service that resolves *.IP.nip.io hostnames to the specified IP address. This eliminates the need to configure separate DNS records during testing. Once deployed, the httpbinUrl output provides a ready-to-use endpoint for validating the gateway configuration.
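For clarity, the URL construction used by the httpbinUrl export can be isolated into a small helper. This refactoring is illustrative, not part of the program above; the port and path mirror the demo's listener and the httpbin `/get` endpoint:

```typescript
// Build the nip.io test URL for a given LoadBalancer IP, the same way the
// program's httpbinUrl export does.
function httpbinUrl(ip: string, port = 8080): string {
    return `http://httpbin.${ip}.nip.io:${port}/get`;
}

console.log(httpbinUrl("203.0.113.10"));
```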
For organizations that cannot immediately migrate to the Gateway API, there is another path forward. Chainguard has forked ingress-nginx as part of their EmeritOSS program, providing continued maintenance after the upstream project’s retirement.
This fork is not about adding new features. Chainguard is explicit that they are maintaining stability, not continuing development. Their commitment includes keeping dependencies updated, addressing CVEs on a best-efforts basis, and providing commercial container images with low vulnerability counts and SLAs. FIPS-compliant versions are also available for regulated environments.
The chainguard-forks/ingress-nginx repository on GitHub provides the maintained codebase. For teams running ingress-nginx in production, this fork offers a viable bridge while evaluating the Gateway API or other alternatives. However, this buys time rather than solving the problem permanently. The architectural limitations of the Ingress API remain, and the eventual move to a more expressive standard like the Gateway API is still the recommended path forward.
The retirement of ingress-nginx marks the end of an era and the beginning of a more structured approach to Kubernetes networking. The Gateway API offers a robust framework that reflects how teams actually build and secure applications today.
While March 2026 provides a comfortable runway, early planning prevents rushed decisions. Setting up a test environment with kgateway to validate the new API constructs is a prudent next step.
Moving to the Gateway API means building a networking foundation ready for the next decade of infrastructure challenges.
Gateway API Documentation – Comprehensive guides and examples.
Migration Guide – Official steps for transitioning from Ingress.
ingress2gateway – Tooling to automate resource conversion.
kgateway Documentation – Installation and configuration details.
Designing kgateway for Scalability – Deep dive into the krt framework.
Kubernetes Blog: ingress-nginx Retirement – Full context on the deprecation.
The era of AI-accelerated development has arrived, creating both unprecedented opportunity and unprecedented challenge. Developers ship code faster than ever, but platform teams struggle to keep pace. The velocity gap threatens to become a bottleneck.
As 2025 comes to a close, let’s look back at how we addressed this challenge.
This year, we took a giant leap forward to close that gap with several major innovations, including purpose-built AI for platform engineers, next-generation policy management that transforms governance into an accelerator, and the foundation for building Internal Developer Platforms that enable self-service without sacrificing control.
AI-Assisted Development Everywhere
Next-Generation Policy Management: AI-Powered Governance at Scale
Internal Developer Platform: Self-Service Infrastructure at Scale
Secrets Management: Taming Sprawl at Scale
Identity and Access Management
Pulumi IaC
Cloud Provider Support
Language Support
Infrastructure Operations
Kubernetes Operator
The Year Ahead
We launched Pulumi Neo, purpose-built AI specifically designed for platform engineering challenges. Neo fundamentally changed how platform engineers work.
The problem Neo solves is critical: while AI coding assistants have accelerated developers substantially, platform teams have struggled to keep pace. Every line of code that ships faster creates new platform needs - monitoring, secrets, pipelines, compliance checks. The velocity gap between development and platform teams was widening, threatening to become a bottleneck that would slow down entire organizations.
Neo levels the playing field by giving platform engineers their own dedicated AI tool. Unlike generic AI coding assistants that lack infrastructure context, Neo deeply understands cloud environments, infrastructure as code, secrets management, and the unique challenges platform teams face. It speaks their language and works within their constraints.
What makes Neo different is how it’s built. Neo operates on top of Pulumi’s existing platform capabilities - the same IaC, ESC, and policy features you already use for governance become Neo’s operational guardrails. It automatically respects your security and policy rules, works within your governance frameworks, and maintains the audit trails and compliance controls your organization requires. This isn’t experimental AI retrofitted with infrastructure plugins - it’s enterprise-ready intelligence built from the ground up on proven foundations.
Throughout the year, we enhanced Neo with capabilities that scale your team’s expertise. Custom Instructions and Slash Commands let you encode organizational standards once and Neo applies them automatically, while turning proven prompts into reusable shortcuts anyone can use. Operating Modes give you flexible control - from full review to autonomous execution. And full Pulumi CLI integration means Neo can handle complete infrastructure workflows, from stack operations to cloud CLI commands.
Beyond Neo, we brought AI assistance directly into development workflows throughout the year. The Pulumi Model Context Protocol (MCP) Server connects AI-powered code assistants with Pulumi’s CLI and registry, enabling real-time resource information without leaving your editor. Use it with GitHub Copilot, Anthropic’s Claude Code, and Cursor to accelerate resource discovery and write infrastructure code faster. We enhanced this with remote MCP server support for centralized management and CLI AI extensions for intelligent command-line assistance.
With Neo giving platform teams AI superpowers, we needed to ensure AI-accelerated development remained safe and compliant. The challenge isn’t detecting security issues - it’s fixing them at scale. Most policy tools stop at detection, leaving teams with overwhelming backlogs and no scalable way to remediate violations.
We ended that compromise with the next generation of Pulumi Policies. This comprehensive governance solution moves beyond detection to deliver AI-powered remediation through a two-step lifecycle:
Get Clean: We introduced the Policy Findings Hub that gives every stakeholder their needed view - leadership sees compliance scores, auditors get control-centric compliance views, and platform teams get a collaborative workspace to triage and track remediation. Combined with audit scans, you get instant compliance baselines without blocking developers. Neo integrates directly into the Policy Findings hub to automate the fix itself, generating pull requests with exact code changes needed. For unmanaged resources, Neo even generates code to import the resource into a Pulumi stack and apply the fix.
Stay Clean: We launched pre-built compliance packs for CIS, NIST, PCI, and HITRUST across AWS, Azure, and Google Cloud. These out-of-the-box policy packs let you enforce industry-standard compliance frameworks immediately without writing custom policies. They act as universal guardrails by blocking non-compliant changes during deployment. Neo can even author new custom policies - ask Neo to “create a policy that prevents overly permissive IAM roles,” and it generates the code. This is how we make AI safe to go fast.
Platform engineering teams face a persistent challenge: they’re constantly responding to infrastructure requests instead of building the systems that enable self-service at scale. This reactive cycle prevents platform teams from doing their most valuable work - establishing patterns, codifying standards, and creating golden paths that empower developers without sacrificing control.
We delivered the foundation to break this cycle with Pulumi IDP. Pulumi IDP provides everything platform teams need to build world-class internal developer platforms. Codify organizational standards into reusable building blocks. Give developers simple, approachable interfaces that enforce best practices automatically. Transform platform engineering from reactive ticket-taking to strategic enablement.
The key is Pulumi Private Registry, your organization’s source of truth for golden paths and platform building blocks. Private Registry provides streamlined publishing workflows and simplified discovery, making it easy to share infrastructure abstractions across your organization. Combined with our expanded public registry (now over 150 providers and 7,500 resource types), you have comprehensive options for managing infrastructure across your entire cloud estate.
What makes this powerful is next-generation Pulumi Components with true cross-language support. Author infrastructure abstractions once in your preferred language, then consume them in any supported Pulumi language - including YAML for non-programmers, and soon, HCL. Components include built-in input validation, detailed documentation, and improved error messages. This means platform teams can reach every developer in their organization regardless of language preference, scaling their expertise without scaling their team.
The proliferation of secrets across modern cloud environments creates massive security risks. Long-lived credentials pose significant vulnerabilities. Secrets scattered across systems make it nearly impossible to track usage or enforce consistent policies. And manual rotation processes are error-prone and rarely done.
2025 saw significant expansion of Pulumi ESC to address these challenges across your entire secrets ecosystem:
Automated Rotation: We launched ESC Rotated Secrets, automatically rotating credentials like AWS IAM access keys on flexible schedules. This eliminates manual rotation effort and significantly reduces vulnerability windows. We expanded this with database credential rotation for PostgreSQL and MySQL, including support for databases in private VPCs via AWS VPC Lambda connectors.
Universal Integration: We integrated ESC with the secrets management platforms you’re already using:
Snowflake - Dynamic OIDC tokens for temporary authentication plus automated RSA keypair rotation
Infisical - Dynamic OIDC login and secret fetching from the open-source secrets platform
Doppler - OIDC access tokens and centralized secret fetching
For systems without native support, we launched ESC Connect, enabling you to build simple HTTPS adapters that integrate any custom or proprietary secret source with ESC.
These integrations let you maintain existing secrets infrastructure while gaining ESC’s centralized management, audit capabilities, and governance controls. You don’t have to rip and replace - ESC works with what you have.
Better Experience: We improved ESC usability throughout the year with streamlined onboarding, approval workflows for sensitive environments, and deletion protection to prevent accidents.
Security & Trust: For organizations with strict compliance needs like HIPAA or GDPR, we launched Customer-Managed Keys (BYOK). This allows you to bring your own encryption keys (via AWS KMS) to encrypt secrets stored in ESC, giving you full control over key lifecycles and revocation.
Modern security demands unwavering trust in your security posture. How do you empower teams to deploy rapidly without opening doors to risk or violating compliance mandates?
We launched Pulumi IAM, embedding robust, granular security directly into your cloud development lifecycle. Pulumi IAM provides the unified framework for fine-grained authorization needed to confidently manage modern cloud infrastructure:
Custom Roles: Define reusable permissions with fine-grained scopes tailored to your organization’s specific needs
Least Privilege Enforcement: Control precisely who can do what on which specific resources, minimizing the impact if credentials are compromised
Secure Automation: Generate scoped access tokens for CI/CD pipelines with only the necessary permissions, eliminating over-privileged service accounts
Zero Trust Foundation: Verify every access request and grant minimum necessary access, implementing true Zero Trust principles
This foundational capability enables true least-privilege for CI/CD pipelines, reduces blast radius if tokens are compromised, and provides the compliance evidence auditors require. Platform and security teams finally have the fine-grained control needed to scale Pulumi usage securely across enterprise organizations without sacrificing velocity.
Platform teams need IaC tools that keep pace with rapidly evolving cloud platforms while providing operational flexibility and reliability. This year we shipped hundreds of enhancements to Pulumi’s core IaC capabilities, from major cloud provider updates to new operational primitives that give teams more control over infrastructure lifecycles.
Managing multi-cloud infrastructure requires comprehensive provider support that keeps pace with rapidly evolving cloud platforms. This year we shipped major provider updates:
AWS Provider 7.0 brought game-changing improvements: manage resources across multiple AWS regions with a single provider instance, enhanced IAM role chaining with better error handling, and simplified S3 resource management.
Azure Native V3 delivered a 75% reduction in SDK size while maintaining 100% Azure ARM API coverage. Faster downloads, more manageable package sizes, and improved reliability.
Google Cloud Provider 9.0 brought updated API versions, new modules for AI and Google Gemini, and expanded resource support for the latest GCP services.
Direct Terraform Module Support was one of our most requested features. Execute Terraform modules directly in Pulumi without conversion, providing a seamless migration path for module-heavy projects.
Java SDK 1.0 GA provides first-class Java support with feature parity to other Pulumi languages, support for all current LTS Java versions, and complete Automation API support.
Resource Hooks answered another long-standing request: run custom code at any point in a resource’s lifecycle - before creation, after updates, before deletion. This unlocks scenarios like validation checks before deployment, triggering external systems when infrastructure changes, and custom logging and auditing.
We also shipped dependent resource replacements, excluding specific targets from stack operations, state taint capabilities, improved refresh and destroy experience for short-lived credentials, and CLI control through environment variables.
The Pulumi Kubernetes Operator 2.0 reached GA with a completely rewritten, faster codebase featuring enhanced reconciliation logic, better error handling, and automatic retry for temporary failures. We continued enhancing it with version 2.3.0 adding preview mode for validating infrastructure changes and structured configuration support for GitOps workflows.
2025 was the year we gave platform engineers the tools to thrive in the age of AI-accelerated development. We ended the impossible choice between velocity and control. We transformed governance from a blocker into an accelerator. We enabled self-service without sacrificing governance.
But this is just the beginning. The foundation we built this year sets the stage for even bigger innovations ahead. Neo will get smarter as it learns from infrastructure patterns across the community. Policy management will expand to more compliance frameworks. IDP will enable more sophisticated self-service workflows. ESC will integrate with more platforms.
The future of platform engineering is strategic, proactive, and AI-enabled. Platform teams won’t be bottlenecks - they’ll be the strategic enablers who make sustainable velocity possible.
Thank you for being part of the Pulumi community. Your feedback drives everything we build. We can’t wait to show you what comes next.
Want to try these features? Check out our documentation, join our community Slack, or contact us to discuss how Pulumi can transform your platform engineering practice.
Here’s to an incredible 2025, and an even better 2026!
We work with thousands of customers who prefer Pulumi due to our modern approach to infrastructure that delivers faster time to market with built-in security and compliance. Yet we know many organizations have years of investments in tools like Terraform. At the same time, HashiCorp customers are increasingly telling us about their frustrations post-IBM acquisition: rate increases, loss of open source heritage, overnight rug-pull of CDKTF, … and the hits just keep on coming. Today, we're excited to announce three new ways Pulumi is enabling customers of HashiCorp, an IBM Company, who want a better, open source friendly, modern solution for their IaC to choose Pulumi.
First, Pulumi Cloud will support Terraform and OpenTofu, so you can continue using any Terraform or Pulumi CLI and language with the complete Pulumi Cloud product, including our infrastructure engineering AI agent, Neo.
Second, Pulumi's own open source IaC tool will support HCL natively as one of its many languages, alongside the industry's best languages including Python, TypeScript, Go, C#, Java, and YAML. Pulumi is multi-language at its core, and many organizations are diverse and polyglot; these new capabilities truly make Pulumi the most universal IaC platform with the broadest support.
Third, we're offering flexible financing to make it easy to depart HashiCorp for Pulumi.
Pulumi Cloud now manages Terraform/OpenTofu with full visibility, governance, and agentic AI included. Pulumi IaC now speaks HCL alongside general purpose languages and YAML. And we’ll cover your costs until your HashiCorp contract ends.
Pulumi has always been two things: Pulumi IaC, the multi-language, open source infrastructure as code technology, and Pulumi Cloud, our commercial platform that includes infrastructure state management and visibility, our AI agent Neo, secrets management, policy as code, governance, and much more. You can think of it like Git (the tool) versus GitHub (the SaaS).
Historically, we required that you choose Pulumi IaC to benefit from Pulumi Cloud, but over time we've been moving away from that, such as with our AI and governance features: you point us at any cloud accounts, and we'll help you make sense of and tame them, regardless of how the accounts' resources were deployed (Pulumi, CDK, Terraform, … even manual point and click). Today, we are taking that one step further and letting you run Terraform, OpenTofu, or Pulumi IaC as your tool of choice, while still benefitting from Pulumi Cloud's full suite of capabilities.
The great thing about this is that even if you choose Terraform or OpenTofu IaC on the client, you will still benefit from all of the capabilities of our server, from AI to governance to visibility and more. Terraform statefiles and workspaces will show up effectively the same way Pulumi stacks do, and you'll get full visibility of who is changing what and when, including diffs and logs.
Why would we do such a thing? As we’ve worked with larger and larger companies in our journey to thousands of customers, we’ve seen that there’s significant Terraform out there in the world. Even if a team’s long-term objective is to migrate to 100% Pulumi – reaping the many benefits of modern IaC, like faster time to market by catering better to a polyglot world of developers, infrastructure experts, security engineers, and AI/ML teams – that transition doesn’t happen overnight. Many teams legitimately want a mix of IaC tools. Ensuring all infrastructure is fully automated, secured, and managed is a more righteous outcome to focus on rather than debating one’s choice of IaC tool or language. There are many paths you can take, and now Pulumi can be your one platform to stay on top of all of it and drive towards this outcome.
Support for Terraform/OpenTofu state is in private beta and we are beginning to work with customers directly as we get it ready for prime time. We anticipate general availability in Q1 2026.
At Pulumi, we love our languages. We now support six – depending on how you count: Python, TypeScript, Go, any .NET language (like C#), any JVM language (like Java itself), and even YAML. Having this broad array of languages is a massive unlock: you suddenly get access to the full ecosystem of tooling and expertise around these languages, including rich syntax (for loops, if statements, functions), IDEs, testing frameworks, true sharing and reuse, and ensuring that LLMs deeply understand your IaC. This choice of language is then married with the best of declarative IaC, so you still get the belts and suspenders safety of a desired state IaC tool.
But we constantly work with customers where some of the team is more comfortable with, or genuinely prefers, HCL. The HCL language was purpose-built for IaC through Terraform and, now, OpenTofu, and is easy for simple use cases. We actually shipped YAML support two years ago because it's an industry standard and we kept hearing about simpler use cases where you didn't need a full-blown language (especially when code-generating IaC or supporting simple developer self-service CI/CD pipelines where a handful of lines of YAML do the trick). But despite that, there's a ton of muscle memory with HCL in the IaC community.
We are not dogmatic about languages; we love all of them. The L in HCL and YAML stands for "language" and we've always had a "come one, come all" mindset. As soon as we see enough market demand for a given language, we will add it. Well, that time has come for HCL.
The good news is that this is not a bolt-on. Just like any of the other Pulumi languages, you have full access to the entirety of the Pulumi ecosystem, including thousands of providers. Thanks to our Terraform bridge, if there's a Terraform provider out there, it just works. This is explicitly not meant to be a lift-and-shift migration option – we have many other techniques for that, including the new Terraform support in Pulumi Cloud – but is instead for teammates who are more comfortable with or prefer HCL over general-purpose languages, but still want to leverage the more modern Pulumi IaC engine with its advanced multi-language capabilities.
HCL support also integrates with Pulumi's multi-language technology in a deep way, so that you can author modules in one language and consume them from another. This will let, for example, platform teams author complex components in, say, Go – with the rich facilities offered by the language – and then expose them to teammates who consume them in HCL (or vice versa!).
Similar to Terraform support in Pulumi Cloud, HCL is currently in private beta and we will work with customers directly to ensure it meets our quality standards, with a goal of general availability in Q1 2026.
These two new technical capabilities augment an existing array of tools that make it easy for you to choose Pulumi as your IaC platform of choice, even in organizations with a hybrid mix of client-side IaC tools and languages, whether general-purpose, HCL, or some combination thereof.
Pulumi IaC supports using any Terraform provider. Many of them are available pre-built in the Pulumi Registry; in the event one isn't pre-published, you can generate one on demand using the "Any Terraform Provider" support. Pulumi doesn't use the Terraform engine; rather, it leverages the resource schemas and the create, read, update, and delete methods in the providers themselves.
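As a quick illustration (the provider source here is an example), generating an SDK on demand from your project directory looks like this:

```bash
# Generate and link a Pulumi SDK from any Terraform provider on demand.
pulumi package add terraform-provider hashicorp/random
```

The generated SDK is then imported and used from your program like any pre-published Pulumi provider package.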
We have numerous migration tools. The first is Neo, our infrastructure engineering agent, who has Terraform-specific skills to migrate code and state. Neo leverages a number of building blocks that you can use directly instead. That includes the pulumi convert --from terraform command, which understands how to convert Terraform/OpenTofu HCL or CDKTF code into a Pulumi language of your choosing, preserving code structure, including migrating modules to Pulumi components. We also support importing resources directly from your cloud accounts, either at the CLI or visually in Pulumi Cloud. Read more about migration here.
You can deploy Terraform modules straight off the shelf from your favorite Pulumi language. This works for any supported Pulumi language, including HCL, and the result shows up as though the underlying module resources were managed natively in Pulumi. So if your team has a collection of battle-tested best practices encoded as Terraform modules, you don't need to migrate right away, whether you plan to keep some Terraform or eventually move everything over.
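Under current CLI conventions (the module source, version, and local package name below are illustrative), pulling a registry module into a Pulumi project looks roughly like:

```bash
# Expose a Terraform registry module as a local Pulumi package named "vpcmod".
pulumi package add terraform-module terraform-aws-modules/vpc/aws 6.0.1 vpcmod
```

The generated package exposes the module as a resource type you instantiate from your program, with the module's inner resources then appearing in your stack.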
Finally, you can reference Terraform state outputs from Pulumi programs and configurations. Using this you could, for example, keep your network defined in Terraform and define a layer of higher level infrastructure that consumes VPC and subnet IDs from that lower-layer network workspace. This can be useful if there’s short- or long-term coexistence between Terraform and Pulumi IaC programs, such as during an ongoing migration effort or in a hybrid team.
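A sketch of that network example in TypeScript, using the Pulumi Terraform provider's remote state support (the backend settings and output names here are illustrative):

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as terraform from "@pulumi/terraform";

// Read outputs from a Terraform workspace whose state lives in S3.
const network = new terraform.state.RemoteStateReference("network", {
    backendType: "s3",
    bucket: "acme-tf-state",
    key: "network/terraform.tfstate",
    region: "us-west-2",
});

// Consume the lower layer's outputs from higher-level infrastructure.
export const vpcId = network.getOutput("vpc_id");
export const subnetIds = network.getOutput("private_subnet_ids");
```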
Even with the new technical capabilities above, we realize many customers already have HashiCorp contracts that may have them locked into a single vendor. The last thing we want is for you to have to pay for two IaC solutions for a period of time. Additionally, we want to ensure that the transition is as smooth as possible and your team is set up for long-term success.
To make it easier to say yes to Pulumi, we are offering three things:
Escape hatch for your current contract. We know paying for two IaC solutions at once is a non-starter, so we’re letting you apply credits purchased from HashiCorp towards your Pulumi usage until your next renewal, avoiding double pay.
Free IaC modernization workshop. Our professional services cloud architects will host a free IaC modernization workshop to review where you’re at with your IaC already and share best practices of how to adopt the Pulumi platform at scale we’ve learned from working with world-class organizations like BMW, NVIDIA, and Supabase. You will leave this session trained up and equipped to succeed with the next phase of your IaC journey.
Return on investment (ROI) calculation. We will show you how the move to Pulumi will not only be spend-neutral thanks to the escape hatch, but how much value and savings you should expect to see, given our experience helping innovators like Snowflake accelerate their time to market – going from code to cloud in weeks to hours.
These ensure there’s no financial penalty for switching, a very clear ROI, and no learning curve. We have always been proud to work with customers of all sizes in all industries, so these offers are available to you whether you’re a Global 2000, startup, or somewhere in between.
We’ve worked with thousands of companies over 8+ years to build what we view as the most innovative infrastructure as code platform on the market. We’re biased, of course, but our customers range from innovative startups, to established public technology companies, to Global 2000, and everywhere in between, and they tell us this all of the time.
For example:
Snowflake radically improved time to market on their path to IPO. “When we demonstrated to people that what used to take a week and a half now, with Pulumi, took under a day, they were shocked.”
Lemonade switched from Terraform to Pulumi so they could embed business logic into infrastructure, share and reuse logic, and scale their lean ops team to support a much larger group of developers. “We’re not limited to one-size-fits-all configurations, but can actually implement environment-specific customizations for our infrastructure.”
BMW was able to establish a center of infrastructure excellence that they call CodeCraft, standardizing all infrastructure delivery, and scaling to support 10,000+ developers.
Supabase was able to scale to meet the heightened demands and pace of AI, saying that “the infrastructure team acts as groundkeepers of our Pulumi practices, not gatekeepers, but promoters for the entire org."
Wiz manages over 1 million resources, tens of thousands of Kubernetes clusters, and hundreds of data centers, with over 100,000 daily updates.
Here’s why:
Language choice. Especially with the addition of Terraform/OpenTofu support and native HCL language support in Pulumi IaC, the Pulumi platform truly is the most universal IaC platform available. It democratizes access to infrastructure across the entire organization, which often has diverse skillsets. By embracing general-purpose languages, it allows engineering teams to apply rigor to how they manage infrastructure, improving productivity (which translates directly to time to market), robustness, and standards compliance.
AI native. We’ve been infusing AI into our platform for over three years now, culminating in the release of our infrastructure engineering agent, Neo. Neo is like Claude Code for your infrastructure and allows you to automate short- and long-horizon infrastructure tasks, like spinning up and scaling infrastructure, upgrading clusters, getting compliant, reducing cloud waste, and so much more. Neo works over any infrastructure no matter how it was provisioned.
Standard enterprise capabilities and controls. Regardless of language or IaC tool choice, you still get one central, standardized set of capabilities and controls. This ensures you have one "mission control" from which to standardize, secure, and govern your entire cloud estate, regardless of client-side tool choice (and even in the presence of click-ops).
Full visibility of what's happening, when, where, and why. The Pulumi platform always gives you total visibility into your cloud estate so you can quickly understand what is going on. This helps you get your arms around your cloud and chart a course to 100% IaC, the table-stakes nirvana many organizations are trying to reach, of which the choice of language is just one small part.
The final point isn’t even a technical one, but we hear it from our customers all the time:
Customer love as our #1 company value. Our first company value is "when the customer is successful, we are too", and we live and breathe it daily. Just yesterday I visited a new customer to review how their project is going; their head of cloud infrastructure and EVP of product were both in the room, and both went out of their way to express gratitude for how our team really showed up to help get them on the right track. We don't view customer relationships as transactions; we view them as lifetime partnerships, and we treat our customers with the respect they deserve. We won't sleep at night until you're on the cloud path you envisioned when selecting Pulumi.
To learn more about our product capabilities, visit these pages:
If you’d like to join the private beta waitlist for the new Terraform/OpenTofu and HCL capabilities, or take advantage of the financial flexibility options, please get in touch.
We will work with you closely on a three-step process to adopt Pulumi: first, the modernization workshop; then, a rapid proof of value; and finally, an adoption plan that avoids you paying both Pulumi and HashiCorp at the same time.
If you’d like to try Pulumi on your own immediately, you can sign up for Pulumi Cloud here, or go through our open source IaC getting started tutorial here.
We can’t wait to earn the right to work together.
In July 2020, CDK for Terraform (CDKTF) was introduced, and last week, on December 10, it was officially deprecated. Support for CDKTF has stopped, the organization and repository have been archived, and HashiCorp/IBM will no longer update or maintain it, leaving many teams without a clear path forward.
For most teams, that means it’s time to start looking for a replacement.
It’s an unfortunate situation to suddenly find yourself in as a user of CDKTF, but you do have options, and Pulumi is one of them. In this post, we’ll help you understand what those options are, how Pulumi fits into them, and what it’d look like to migrate your CDKTF projects to Pulumi.
Teams migrating away from CDKTF generally have three options:
HashiCorp’s official recommendation is to export your projects to HashiCorp Configuration Language (HCL) and manage them with Terraform. CDKTF even has a command that makes this fairly simple:
cdktf synth --hcl
Of course, if you’re using CDKTF, you probably chose it specifically to avoid HCL. So while possible, this probably isn’t the choice most teams would make unless they had to.
If your team is all-in on AWS, another option would be to migrate to AWS CDK. It’s widely used, officially supported, the programming model is similar to CDKTF’s, and both CDK and CDKTF transpile to an intermediate format (CloudFormation YAML and Terraform JSON, respectively) that gets passed on to their underlying tools for deployment.
But while their programming and deployment models are conceptually similar, their resource models and APIs are entirely different. Here’s the code for an S3 bucket written in AWS CDK, for example:
import * as s3 from 'aws-cdk-lib/aws-s3';
const bucket = new s3.Bucket(this, 'my-bucket', {
bucketName: 'my-example-bucket',
versioned: true,
publicReadAccess: false,
});
And here’s the code for a similarly configured bucket in CDKTF:
import { S3Bucket } from '@cdktf/provider-aws/lib/s3-bucket';
const bucket = new S3Bucket(this, 'my-bucket', {
bucket: 'my-example-bucket',
versioning: {
enabled: true,
},
acl: 'private',
});
Notice how different these APIs are — and this is just one simple resource with only a few properties; imagine having to rewrite dozens or hundreds of them. Beyond that, there’s also the problem of state: How would you go about translating the contents of a Terraform state file containing hundreds of resources into the equivalent CloudFormation YAML or JSON?
Despite their surface similarities, CDKTF and AWS CDK have little in common. Migration would essentially mean a ground-up rewrite that’d also leave you without the multi-cloud support you already have with CDKTF. For most teams, that makes this option a practical non-starter.
This is where we should acknowledge our obvious bias — but we genuinely believe that for most users of CDKTF, Pulumi really is the simplest and most broadly compatible alternative.
Like CDKTF, Pulumi lets you build and manage your infrastructure with general-purpose languages like TypeScript, Python, Go, C#, and Java, and it supports organizing your code into higher-level abstractions called components, which you can think of like CDKTF constructs. Both organize cloud resources into stacks (think dev, prod), and both track deployment state similarly, with local, remote, and cloud-hosted options available.
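For example, a CDKTF construct maps naturally onto a Pulumi component; here's a minimal sketch (resource names and the component type token are illustrative):

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// A Pulumi component plays the same role as a CDKTF construct:
// a reusable, named grouping of related resources.
class StaticSite extends pulumi.ComponentResource {
    public readonly bucketName: pulumi.Output<string>;

    constructor(name: string, opts?: pulumi.ComponentResourceOptions) {
        super("examples:web:StaticSite", name, {}, opts);
        const bucket = new aws.s3.Bucket(`${name}-bucket`, {
            acl: "private",
        }, { parent: this }); // parent the child to the component
        this.bucketName = bucket.bucket;
        this.registerOutputs({ bucketName: this.bucketName });
    }
}

const site = new StaticSite("docs");
export const bucketName = site.bucketName;
```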
Many of Pulumi’s most popular providers (e.g., the AWS provider) are also built from open-source Terraform schemas, which means their resource models will be nearly identical to what you’re used to with CDKTF. Here’s what an S3 bucket looks like in Pulumi, for example:
import * as aws from '@pulumi/aws';
const bucket = new aws.s3.Bucket('my-bucket', {
bucket: 'my-example-bucket',
versioning: {
enabled: true,
},
acl: 'private',
});
You can also use any Terraform provider with Pulumi, and you can even reference Terraform modules directly from within your Pulumi code.
Pulumi is also different from CDKTF in several ways. One is that rather than transpiling your source code to a format like JSON as CDKTF does (and then deploying it separately later), Pulumi uses its own declarative deployment engine that resolves the resource graph at runtime and provisions cloud resources directly, which is much faster and more flexible. You can learn more about the deployment model in How Pulumi Works.
Given the API similarities, the support for all Terraform providers and modules, the ability to coexist alongside Terraform-managed projects, and the built-in support for conversion (which we’ll cover next), we think Pulumi is the best alternative for most teams looking to migrate.
Migrating a CDKTF project to Pulumi generally happens in three steps:
Conversion, which translates your CDKTF code into a new Pulumi program
Import, which reads the contents of your CDKTF state into a new Pulumi stack
Refactoring, which brings the code in the new program into alignment with the stack’s currently deployed resources
Migration starts with exporting your CDKTF project to HCL with cdktf synth. From there, Pulumi’s built-in convert and import commands handle creating the new program and importing your state:
# Export your project to HCL.
cdktf synth --hcl
# Convert the HCL into a new Pulumi project.
pulumi convert --from terraform --language typescript
# Create a new Pulumi stack.
pulumi stack init dev
# Import your CDKTF stack's resources into your new Pulumi stack.
pulumi import --from terraform ./terraform.dev.tfstate
The converter automatically translates Terraform input variables, data sources, resources, and outputs into their Pulumi equivalents. You can read more about how this works in Converting Terraform HCL to Pulumi.
Once you’ve imported your state, you’ll often have to make some adjustments to the code to bring it in line with the new Pulumi stack. For instance, pulumi import marks new resources protected by default, to prevent them from being accidentally deleted — but since the code produced by pulumi convert doesn’t include the protect resource option, you’ll need to add it yourself. Fortunately the import step also emits code that you can copy into your program to make this process a little easier.
Refactoring can get a bit more complicated when custom logic and higher-level abstractions are involved, as fidelity to the original CDKTF code is often lost in the translation to HCL. In these situations, having the help of an LLM to recapture that original logic or translate your CDKTF constructs into Pulumi components can be a big time-saver.
The best way to get a feel for how this works, though, is to try it yourself.
The pulumi/cdktf-to-pulumi-example repository on GitHub contains a CDKTF project with multiple stacks written in TypeScript, along with a guide that walks you through the process of migrating that project to Pulumi. The guide covers everything we’ve discussed here so far, including:
Converting the CDKTF project into a new Pulumi project
Importing its actively running resources into Pulumi stacks
Modifying the generated code to align with imported state
Performing an initial deployment with Pulumi to complete the migration process
The walkthrough takes only a few minutes to complete, and it’s a great way to stand up an example of your own to get more familiar with Pulumi.
[github.com/pulumi/cdktf-to-pulumi-example](https://github.com/pulumi/cdktf-to-pulumi-example)
If you’re moving on from CDKTF and looking for an alternative, there are a few possible paths forward. For teams that want to keep using real languages and avoid a ground-up rewrite, Pulumi offers the clearest way forward.
To learn more about how Pulumi works, how it differs from CDKTF and from Terraform, how to handle additional conversion scenarios, and more, we recommend:
Diving into the Pulumi docs to get familiar with core concepts and features of the platform
Reading Migrating from Terraform or CDKTF to Pulumi for more detailed, Terraform-specific migration guidance
Joining us in the Pulumi Community Slack to ask questions and learn from others who’ve successfully made the leap from Terraform and CDKTF to Pulumi
Checking out Pulumi for All Your IaC — Including Terraform and HCL to learn more about Pulumi’s native support for Terraform and HCL
And of course, feel free to reach out! We’d love to help in any way we can.
Pulumi Cloud helps teams manage and operate their cloud infrastructure in one place, from state and secrets to deployments, visibility, and policy enforcement.
For a long time, one request has consistently come up from the Pulumi community: dark mode. Today, we’re announcing that this request is now available in Pulumi Cloud.
Pulumi Cloud supports light mode and dark mode. You can switch themes at any time from the utility bar.
Light mode remains the default experience. Dark mode uses lighter text and UI elements on a darker background, which many users prefer for extended sessions or low-light environments.
This update is enabled by recent work from our User Experience team to introduce a shared design system across Pulumi Cloud. With that foundation in place, theming can now be applied consistently across pages and features.
Want to try dark mode out for yourself? Sign in to your Pulumi Cloud account, or if you are new to Pulumi, create a free account.
We would love to hear your feedback. You can reach us in the Pulumi Community Slack or share requests in the public GitHub repo.
We look forward to hearing what you think of these changes!
Managing credentials in CI/CD pipelines has always involved tradeoffs. Long-lived access tokens are convenient but create security risks when they leak or fall into the wrong hands. Short-lived credentials are more secure but require additional tooling to obtain and manage. Today, we’re eliminating this tradeoff with native OIDC token exchange support in the Pulumi CLI.
The Pulumi CLI now includes built-in support for exchanging OIDC tokens from your identity provider for short-lived Pulumi Cloud access tokens. This means you can authenticate to Pulumi Cloud directly from CI/CD environments like GitHub Actions, GitLab CI, or Kubernetes without storing any long-lived Pulumi credentials as secrets.
Most CI/CD workflows authenticate to Pulumi Cloud using personal access tokens or organization tokens stored as secrets. While this approach works, it comes with significant security concerns:
Credential exposure: If a token is accidentally committed to a repository or logged in CI output, attackers gain long-term access to your infrastructure
Rotation complexity: Rotating tokens requires updating secrets across multiple CI/CD systems
Over-privileged access: Tokens often have broader permissions than needed for specific workflows
Audit trail gaps: Difficult to trace which workflow run used which credentials
With OIDC token exchange, you eliminate these risks by leveraging short-lived tokens that your CI/CD platform or identity provider already issues. No long-lived secrets to manage, rotate, or secure.
The pulumi login command now accepts OIDC tokens directly:
pulumi login --oidc-token <token> --oidc-org <organization>
The CLI exchanges your OIDC token for a short-lived Pulumi Cloud access token, which is then used for all subsequent
operations. Tokens expire after 2 hours by default, though you can customize this with the --oidc-expiration flag.
You can scope tokens to specific teams or users:
# Scope to a team
pulumi login --oidc-token <token> --oidc-org my-org --oidc-team platform-team
# Scope to a user
pulumi login --oidc-token <token> --oidc-org my-org --oidc-user alice
The --oidc-token flag accepts either a raw token string or a file path prefixed with file://, making it easy to
integrate with various token delivery mechanisms.
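For instance, a CI step (the environment variable and file path here are illustrative) might write the platform-issued token to a file and log in with it:

```bash
# Persist the CI-issued OIDC token, then exchange it for a short-lived
# Pulumi Cloud token scoped to the organization.
echo "$CI_OIDC_TOKEN" > /tmp/pulumi-oidc-token
pulumi login --oidc-token file:///tmp/pulumi-oidc-token --oidc-org my-org
```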
For workloads running in Kubernetes, you can use service account tokens and exchange them for Pulumi access tokens. The following example uses a Pulumi program to define a Kubernetes Job resource.
import * as kubernetes from "@pulumi/kubernetes";
const script = new kubernetes.core.v1.ConfigMap("script", {
data: {
"entrypoint.sh": `#!/bin/bash
EKS_ID_TOKEN=$(cat /var/run/secrets/eks.amazonaws.com/serviceaccount/token)
pulumi login --oidc-token $EKS_ID_TOKEN --oidc-org MY_ORG_NAME
pulumi whoami
`
}
});
const job = new kubernetes.batch.v1.Job("runner", {
metadata: {},
spec: {
template: {
spec: {
serviceAccountName: "pulumi-service-account",
containers: [{
name: "runner",
image: "pulumi/pulumi:latest",
command: ["/bin/entrypoint.sh"],
volumeMounts: [
{
name: "script",
mountPath: "/bin/entrypoint.sh",
readOnly: true,
subPath: "entrypoint.sh",
},
],
}],
restartPolicy: "Never",
volumes: [
{
name: "script",
configMap: {
defaultMode: 0o700,
name: script.metadata.name,
},
},
],
},
},
backoffLimit: 0,
},
});
export const jobName = job.metadata.name;
This approach works with any Kubernetes cluster that supports service account token projection, including EKS, GKE, and
AKS. The example uses EKS’s default token location at /var/run/secrets/eks.amazonaws.com/serviceaccount/token, but you
can adapt the token path for other Kubernetes distributions.
Before using OIDC token exchange with the Pulumi CLI, you need to:
Register your OIDC provider as a trusted issuer in your Pulumi organization settings
Configure authorization policies that specify which tokens can be exchanged and what permissions they receive
Ensure your CI/CD system or identity provider is configured to issue OIDC tokens with the appropriate audience claim
Native OIDC token exchange is available now in the latest version of the Pulumi CLI. To get started:
Update to the latest Pulumi CLI version
Configure your OIDC provider and authorization policies in Pulumi Cloud
Update your CI/CD workflows to use pulumi login --oidc-token
For complete documentation, including setup guides for specific identity providers, see:
We’re excited to see how this feature helps you build more secure infrastructure automation workflows. If you have questions or feedback, join us in the Pulumi Community Slack.