
Self-host Gemma 4 on your Mac with Pulumi and Tailscale

If you run AI tools and agents, you’ve probably accepted three tradeoffs: your data leaves your network, you can’t work offline, and your bill scales with usage.

Open-weight models now run well on consumer hardware. Once the model is on your machine, your data stays local, inference works offline, and tokens cost nothing. If you own a modern Mac, you can run a high-quality model yourself.

Gemma 4 is an open-weights model family from Google. This post focuses on Gemma 4 12 B, released in June 2026, using Unsloth’s Q8_0 GGUF. The 12 B model fits comfortably on a modern Mac while leaving enough headroom for local llama.cpp and a chat UI.

We’ll use llama.cpp for host-native inference, k3d for a local Kubernetes cluster, Pulumi for infrastructure as code, and Tailscale for secure access.

Prerequisites

This setup was validated on the following hardware and OS:

  • macOS 26 Tahoe, version 26.5
  • MacBook Pro with Apple M3 Max
  • 36 GB RAM

On this machine, llama.cpp reported about 20 output tokens per second for a 160-token validation response with unsloth/gemma-4-12b-it-GGUF, gemma-4-12b-it-Q8_0.gguf, and a 131,072-token context. Sustained throughput varies by prompt length, thermal state, and llama.cpp settings.

You’ll need brew, docker, pulumi, and tailscale installed. We’ll also install k3d during the process.
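Before going further, it can help to confirm that the tools are on your PATH. The following is a minimal sketch; the function names `missing_tools` and `check_prerequisites` are made up for this example and are not part of any of the tools.

```shell
# Print the name of every listed command that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the tools this post expects before you start.
check_prerequisites() {
  local missing
  missing="$(missing_tools brew docker pulumi tailscale)"
  if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
    return 1
  fi
  echo "All prerequisites found."
}
```

Run `check_prerequisites` in the terminal you plan to use for the rest of the steps.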

Run Gemma 4 with host-native llama.cpp

We use llama.cpp directly on macOS to take advantage of Apple Metal acceleration. Running the LLM on the host is the practical choice because containers in a local Kubernetes VM cannot use the Mac’s Metal GPU.

Install the build tools:

```shell
brew install cmake git
```

Then build llama.cpp from source and download the multimodal projector. In validation, Homebrew llama.cpp 9430 could run text inference, but it could not load the new Gemma 4 12 B projector and failed with unknown projector type: gemma4uv. Building current llama.cpp from source fixed that.

```shell
llm_home="$HOME/pulumi-gemma4-llm"
mkdir -p "$llm_home/models" "$llm_home/logs"

if [ ! -d "$llm_home/llama.cpp/.git" ]; then
  git clone --depth 1 https://github.com/ggml-org/llama.cpp.git "$llm_home/llama.cpp"
fi

cmake -S "$llm_home/llama.cpp" \
  -B "$llm_home/llama.cpp/build" \
  -DGGML_METAL=ON \
  -DGGML_BLAS=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build "$llm_home/llama.cpp/build" --target llama-server -j 10

curl -L --fail \
  --output "$llm_home/models/mmproj-F16.gguf" \
  https://huggingface.co/unsloth/gemma-4-12b-it-GGUF/resolve/main/mmproj-F16.gguf
```
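The build command above hard-codes `-j 10`. If your Mac has a different core count, a small helper can pick the value for you. This is a sketch, not part of llama.cpp; `build_jobs` is a hypothetical name, and the fallback value of 4 is an arbitrary assumption.

```shell
# Print a parallel job count for "cmake --build".
# Tries the macOS sysctl key first, then getconf, then falls back to 4.
build_jobs() {
  local n
  n="$(sysctl -n hw.ncpu 2>/dev/null)"
  case "$n" in
    ''|*[!0-9]*) n="$(getconf _NPROCESSORS_ONLN 2>/dev/null)" ;;
  esac
  case "$n" in
    ''|*[!0-9]*) n=4 ;;
  esac
  printf '%s\n' "$n"
}

# Usage:
# cmake --build "$llm_home/llama.cpp/build" --target llama-server -j "$(build_jobs)"
```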

Then download and run the model with this command:

```shell
"$HOME/pulumi-gemma4-llm/llama.cpp/build/bin/llama-server" \
  --hf-repo unsloth/gemma-4-12b-it-GGUF \
  --hf-file gemma-4-12b-it-Q8_0.gguf \
  --mmproj "$HOME/pulumi-gemma4-llm/models/mmproj-F16.gguf" \
  --host 127.0.0.1 \
  --port 18080 \
  --ctx-size 131072 \
  --parallel 1 \
  --jinja \
  --reasoning off
```

We use port 18080 because 8080 is commonly used and is likely to conflict with another service you may already have running locally. If your port 8080 is free, you can use it and adjust the Pulumi config later.
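To see whether a port is already taken before you pick one, you can probe it from the shell. The sketch below assumes the `nc` utility that ships with macOS is available; `port_in_use` is a hypothetical helper name. If `nc` is missing, the function reports the port as free, so treat the result as a hint.

```shell
# Return 0 if something accepts TCP connections on 127.0.0.1:PORT.
port_in_use() {
  local port="$1"
  nc -z 127.0.0.1 "$port" >/dev/null 2>&1
}

# Usage:
# if port_in_use 8080; then
#   echo "8080 is taken, keep 18080"
# else
#   echo "8080 looks free"
# fi
```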

The model file is about 12.65 GB, and the projector is about 116 MB. Gemma 4 12 B advertises a 131,072-token context, and this Mac loaded that full context with --parallel 1. llama.cpp projected about 15.1 GiB of Apple Metal device memory for the text model and about 258 MiB worst-case memory for the projector, leaving enough headroom for Open WebUI and the rest of the local stack. The --reasoning off flag keeps OpenAI-compatible chat responses visible in clients that do not read separate reasoning fields.
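The headroom claim can be checked with quick arithmetic. This sketch treats the 36 GB of RAM as 36 GiB, which is an assumption, and it ignores memory used by macOS and other apps. `memory_headroom_gib` is a name invented for this example.

```shell
# Print remaining memory in GiB after the model and projector are loaded.
# Arguments: total GiB, model GiB, projector MiB.
memory_headroom_gib() {
  awk -v total="$1" -v model="$2" -v proj_mib="$3" \
    'BEGIN { printf "%.1f\n", total - model - proj_mib / 1024 }'
}

# 36 GiB total, 15.1 GiB for the text model, 258 MiB for the projector.
memory_headroom_gib 36 15.1 258   # prints 20.6
```

Roughly 20 GiB remains for the operating system, Docker, k3d, and Open WebUI under these assumptions.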

With --mmproj, /v1/models advertised capabilities: ["completion","multimodal"]. In local validation, Open WebUI accepted an uploaded Pulumi logo image and Gemma 4 described it correctly. A small WAV file also worked through the OpenAI-compatible input_audio request shape, though llama.cpp logs still mark audio input as experimental.

Open WebUI using local Gemma 4 12 B to describe an uploaded Pulumi logo
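If you want to try image input without Open WebUI, you can send the image as a base64 data URL in an OpenAI-style chat request. The sketch below only builds the request body. `build_image_request` is a hypothetical helper, `logo.png` is a placeholder path, and the prompt is assumed to contain no characters that need JSON escaping.

```shell
# Build an OpenAI-style chat request body with one text part and one image part.
# Arguments: image path, MIME type, prompt text.
build_image_request() {
  local image_path="$1" mime_type="$2" prompt="$3" encoded
  # Encode the file and strip line breaks so the data URL stays on one line.
  encoded="$(base64 < "$image_path" | tr -d '\n')"
  cat <<EOF
{
  "model": "unsloth/gemma-4-12b-it-GGUF",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "$prompt"},
      {"type": "image_url", "image_url": {"url": "data:$mime_type;base64,$encoded"}}
    ]
  }],
  "max_tokens": 128
}
EOF
}

# Usage:
# build_image_request logo.png image/png "Describe this image." |
#   curl http://127.0.0.1:18080/v1/chat/completions \
#     -H "Content-Type: application/json" -d @-
```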

Verify the LLM API

Open a new terminal and check if llama.cpp is responding:

```shell
curl http://127.0.0.1:18080/v1/models
```

The /v1/models endpoint should return the model ID unsloth/gemma-4-12b-it-GGUF. Now try a chat completion:

```shell
curl http://127.0.0.1:18080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/gemma-4-12b-it-GGUF",
    "messages": [{"role": "user", "content": "Reply with exactly: OK"}],
    "temperature": 0,
    "max_tokens": 32
  }'
```

The response content should be OK. In validation, llama.cpp reported about 20 output tokens per second for a longer 160-token response.
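On a first run the server has to download the model before it can answer, so these checks may fail for a while. A small retry loop avoids rerunning curl by hand. This is a generic sketch; `wait_for` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary.

```shell
# Run a probe command until it succeeds or the attempts run out.
# Arguments: max attempts, delay in seconds, then the probe command.
wait_for() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage: poll the model list every 10 seconds, up to 180 times.
# wait_for 180 10 curl --silent --fail http://127.0.0.1:18080/v1/models
```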

Keep llama.cpp running after reboot

For a permanent setup, put the llama.cpp startup script and logs under a folder in your home directory and let launchd restart it when you sign in:

```shell
llm_home="$HOME/pulumi-gemma4-llm"
mkdir -p "$llm_home/logs" "$HOME/Library/LaunchAgents"

cat > "$llm_home/start-llama-server.sh" <<'EOF'
#!/bin/zsh
set -euo pipefail

export PATH="/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"

exec "$HOME/pulumi-gemma4-llm/llama.cpp/build/bin/llama-server" \
  --hf-repo unsloth/gemma-4-12b-it-GGUF \
  --hf-file gemma-4-12b-it-Q8_0.gguf \
  --mmproj "$HOME/pulumi-gemma4-llm/models/mmproj-F16.gguf" \
  --host 127.0.0.1 \
  --port 18080 \
  --ctx-size 131072 \
  --parallel 1 \
  --jinja \
  --reasoning off
EOF

chmod +x "$llm_home/start-llama-server.sh"

cat > "$HOME/Library/LaunchAgents/com.pulumi.gemma4.llama-server.plist" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.pulumi.gemma4.llama-server</string>
  <key>ProgramArguments</key>
  <array>
    <string>$llm_home/start-llama-server.sh</string>
  </array>
  <key>WorkingDirectory</key>
  <string>$llm_home</string>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>$llm_home/logs/llama-server.out.log</string>
  <key>StandardErrorPath</key>
  <string>$llm_home/logs/llama-server.err.log</string>
</dict>
</plist>
EOF

launchctl bootout gui/$(id -u)/com.pulumi.gemma4.llama-server 2>/dev/null || true
launchctl bootstrap gui/$(id -u) "$HOME/Library/LaunchAgents/com.pulumi.gemma4.llama-server.plist"
launchctl kickstart -k gui/$(id -u)/com.pulumi.gemma4.llama-server
```
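The two heredocs above use different delimiters on purpose. The start script is written with a quoted delimiter, so `$HOME` stays literal in the file and is expanded when the script runs. The plist is written with an unquoted delimiter, so `$llm_home` is expanded while the file is written, which matters because launchd does not expand shell variables inside a plist. A minimal demonstration, using a made-up variable `demo_home`:

```shell
demo_home="/tmp/demo-llm"

# Quoted delimiter: the body is kept literally, so $demo_home is not expanded.
literal="$(cat <<'EOF'
dir=$demo_home
EOF
)"

# Unquoted delimiter: the shell expands $demo_home while writing the body.
expanded="$(cat <<EOF
dir=$demo_home
EOF
)"

printf '%s\n%s\n' "$literal" "$expanded"
# prints:
# dir=$demo_home
# dir=/tmp/demo-llm
```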

Check the launchd service and llama.cpp logs:

```shell
launchctl print gui/$(id -u)/com.pulumi.gemma4.llama-server
tail -f "$HOME/pulumi-gemma4-llm/logs/llama-server.err.log"
```

If you want to stop llama.cpp later, unload the launchd service:

```shell
launchctl bootout gui/$(id -u)/com.pulumi.gemma4.llama-server
```

Deploy Open WebUI with Pulumi and k3d

Now we’ll deploy Open WebUI into a local Kubernetes cluster. This provides a polished chat interface that connects to our host-native LLM.

First, install k3d if you haven’t already:

```shell
brew install k3d
```

Create a new cluster for this project:

```shell
k3d cluster create pulumi-gemma4-blog-qa
```

We’ll use the Pulumi program in pulumi/examples. This program defaults to runtimeMode=host, which creates a Kubernetes ExternalName service pointing to your host machine.

Why not run the LLM inside Kubernetes on this Mac? Pulumi can do that, and the example supports it with runtimeMode=cluster, but that path is meant for Linux hosts with NVIDIA or AMD GPU device plugins.

On macOS, llama.cpp enables Metal by default, and Metal acceleration is available to native macOS processes. k3d runs Linux containers through Docker Desktop, so those pods do not get direct access to the Mac’s Metal device. Docker’s own vLLM Metal announcement calls out the same boundary: Metal-backed inference runs natively on the host because there is no Metal GPU passthrough for containers. That is why this setup keeps inference host-native and lets Pulumi manage the Kubernetes UI, service wiring, and optional Tailscale access around it.

Clone the examples repo, navigate to the program directory, and initialize a new stack:

```shell
git clone https://github.com/pulumi/examples.git
cd examples/kubernetes-py-self-host-gemma4-llm
pulumi stack init gemma4-local
```

Configure the program to match your local setup:

```shell
pulumi config set hostLlmPort 18080
pulumi config set llmBaseUrl http://llm-server:18080/v1
```

The Kubernetes service named llm-server maps to host.k3d.internal. In our validation, a disposable k3d pod could reach the Mac’s llama.cpp API at http://llm-server:18080/v1/models after we restarted CoreDNS with this command:

```shell
kubectl rollout restart deployment coredns -n kube-system
```
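To repeat the connectivity check yourself once the llm-server service exists, you can start a throwaway pod that requests the model list through it. This is a sketch under assumptions: the pod name `llm-probe` and the `curlimages/curl` image are illustrative choices, both function names are made up, and the command assumes the service lives in your current namespace.

```shell
# Build the in-cluster URL for the llama.cpp model list.
llm_models_url() {
  local service="$1" port="$2"
  printf 'http://%s:%s/v1/models\n' "$service" "$port"
}

# Start a throwaway pod, request the model list through the service,
# and let kubectl delete the pod when the command finishes.
probe_llm_from_cluster() {
  kubectl run llm-probe --rm -i --restart=Never \
    --image=curlimages/curl --command -- \
    curl --silent --fail --max-time 10 "$(llm_models_url llm-server 18080)"
}
```

Call `probe_llm_from_cluster` with llama.cpp running on the host. A JSON model list means the pod reached the Mac.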

Run pulumi up to deploy Open WebUI and connect it to host-native llama.cpp:

```shell
pulumi up
```

In our validation environment, this command successfully reached the resource synthesis phase without requiring Tailscale credentials because Tailscale exposure is opt-in.

Access Open WebUI through Tailscale

Tailscale allows you to access your private Open WebUI instance from any device on your tailnet. Note that we only expose the web interface, not the raw LLM API, to keep the system secure.

The base Open WebUI deployment works without Tailscale credentials. To expose the web UI on your tailnet, enable Tailscale resources and provide an explicit api_key or OAuth/identity token:

```shell
pulumi config set enableTailscale true
pulumi config set tailscale:apiKey YOUR_API_KEY --secret
```

Once configured, Pulumi will create a Tailscale device or proxy that routes traffic to your Open WebUI service.
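Typing the key on the command line leaves it in your shell history. One alternative is to pipe it to `pulumi config set`, which reads the value from standard input when none is given as an argument. The sketch below assumes the key is stored in an environment variable named `TAILSCALE_API_KEY`; that name and both helper functions are choices made for this example.

```shell
# Fail with a message if the given value is empty.
# Arguments: a label for the message, then the value to check.
require_value() {
  local label="$1" value="$2"
  if [ -z "$value" ]; then
    printf 'Set %s before running this step.\n' "$label" >&2
    return 1
  fi
}

# Pipe the key to pulumi so it does not appear in shell history.
set_tailscale_key() {
  require_value TAILSCALE_API_KEY "${TAILSCALE_API_KEY:-}" || return 1
  printf '%s' "$TAILSCALE_API_KEY" | pulumi config set tailscale:apiKey --secret
}
```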

Use the model with Pi

Open WebUI gives you a browser-based chat interface, but local models are also useful from coding agents. Pi is the local coding agent used for this validation, and it can use the model running on your Mac through the same OpenAI-compatible llama.cpp endpoint. If you do not use Pi, treat this section as an example of how any OpenAI-compatible client can point at that endpoint.

For a fresh Pi config, create ~/.pi/agent/models.json with a local provider that points at the llama.cpp API:

```json
{
  "providers": {
    "local-llama": {
      "baseUrl": "http://127.0.0.1:18080/v1",
      "api": "openai-completions",
      "apiKey": "local",
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false
      },
      "models": [
        {
          "id": "unsloth/gemma-4-12b-it-GGUF",
          "name": "Gemma 4 12B Q8 (local llama.cpp)",
          "reasoning": false,
          "input": ["text"],
          "contextWindow": 131072,
          "maxTokens": 1024,
          "cost": {
            "input": 0,
            "output": 0,
            "cacheRead": 0,
            "cacheWrite": 0
          }
        }
      ]
    }
  }
}
```

Then set Pi to use that provider and model by default in ~/.pi/agent/settings.json:

```json
{
  "defaultProvider": "local-llama",
  "defaultModel": "unsloth/gemma-4-12b-it-GGUF",
  "defaultThinkingLevel": "off",
  "hideThinkingBlock": true
}
```

If you already have Pi configuration files, merge the local-llama provider and defaults into your existing JSON instead of replacing the files.

Pi connected to local Gemma 4 through llama.cpp

Advanced: Linux GPU in-cluster serving

If you’re running on a Linux host with an NVIDIA or AMD GPU, you can run the LLM directly inside the Kubernetes cluster. This requires the NVIDIA or ROCm device plugins.

The Pulumi program supports this through runtimeMode=cluster. In this mode, it deploys a LlmServer pod that manages the llama.cpp process within the cluster, using GPU resource requests to ensure hardware acceleration.

Cleanup

When you’re done, you can tear down the resources:

```shell
pulumi destroy
k3d cluster delete pulumi-gemma4-blog-qa
# If llama.cpp runs in a foreground terminal, stop it there with Ctrl+C.
# If you set up the launchd service, unload it instead. With KeepAlive
# enabled, launchd restarts a process that is only killed.
launchctl bootout gui/$(id -u)/com.pulumi.gemma4.llama-server 2>/dev/null || true
```

Fetched June 4, 2026