Disclosure: This article contains Amazon affiliate links. As an Amazon Associate, AgentPlix earns from qualifying purchases. Links to Cursor may also be affiliated. All hardware recommendations reflect independent research and hands-on testing.

Local LLM Pair Programming: GPU Rig vs. MacBook Pro (Full Setup Guide)

Running a local LLM for coding is no longer a hobbyist experiment. It is a legitimate workflow used by developers who want zero-latency autocomplete, private codebases, and full control over the model. The only real question is: what hardware do you actually need? This guide walks through a complete local setup for coding on both a dedicated GPU rig and a MacBook Pro, then gives you a straight answer on which one makes more sense for your situation.

Why Local LLMs for Coding? (The Case Against Always-On Cloud)

Cloud-based coding assistants like GitHub Copilot and Cursor are excellent. But they come with tradeoffs that are starting to bother a growing number of developers:

  • Your code leaves your machine. Every autocomplete request sends context to a remote server. For proprietary codebases, that is a real compliance risk.
  • Latency is non-deterministic. Peak hours, API rate limits, and network hiccups all introduce lag that breaks flow state.
  • Monthly costs compound. At $20/month per developer, a five-person team pays $1,200/year before you factor in any API overages.
  • You cannot run the model offline. Coffee shop with spotty Wi-Fi? Traveling internationally? You are stuck.

A local setup eliminates all four of these problems. The tradeoff is upfront cost and some configuration work. This guide handles the configuration part.

đź’ˇ Who This Guide Is For
Developers who write code daily and want a pair-programming-style AI assistant that runs entirely on their own hardware. We assume you are comfortable with a terminal but have not necessarily set up a local model before.

The Two Hardware Paths: What You Are Actually Choosing Between

Before diving into setup commands, it helps to understand what each platform actually offers for LLM inference.

MacBook Pro (Apple Silicon)

Apple Silicon Macs use a unified memory architecture, meaning the CPU and GPU share the same memory pool. This is significant for LLMs because model weights need to fit in VRAM (or the equivalent) to run efficiently. On a discrete GPU, VRAM is the hard ceiling. On a MacBook Pro, the ceiling is unified memory, which scales up to 128GB on the M3 Max.

Practical upshot: a MacBook Pro M3 Max with 96GB of unified memory can run a quantized Qwen2.5-Coder 32B model entirely in memory with room to spare. That model is competitive with GPT-4o on coding tasks for most common workloads.

The Metal Performance Shaders (MPS) backend in tools like llama.cpp and Ollama allows the GPU cores on Apple Silicon to accelerate inference meaningfully. You are not getting RTX 4090-level throughput, but you are getting real GPU acceleration, not CPU-only inference.

Dedicated GPU Rig (NVIDIA)

A dedicated GPU rig uses NVIDIA CUDA, which has years of optimization for LLM inference. CUDA kernels in libraries like vllm, llama.cpp CUDA builds, and TensorRT-LLM are extremely well-tuned. An RTX 4090 with 24GB VRAM can run a quantized 32B model at roughly 3x the token-per-second throughput of an M3 Max at the same model size.

The catch: 24GB of VRAM is the ceiling for a single consumer card. Running a 70B model requires either two cards, significant quantization, or moving to a workstation-class card like the RTX 6000 Ada (48GB, ~$6,500).

A dual-RTX 4090 setup (48GB combined VRAM via tensor parallelism) is genuinely powerful, but the cost, noise, power draw, and desk footprint are real considerations.

Benchmark Reality Check: Tokens Per Second for Coding Workflows

Here is a representative performance table using Qwen2.5-Coder 32B Q4_K_M (a strong coding model that fits comfortably on both platforms):

Hardware VRAM / Unified RAM Tokens/sec (32B Q4_K_M) Max Model Size
MacBook Pro M3 Pro (36GB) 36GB unified ~9 tok/s 20B (comfortably)
MacBook Pro M3 Max (96GB) 96GB unified ~18 tok/s 32B (comfortably)
MacBook Pro M3 Max (128GB) 128GB unified ~20 tok/s 70B (quantized)
RTX 4090 (24GB VRAM) 24GB GDDR6X ~55 tok/s 32B (quantized)
Dual RTX 4090 (48GB VRAM) 48GB GDDR6X ~90 tok/s 70B (quantized)

For pair-programming use cases, 10-20 tok/s is genuinely usable. You get a full code completion or explanation in 3-8 seconds. Is it as fast as cloud? No. Is it fast enough to stay in flow? Yes, for most developers.

đź’ˇ The Flow-State Threshold
In practice, anything above ~8 tokens/second for a 32B model is fast enough for interactive coding. Below that, you start waiting and losing context. An M3 Pro with 36GB comfortably exceeds this threshold for models up to 20B.

Setting Up Your Local Coding LLM: The Fast Path (Ollama)

Ollama is the fastest and most reliable way to get a local model running for coding. It abstracts the llama.cpp backend, handles model downloads, manages a local API server, and works identically on both macOS (Apple Silicon) and Linux/Windows (CUDA). Install it once and treat it like a local API endpoint.

Step 1: Install Ollama

macOS:

brew install ollama

Or download the .dmg from the Ollama website. This installs the Ollama service and CLI.

Linux (CUDA):

curl -fsSL https://ollama.com/install.sh | sh

Ollama’s Linux installer automatically detects your NVIDIA GPU and uses the CUDA backend.

Step 2: Pull a Coding-Optimized Model

Qwen2.5-Coder is the current benchmark leader for a locally-runnable coding model. It outperforms older Code Llama variants on HumanEval and handles multi-file context well.

# For M3 Pro (36GB) or RTX 4090 with headroom for other apps
ollama pull qwen2.5-coder:14b

# For M3 Max (96GB) or RTX 4090 fully dedicated
ollama pull qwen2.5-coder:32b

# For dual RTX 4090 or M3 Max 128GB pushing limits
ollama pull qwen2.5-coder:72b

Ollama automatically pulls the Q4_K_M quantized version, which offers the best balance of size and quality for coding. You can specify quantization explicitly if you want Q8 for higher accuracy (at the cost of 2x the memory footprint).

Step 3: Verify Inference Speed

Before wiring this into your editor, confirm the model runs at a usable speed:

ollama run qwen2.5-coder:32b "Write a Python function that validates an email address using regex. Include edge case handling and docstring."

Watch the token output. If it feels responsive (8+ tok/s visually), you are good to proceed. If it is sluggish, drop to a smaller model.

Step 4: Configure Your Editor Integration

This is where “local model” becomes “local pair programmer.” You have two strong options:

Option A: Continue (VS Code Extension)

Continue is an open-source VS Code extension that connects to your Ollama server and provides GitHub Copilot-style inline completions plus a chat panel.

Install the Continue extension from the VS Code marketplace. Then open ~/.continue/config.json and point it at your local Ollama endpoint:

{
  "models": [
    {
      "title": "Qwen2.5-Coder 32B (Local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 14B (Autocomplete)",
    "provider": "ollama",
    "model": "qwen2.5-coder:14b",
    "apiBase": "http://localhost:11434"
  }
}

Using two models here is intentional: the 32B model handles chat and complex reasoning while the faster 14B model handles inline tab completions where latency matters more than depth.

Option B: Cursor with Local Model Override

Cursor supports custom model endpoints. Under Settings > Models > Add Model, you can point Cursor’s API calls to http://localhost:11434/v1 (Ollama’s OpenAI-compatible endpoint) and enter your model name. This lets you keep Cursor’s UX and keybindings while routing inference locally.

This is the best of both worlds for developers already invested in Cursor’s workflow.

GPU Rig: Linux Setup for Maximum Throughput

If you are on a dedicated Linux GPU rig, Ollama handles most of the complexity. But there are a few extra steps to get full performance.

CUDA Driver Setup

Before installing Ollama, ensure your NVIDIA drivers and CUDA toolkit are current:

# Check current driver version
nvidia-smi

# Install CUDA toolkit (Ubuntu example)
sudo apt install nvidia-cuda-toolkit

Ollama on Linux with CUDA requires CUDA 11.8 or higher. CUDA 12.x is preferred for RTX 40-series cards.

Enable GPU Layers (Automatic in Ollama)

Ollama on Linux with a detected GPU automatically offloads all model layers to VRAM. You can verify this with:

ollama run qwen2.5-coder:32b ""
# Then in another terminal:
ollama ps

The ollama ps command shows active models and their GPU memory usage. You should see your model fully loaded into VRAM with 100% GPU offload.

Running Ollama as a Server for Team Access

One advantage of a GPU rig that a MacBook cannot replicate: you can host it as a shared inference server for a small team. Set OLLAMA_HOST=0.0.0.0 to bind to all interfaces:

OLLAMA_HOST=0.0.0.0 ollama serve

Each developer on your network points their Continue or Cursor config to your server’s local IP instead of localhost:11434. One GPU rig with an RTX 4090 can comfortably serve 3-5 concurrent developers at usable speeds.

MacBook Pro Specific Optimizations

Apple Silicon inference has a few levers worth pulling.

Use the Metal Backend Correctly

Ollama on macOS uses llama.cpp with Metal acceleration automatically. You do not need to configure this. But you can verify it is active:

ollama run qwen2.5-coder:32b "" 2>&1 | grep -i metal

You should see Metal-related initialization output.

Manage Memory Pressure

The biggest risk on MacBook for LLM inference is memory pressure causing swapping to SSD. If you are running a 32B model on 96GB unified memory alongside a full browser and IDE, you might be cutting it close. Use Activity Monitor’s Memory tab to watch pressure. If it turns yellow or red, either close other apps or drop to a smaller model.

A practical rule: leave 20-30% of your unified memory free from application usage before loading a model. On a 96GB M3 Max, aim to keep non-model apps under ~25GB.

Quantization Tradeoffs for Apple Silicon

Quantization Model Size (32B) Speed (M3 Max) Code Quality
Q4_K_M ~18GB ~18 tok/s Strong
Q5_K_M ~22GB ~15 tok/s Very Strong
Q8_0 ~34GB ~11 tok/s Near FP16
F16 ~65GB ~6 tok/s Best

For most coding tasks, Q4_K_M or Q5_K_M hits the sweet spot. The quality difference between Q5_K_M and F16 on code generation tasks is measurable but small. The speed difference is significant.

Hardware Recommendations by Budget

MacBook Pro Path

  • Single device for everything (laptop, dev machine, LLM host)
  • No noise, no separate power circuit, no Linux driver headaches
  • M3 Max 96GB runs 32B models at usable speed
  • Battery-powered inference when traveling
  • Resale value holds well

GPU Rig Path

  • Higher throughput at the same model size (3x faster on 32B)
  • Can serve multiple developers simultaneously
  • Easier to run 70B+ models on dual-GPU configs
  • Upgradeable (swap GPU without replacing entire machine)
  • Better for fine-tuning or training runs

For budget guidance:

  • Under $2,500 (solo developer): MacBook Pro M3 Pro 36GB. Runs 14B models excellently, 32B with some patience.
  • $3,000-4,500 (power solo developer): MacBook Pro M3 Max 96GB. Runs 32B comfortably at 18+ tok/s. This is the best all-in-one for most developers.
  • $2,000-3,500 (dedicated GPU rig, used market): RTX 4090 24GB + mid-range workstation. Significantly faster on 32B, but requires a separate machine and Linux comfort.
  • $6,000+ (team server): Dual RTX 4090 or single RTX 6000 Ada (48GB). Serves 5-10 developers concurrently, runs 70B models.

If you want to explore the GPU rig path, a refurbished workstation with an RTX 4090 is the most common starting point. The card alone runs $1,800-2,200 new; the rest of the build is commodity parts.

Beyond Autocomplete: Pair Programming Workflows That Actually Work

A local LLM is most valuable when you build structured workflows around it, not just use it as a smarter tab-complete. Three patterns that work well:

Rubber Duck Debugging With Context. Paste an entire file plus your error into the chat panel. Ask the model to walk through what the code does, then identify where the bug might be. The act of getting a detailed walkback often surfaces the issue immediately.

Spec-to-Stub Generation. Write a comment block describing what a function should do, its inputs, outputs, and edge cases. Ask the model to generate the stub and a corresponding test file. You review, edit, and fill in business logic. This is faster than starting from scratch and gives you test coverage from the beginning.

Code Review Pass Before Commit. Before opening a pull request, run a diff through the local model with the prompt: “Review this diff for logic errors, missing error handling, and potential performance issues. Be specific.” A local 32B model does this surprisingly well and catches things you might miss after staring at code for hours.

Related reading: if you are interested in structuring your AI usage around planning versus execution phases, see our guide on the best LLM workflow for planning vs. execution. And if you are evaluating which hosted models to fall back on when you need more capability than local hardware can provide, our breakdown of ChatGPT vs. Gemini for image and code generation covers current frontier model tradeoffs.

Our Verdict

For most solo developers, a MacBook Pro M3 Max 96GB running Qwen2.5-Coder 32B via Ollama is the best local coding setup available today: fast enough for real pair-programming workflows, private, portable, and requires zero Linux configuration.

Conclusion: Start Local, Scale When Needed

The best local LLM coding setup is the one you actually use. Start with Ollama on whatever hardware you already have. If you have a MacBook Pro M-series with 36GB or more, pull qwen2.5-coder:14b today and wire it into Continue or Cursor. See how it changes your workflow.

If you hit the ceiling (usually: you want faster completions, you need to serve teammates, or you want to experiment with 70B-class models), that is when a dedicated GPU rig makes economic sense. Not before.

The local AI coding ecosystem has matured dramatically. Models like Qwen2.5-Coder 32B are genuinely competitive with GPT-4o on most everyday coding tasks. The infrastructure (Ollama, Continue, llama.cpp) is stable and actively maintained. There has never been a better time to cut the cord from cloud inference for your development workflow.

For more on building AI-powered development workflows, explore our guides on prompt engineering for coding assistants and the latest open-weight model releases on AgentPlix.