Build a Fully Local Voice Agent from Scratch: Whisper + LLM + Kokoro

Building a voice agent that actually responds to you in real time, with no cloud latency, no per-token bill, and no data leaving your machine, is now within reach for anyone with a modern laptop. This guide walks you through wiring three open-source tools into a complete voice pipeline: OpenAI Whisper for speech-to-text, a quantized local LLM (via Ollama or llama.cpp) for reasoning, and Kokoro TTS for expressive speech output. By the end, you will have a working voice agent built entirely from local components.

This is not a high-level overview. We are going line by line.


Why Go Fully Local?

The obvious answer is cost. A voice agent hitting GPT-4o Realtime can burn through $0.06 per minute of audio in and out. At any meaningful usage volume, that adds up fast. But cost is only part of the story.

Latency is the other half. Cloud round-trips add 300 to 800 ms before your LLM even starts thinking. With a local stack, STT, the LLM, and TTS all run on your own GPU or Apple Silicon, so there is no network hop at all. Properly tuned, you can hit end-to-end latency under 1.5 seconds on a MacBook M2 Pro.

Privacy matters too. Any voice assistant that routes audio to a third-party API is a liability for sensitive workflows, from legal intake forms to healthcare triage to private journaling tools.

If you have been watching your API bill creep up, running local AI is a real option now. The quality gap has closed dramatically in 2025 and 2026.

💡 Hardware Baseline
You need at least 16 GB of unified memory (Apple Silicon) or a GPU with 8 GB VRAM to run a 7B GGUF model comfortably alongside Whisper base. A 13B model needs a GPU with 16 GB VRAM or an M2 Pro/Max.

The Stack at a Glance

Component       | Tool                          | Role
Speech-to-Text  | Whisper (faster-whisper)      | Transcribes mic audio to text
Language Model  | Ollama + Mistral 7B / Llama 3 | Generates a text response
Text-to-Speech  | Kokoro TTS                    | Converts response to audio
Audio I/O       | PyAudio / sounddevice         | Captures mic, plays back audio

All four are free and open-source. The total one-time setup time is about 30 minutes.


Step 1: Install Dependencies

Start with a clean Python 3.11 virtual environment:

python3 -m venv voice-agent-env
source voice-agent-env/bin/activate
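pip install faster-whisper sounddevice numpy scipy requests
pip install kokoro soundfile  # provides the kokoro module imported in tts.py (the Kokoro README also asks for espeak-ng on the system), or clone from the Kokoro repo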

For the LLM layer, install Ollama. It handles model downloads and GGUF quantization selection, and it provides a clean local REST API:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral   # 4.1 GB download, Q4_K_M quantization

Mistral 7B at Q4_K_M is a good default. It fits in 8 GB VRAM and handles conversational tasks well. If you have more headroom, ollama pull llama3 gives noticeably better reasoning for a somewhat larger download and memory footprint.


Step 2: Speech-to-Text with faster-whisper

The original OpenAI Whisper is accurate but slow for real-time use. faster-whisper is a re-implementation on top of CTranslate2 that runs up to 4 times faster at the same accuracy while using less memory.

# stt.py
from faster_whisper import WhisperModel
import sounddevice as sd
import numpy as np

SAMPLE_RATE = 16000
DURATION = 5  # seconds to record

model = WhisperModel("base", device="auto", compute_type="int8")

def record_audio(duration=DURATION, sample_rate=SAMPLE_RATE):
    print("Listening...")
    audio = sd.rec(
        int(duration * sample_rate),
        samplerate=sample_rate,
        channels=1,
        dtype="float32"
    )
    sd.wait()
    return audio.flatten()

def transcribe(audio_array):
    segments, _ = model.transcribe(audio_array, language="en")
    return " ".join(seg.text for seg in segments).strip()

A few notes on this implementation. The device="auto" flag picks CUDA if available, then falls back to CPU. The int8 compute type halves memory use with minimal quality loss on Whisper base. For production use, you would replace the fixed-duration recording with voice activity detection (VAD), but for a tutorial this is clean and functional.

⚠️ VAD Note
Fixed-duration recording means the agent always waits the full 5 seconds before responding, even if you finish speaking after one. For a snappier feel, add Silero VAD: it detects when you stop speaking and triggers transcription immediately. We cover that in the latency tuning section below.

Step 3: Local LLM via Ollama

Ollama exposes a REST API at localhost:11434. The simplest integration is a direct HTTP call:

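# llm.py
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "mistral"
DEFAULT_SYSTEM_PROMPT = (
    "You are a helpful voice assistant. "
    "Keep responses concise, under 3 sentences."
)

def generate_response(user_text: str, system_prompt: str = "") -> str:
    payload = {
        "model": MODEL,
        "prompt": user_text,
        "system": system_prompt or DEFAULT_SYSTEM_PROMPT,
        "stream": False,  # wait for the complete response
    }
    # A timeout keeps the agent loop from hanging if Ollama is not running
    response = requests.post(OLLAMA_URL, json=payload, timeout=60)
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()["response"].strip()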

Setting "stream": False makes the call wait for the full response before returning. For lower perceived latency you can stream tokens and start TTS as soon as the first sentence is complete. That is the optimization that takes you from 3 seconds to under 1.5 seconds, and we cover it in the tuning section below.
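As a rough illustration of the streaming mode, here is a minimal sketch of reading a streamed reply. With "stream": True, Ollama sends newline-delimited JSON objects, each carrying a "response" fragment and a "done" flag. The function names iter_tokens and stream_response are our own, not part of any library.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON chunks."""
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the final chunk marks the end of the reply


def stream_response(prompt: str, model: str = "mistral",
                    url: str = "http://localhost:11434/api/generate") -> Iterator[str]:
    """Hypothetical helper: stream fragments from a local Ollama server."""
    import requests  # imported here so iter_tokens can be used without it

    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(url, json=payload, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        yield from iter_tokens(resp.iter_lines())
```

Keeping the parsing in iter_tokens separate from the HTTP call makes it easy to check the logic against canned chunks before pointing it at a live server.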

The system prompt is doing real work here. Voice responses need to be short. A 200-word answer sounds terrible as synthesized audio. Constraining the model to 2 to 3 sentences is the single most impactful prompt-level change you can make.

If you want to go deeper on building agents with LLM backends, the guide on building your first AI agent with Claude API covers tool use, memory, and more complex orchestration patterns that you can adapt to this local stack.


Step 4: Text-to-Speech with Kokoro

Kokoro is a lightweight, expressive TTS model that runs locally and produces natural-sounding output. It is one of the best options available for local voice work in 2026. For a full comparison of local TTS options, see our breakdown of the best expressive local AI TTS tools.

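# tts.py
from kokoro import KPipeline
import sounddevice as sd
import numpy as np

KOKORO_SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz audio

# Instantiate once at import time; "a" = American English
pipeline = KPipeline(lang_code="a")

def speak(text: str, voice: str = "af_heart", speed: float = 1.1):
    # The pipeline yields one (graphemes, phonemes, audio) tuple per text chunk
    for _, _, audio in pipeline(text, voice=voice, speed=speed):
        audio_np = np.asarray(audio, dtype=np.float32)
        sd.play(audio_np, samplerate=KOKORO_SAMPLE_RATE)
        sd.wait()  # block until this chunk finishes before playing the next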

Setting speed=1.1 nudges the output slightly faster than the default pace, which tends to suit short assistant replies. The af_heart voice is a good default American English voice.

Kokoro generates audio chunk by chunk via a generator. In the streaming version of this pipeline, you would start playing the first audio chunk while the LLM is still generating the second sentence, cutting perceived latency significantly.

Pros

  • Runs entirely offline, zero API costs
  • Kokoro produces noticeably more natural prosody than older local TTS
  • faster-whisper is accurate enough for most accents at base model size
  • Full control over system prompt, voice, and pipeline timing
  • No data leaves your machine, safe for sensitive use cases

Cons

  • First-time model downloads are 4 to 8 GB
  • 7B models can hallucinate on complex factual queries
  • Fixed-duration recording feels slow without VAD
  • No built-in multi-turn memory without extra scaffolding

Step 5: Wire the Pipeline Together

Now we connect all three components into a loop:

# voice_agent.py
from stt import record_audio, transcribe
from llm import generate_response
from tts import speak
import time

SYSTEM_PROMPT = (
    "You are a concise voice assistant. "
    "Always reply in 1 to 3 short sentences. "
    "Never use bullet points or markdown."
)

def run():
    print("Voice agent ready. Press Ctrl+C to stop.")
    while True:
        try:
            audio = record_audio(duration=5)
            user_text = transcribe(audio)

            if not user_text or len(user_text) < 3:
                print("(no speech detected, listening again...)")
                continue

            print(f"You: {user_text}")
            response = generate_response(user_text, SYSTEM_PROMPT)
            print(f"Agent: {response}")
            speak(response)

        except KeyboardInterrupt:
            print("\nShutting down.")
            break
        except Exception as e:
            print(f"Error: {e}")
            time.sleep(1)

if __name__ == "__main__":
    run()

Run it with:

python voice_agent.py

The loop records 5 seconds of audio, transcribes it, sends it to your local LLM, and speaks the response. Measured from the end of recording to the start of playback, total latency on an M2 Pro with Mistral 7B is roughly 2 to 3 seconds. Tuning gets it lower.
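Before tuning anything, it helps to know where the time actually goes. The sketch below is one simple way to time each stage of the loop. The timed helper and the timings dictionary are hypothetical names introduced here for illustration.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> seconds spent in the most recent run


@contextmanager
def timed(stage: str):
    """Record how long the enclosed block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


# Usage inside the agent loop (the functions come from the files above):
#   with timed("stt"):
#       user_text = transcribe(audio)
#   with timed("llm"):
#       response = generate_response(user_text, SYSTEM_PROMPT)
#   print({name: round(secs, 2) for name, secs in timings.items()})
```

If the LLM stage dominates, streaming and a smaller context window are the changes to try first. If STT dominates, a smaller Whisper model is the likelier win.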


Latency Tuning: From 3 Seconds to Under 1.5

1. Use VAD instead of fixed-duration recording. Silero VAD detects end-of-speech and triggers the transcription pipeline immediately, cutting average wait time by 1 to 2 seconds for short queries:

pip install silero-vad

Replace the record_audio function with a VAD-gated version that accumulates audio chunks and fires when silence is detected for 0.8 seconds.
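The end-of-speech logic can be sketched independently of the VAD model itself. The version below is a minimal sketch that takes any is_speech predicate, which in practice would wrap a Silero VAD call that is not shown here. The function name collect_utterance and the 0.032 second frame length (512 samples at 16 kHz) are assumptions for illustration, so check the frame size your VAD expects.

```python
from typing import Callable, Iterable, List


def collect_utterance(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    frame_seconds: float = 0.032,
    silence_seconds: float = 0.8,
) -> List[bytes]:
    """Collect frames from the first speech frame until trailing silence
    has lasted `silence_seconds`. Returns [] if no speech was seen."""
    max_silent = max(1, round(silence_seconds / frame_seconds))
    collected: List[bytes] = []
    silent_run = 0
    started = False
    for frame in frames:
        speech = is_speech(frame)
        if not started:
            if not speech:
                continue  # still waiting for the user to start talking
            started = True
        collected.append(frame)
        silent_run = 0 if speech else silent_run + 1
        if silent_run >= max_silent:
            break  # enough silence: the user has finished speaking
    if not started:
        return []
    # Drop the trailing silence so Whisper only sees the utterance
    return collected[: len(collected) - silent_run]
```

The frames iterable would come from a sounddevice input stream read in fixed-size blocks. Short pauses inside a sentence are kept, because the silence counter resets whenever speech resumes.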

2. Stream LLM output to TTS. Instead of waiting for the full response, detect the first sentence boundary (period, question mark, exclamation mark) and pass it to Kokoro immediately while the LLM generates the rest. This creates an overlap between generation and playback.
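One possible shape for the sentence-boundary step is a small generator that buffers incoming text fragments and emits each sentence as soon as it is complete. This is a sketch under simple assumptions: iter_sentences is a hypothetical name, and the boundary rule (punctuation followed by whitespace) will split incorrectly on abbreviations such as "Dr. Smith".

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed text fragments into complete sentences."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream


# Usage: for sentence in iter_sentences(token_stream): speak(sentence)
```

Because the rule requires whitespace after the punctuation, decimals like 3.5 stay intact, and the final sentence is only emitted once the stream ends.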

3. Preload all models at startup. Whisper and Kokoro have initialization overhead. Import them at the top of the script and instantiate once. Do not lazy-load inside the loop.

4. Use a smaller Whisper model. If your input is clean, tiny.en transcribes in under 200 ms on CPU. The accuracy tradeoff is acceptable for native English speakers in low-noise environments.

5. Pin the LLM context window. Set num_ctx to 512 for a voice agent, either with PARAMETER num_ctx 512 in a Modelfile or through the options field of the API request. You do not need 4096 tokens of context for 2 to 3 sentence exchanges, and a smaller context window reduces memory use and can lower first-token latency.
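For the context window setting in particular, the per-request route looks roughly like this. The sketch assumes the same /api/generate payload used in llm.py, and build_payload is a hypothetical helper name.

```python
def build_payload(prompt: str, system: str = "", num_ctx: int = 512) -> dict:
    """Build an Ollama /api/generate payload with a pinned context window."""
    return {
        "model": "mistral",
        "prompt": prompt,
        "system": system,
        "stream": False,
        # Runtime options override the model's defaults for this request only
        "options": {"num_ctx": num_ctx},
    }
```

Keep in mind that the system prompt and any conversation history count against this budget, so 512 tokens may be too tight once multi-turn memory is added.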


Adding Multi-Turn Memory

The pipeline above is stateless. Each query starts fresh. For a real assistant, you want the model to remember context across turns. The simplest approach is a rolling message buffer:

from llm import generate_response

MAX_MESSAGES = 6  # three user/assistant exchanges
history = []

def generate_response_with_memory(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})

    # Build the prompt from the most recent messages only
    context = "\n".join(
        f"{m['role'].capitalize()}: {m['content']}"
        for m in history[-MAX_MESSAGES:]
    )
    full_prompt = f"{context}\nAssistant:"

    response = generate_response(full_prompt)
    history.append({"role": "assistant", "content": response})
    del history[:-MAX_MESSAGES]  # keep the buffer from growing without bound
    return response

Six messages (three exchanges) of context is usually enough for a conversational assistant and keeps the LLM prompt short enough to avoid latency spikes. For more sophisticated memory patterns, including episodic memory and retrieval-augmented generation, the RAG vs fine-tuning breakdown covers when each approach makes sense.
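An alternative to flattening the history into one prompt string is Ollama's /api/chat endpoint, which accepts a list of role-tagged messages directly. The sketch below shows that variant with a bounded buffer. ChatMemory and send_to_ollama are hypothetical names, and the transport is passed in as a function so the buffer logic can be exercised without a running server.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Message = Dict[str, str]


class ChatMemory:
    """Rolling buffer of recent messages plus a fixed system prompt."""

    def __init__(self, system_prompt: str, max_messages: int = 6):
        self.system: Message = {"role": "system", "content": system_prompt}
        # deque(maxlen=...) silently drops the oldest message when full
        self.turns: Deque[Message] = deque(maxlen=max_messages)

    def messages(self) -> List[Message]:
        return [self.system, *self.turns]

    def ask(self, user_text: str, send: Callable[[List[Message]], str]) -> str:
        self.turns.append({"role": "user", "content": user_text})
        reply = send(self.messages())
        self.turns.append({"role": "assistant", "content": reply})
        return reply


def send_to_ollama(messages: List[Message], model: str = "mistral",
                   url: str = "http://localhost:11434/api/chat") -> str:
    """Send the message list to a local Ollama server and return the reply."""
    import requests  # imported here so ChatMemory works without it

    body = {"model": model, "messages": messages, "stream": False}
    resp = requests.post(url, json=body, timeout=60)
    resp.raise_for_status()
    return resp.json()["message"]["content"].strip()


# Usage: memory = ChatMemory(SYSTEM_PROMPT); memory.ask(user_text, send_to_ollama)
```

The system prompt lives outside the deque, so it is never evicted no matter how long the conversation runs.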


Deploying as a Background Service

Once the pipeline is stable, you probably want it running automatically. On macOS, a LaunchAgent plist file handles this cleanly. On Linux, a systemd service does the same job. The key is ensuring the venv Python path is hardcoded in the service definition rather than relying on shell PATH resolution.

For remote deployment or headless servers, swap sounddevice for a WebSocket audio stream. Tools like Replit can host the LLM + TTS half of the pipeline while the STT runs client-side, splitting the compute across environments if local GPU is limited.

💡 Production Tip
Wrap the Ollama call in a try/except with retries and exponential backoff. Local models occasionally time out on first load, while the weights are still being read into memory. Starting with a 2-second wait and doubling it on each retry handles most cold-start failures without crashing the agent loop.
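A retry wrapper along these lines is only a few lines of code. The sketch below is one possible version: with_backoff is a hypothetical helper, and it catches every exception for brevity, whereas a real deployment would probably narrow that to timeout and connection errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on failure with delays of 2 s, 4 s, 8 s, ..."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("retries must be >= 0")


# Usage: response = with_backoff(lambda: generate_response(user_text, SYSTEM_PROMPT))
```

The sleep function is injectable so the retry schedule can be checked without actually waiting.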

What to Build Next

Once the core pipeline is working, there are several natural directions to extend it:

Tool use. Give the LLM access to functions like web search, calendar lookup, or home automation commands. Structure the system prompt to output JSON when a tool call is needed, parse it, execute the function, and feed the result back into the next LLM turn.
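The parse-and-execute step can be sketched as follows. The JSON shape ({"tool": ..., "args": {...}}), the TOOLS registry, and the get_time placeholder are all assumptions made for illustration. The format you instruct the model to emit in the system prompt is up to you.

```python
import json
from typing import Any, Callable, Dict, Optional

# Hypothetical registry: tool name -> Python function
TOOLS: Dict[str, Callable[..., Any]] = {
    "get_time": lambda city: f"12:00 in {city}",  # placeholder implementation
}


def try_tool_call(llm_output: str) -> Optional[str]:
    """Return the tool result if the output is a valid tool call, else None."""
    try:
        call = json.loads(llm_output)
    except json.JSONDecodeError:
        return None  # ordinary spoken reply, not a tool call
    if not isinstance(call, dict):
        return None
    name = call.get("tool")
    args = call.get("args", {})
    if not isinstance(name, str) or name not in TOOLS:
        return None  # unknown tools are never executed
    if not isinstance(args, dict):
        return None
    return str(TOOLS[name](**args))
```

When try_tool_call returns a result, feed it back to the model as the next prompt so it can phrase a spoken answer. When it returns None, speak the original output as usual.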

Wake word detection. Add an always-on wake word listener (Porcupine or OpenWakeWord) so the agent activates on a phrase like “hey agent” rather than recording continuously.

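Custom voices. Kokoro ships with a range of built-in voices that you can swap through the voice argument. As far as we can tell, the stock package does not offer cloning from a reference audio clip, so treat a fully custom voice as a separate project that may need a different model.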

Niche deployments. A local voice agent is a strong foundation for customer service bots, accessibility tools, and specialized workflows where cloud latency or data privacy rules out hosted options.


Conclusion

You now have a working, fully local voice agent built on three mature open-source components. Whisper handles the ears, a quantized LLM handles the brain, and Kokoro handles the voice. The stack costs nothing to run, keeps your data on-device, and with proper tuning hits latency low enough to feel genuinely conversational.

The gap between local and cloud AI voice quality has effectively closed for most use cases. The gap in cost and privacy has not.

Ready to go further? Check out the guide on building your first AI agent with Claude API for tool use patterns and orchestration techniques that translate directly to local stacks. Or explore what makes Kokoro the best local TTS option in 2026 if you want to fine-tune the voice layer further.

Start with the five-minute version. Get it speaking. Then optimize from there.

Our Verdict

A Whisper + local LLM + Kokoro pipeline is the most practical fully-offline voice agent stack available in 2026, with real-world latency under 1.5 seconds on consumer hardware and zero ongoing API costs.
