Reachy Mini goes fully local
Deploy: Run speech‑to‑speech locally with llama.cpp and Gemma 4, then connect Reachy Mini to the local backend.
Summary
On May 27, 2026, the Reachy Mini community released a guide to running the entire speech‑to‑speech stack locally, with no cloud APIs required. The stack is built on the open‑source speech‑to‑speech library, which exposes a Realtime API‑compatible /v1/realtime WebSocket and chains a cascaded VAD → STT → LLM → TTS pipeline. The recommended LLM backend is llama.cpp serving the Gemma 4 E4B‑it‑GGUF model, launched with `llama-server -hf ggml-org/gemma-4-E4B-it-GGUF -np 2 -c 65536 -fa on --swa-full`. For voice processing, the guide suggests Silero VAD v5 Tiny for voice activity detection, Parakeet‑TDT 0.6B v3 for STT, and Qwen3‑TTS for multilingual TTS. The speech‑to‑speech binary runs either in `--mode local`, which keeps all audio on the host, or in `--mode realtime`, which streams to the Reachy Mini. Because any component in the cascade can be swapped, developers can try newer models from the Hugging Face Hub each week. No audio leaves the machine, which preserves privacy, and there are no per‑minute API costs.
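The llama-server command quoted above can be unpacked flag by flag. The Python sketch below rebuilds the same argument list with a comment on each flag, and works out how much context each request slot would get. It assumes the server splits the `-c` budget evenly across the `-np` slots, which can differ between llama.cpp versions and cache settings. The helper names `build_llama_server_argv` and `context_per_slot` are illustrative and not part of any library.

```python
import shlex


def build_llama_server_argv(repo: str, parallel: int, ctx_size: int) -> list[str]:
    """Return the llama-server argument list quoted in the guide."""
    return [
        "llama-server",
        "-hf", repo,            # fetch the GGUF weights from this Hugging Face repo
        "-np", str(parallel),   # number of parallel request slots
        "-c", str(ctx_size),    # total context size in tokens
        "-fa", "on",            # enable flash attention
        "--swa-full",           # use a full-size sliding-window-attention cache
    ]


def context_per_slot(ctx_size: int, parallel: int) -> int:
    """Tokens available per slot, ASSUMING an even split across slots."""
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    return ctx_size // parallel


argv = build_llama_server_argv(
    "ggml-org/gemma-4-E4B-it-GGUF", parallel=2, ctx_size=65536
)
command = shlex.join(argv)  # shell-safe string form of the argument list
print(command)
print(context_per_slot(65536, 2))
```

Under the even-split assumption, two slots sharing 65536 tokens would each have 32768 tokens of context. Passing `argv` to `subprocess.run` would launch the server, provided `llama-server` is installed and on the `PATH`.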
Key changes
- Full local speech‑to‑speech stack via the speech‑to‑speech library, which exposes a /v1/realtime WebSocket
- Uses llama.cpp serving Gemma 4 E4B‑it‑GGUF with flags -hf, -np 2, -c 65536, -fa on, --swa-full
- Recommended components: Silero VAD v5 Tiny, Parakeet‑TDT 0.6B v3 STT, Qwen3‑TTS
- Supports multiple LLM backends: local llama.cpp, vLLM, Hugging Face Inference Endpoints, OpenAI‑compatible providers
- Two modes: --mode local for local inference, --mode realtime for streaming to robot
- Enables swapping any cascade component
- No API keys needed; no audio leaves the machine
- Privacy preserved and no per‑minute API costs
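To illustrate what a swappable cascade means in practice, here is a minimal Python sketch of a VAD → STT → LLM → TTS chain in which each stage is a plain callable. The `Cascade` class and the stage signatures are hypothetical and do not reflect the speech‑to‑speech library's actual interfaces. The stub lambdas stand in for real models such as Silero VAD, Parakeet‑TDT, Gemma 4, and Qwen3‑TTS.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional

# Hypothetical stage signatures; the real library's interfaces may differ.
VAD = Callable[[bytes], bool]   # True when the audio chunk contains speech
STT = Callable[[bytes], str]    # audio -> transcript
LLM = Callable[[str], str]      # transcript -> reply text
TTS = Callable[[str], bytes]    # reply text -> audio


@dataclass
class Cascade:
    vad: VAD
    stt: STT
    llm: LLM
    tts: TTS

    def respond(self, audio: bytes) -> Optional[bytes]:
        """Run one turn; return None when no speech is detected."""
        if not self.vad(audio):
            return None
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Stub stages: text bytes play the role of audio so the flow is easy to follow.
pipeline = Cascade(
    vad=lambda audio: len(audio) > 0,
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: text.upper(),
    tts=lambda text: text.encode("utf-8"),
)

# Swapping one stage leaves the other three untouched.
swapped = replace(pipeline, llm=lambda text: text[::-1])

print(pipeline.respond(b"hello"))
print(swapped.respond(b"hello"))
```

Because each stage depends only on the type of its input and output, replacing the LLM (or any other stage) does not require changes elsewhere in the chain, which is the property the cascade design relies on.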