tiny-vllm: Build a High‑Performance LLM Inference Engine with C++ and CUDA

by yu3zhou4 · Llama

tiny‑vllm is a lightweight LLM inference engine in C++/CUDA that runs Llama 3.2 1B Instruct with static and continuous batching and PagedAttention; it was tested on Linux with CUDA 13.1.

What to do now

Clone the tiny‑vllm repo, install nlohmann/json 3.12.0, run CMake to build the server, execute ./test.sh to validate inference on your GPU, and submit any fixes as pull requests.

Summary

tiny-vllm is a lightweight LLM inference engine written in C++ and CUDA that aims to provide high‑performance inference for small models. The repository ships the full source code of an inference server together with a step‑by‑step course that walks through building the engine from scratch. The engine loads a Safetensors‑formatted model such as Llama 3.2 1B Instruct, runs a full forward pass with prefill and decode stages, and includes CUDA kernels for embeddings, RMSNorm, RoPE, GQA, SiLU, and softmax.
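To make the Safetensors loading step concrete, here is a minimal sketch of reading the header from an in-memory buffer. It relies only on the published file layout: an 8-byte little-endian length N, then N bytes of JSON, then the raw tensor bytes. The names SafetensorsHeader and read_safetensors_header are hypothetical and not taken from the tiny-vllm source. The sketch stops at extracting the JSON text, which a JSON library such as nlohmann/json could then parse.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical type (not from tiny-vllm): the JSON header text plus the byte
// offset at which the raw tensor data begins.
struct SafetensorsHeader {
    std::string json;         // header text, ready to hand to a JSON parser
    std::size_t data_offset;  // first byte of tensor data within the buffer
};

// Hypothetical helper: split a Safetensors buffer into header text and data offset.
inline SafetensorsHeader read_safetensors_header(const std::vector<std::uint8_t>& buf) {
    if (buf.size() < 8) {
        throw std::runtime_error("buffer too small for the 8-byte length prefix");
    }
    // Decode the little-endian u64 byte by byte so the result does not depend
    // on the endianness of the host.
    std::uint64_t n = 0;
    for (int i = 0; i < 8; ++i) {
        n |= static_cast<std::uint64_t>(buf[i]) << (8 * i);
    }
    if (n > buf.size() - 8) {
        throw std::runtime_error("header length exceeds buffer size");
    }
    SafetensorsHeader h;
    h.json.assign(reinterpret_cast<const char*>(buf.data()) + 8,
                  static_cast<std::size_t>(n));
    h.data_offset = 8 + static_cast<std::size_t>(n);
    assert(h.data_offset <= buf.size());
    return h;
}
```

Each tensor entry in the header carries a data_offsets pair that is relative to data_offset, so a loader would add the two before copying a tensor's bytes to GPU memory.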

The engine implements static and continuous batching, an online softmax routine, and a FlashAttention‑like PagedAttention with a paged KV cache to keep memory usage low. The course also covers how to handle the Safetensors file format, including parsing the header and tensor data, and how to load weights into GPU memory.

The development environment used for testing is Linux kernel 6.19.8, CUDA 13.1, GCC 15.2.1, and C++17, with the only external dependency being nlohmann/json 3.12.0. Users can build the project with CMake, run the provided ./test.sh script, and experiment with the inference pipeline on an NVIDIA RTX 5090. The author encourages contributors to fork the repo, adjust paths for their own setup, and submit pull requests to improve the engine.
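The online softmax idea can be illustrated with a CPU reference in plain C++. This is a sketch of the general technique, not the repository's CUDA kernel, and the function name online_softmax is hypothetical. It keeps a running maximum and a running sum of exponentials in a single pass over the input, rescaling the sum whenever the maximum grows. A second pass then writes the normalized outputs. Inputs are assumed to be finite.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// CPU reference for online softmax (hypothetical name, finite inputs assumed).
inline std::vector<float> online_softmax(const std::vector<float>& x) {
    assert(!x.empty());
    float m = -std::numeric_limits<float>::infinity();  // running maximum
    float d = 0.0f;                                     // running sum of exp(x_i - m)
    for (float v : x) {
        const float m_new = std::max(m, v);
        // Rescale the old sum to the new maximum, then add the current term.
        d = d * std::exp(m - m_new) + std::exp(v - m_new);
        m = m_new;
    }
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - m) / d;
    }
    return y;
}
```

Because the maximum and the sum are updated together, the statistics can be accumulated block by block. That property is what lets FlashAttention-style kernels process keys in tiles without materializing the full row of attention scores.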

Key changes

Supports Llama 3.2 1B Instruct via Safetensors format
Full forward pass with prefill and decode stages implemented in CUDA
Static and continuous batching mechanisms for parallel request handling
FlashAttention‑like PagedAttention with paged KV cache and online softmax
CUDA kernels for embeddings, RMSNorm, RoPE, GQA, SiLU, and softmax
Single‑token inference capability
Uses nlohmann/json 3.12.0, the only external dependency, to parse the JSON header of Safetensors files
Build environment: Linux 6.19.8, CUDA 13.1, GCC 15.2.1, C++17
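
The paged KV cache listed above can be pictured as a per-sequence block table that maps logical token positions to fixed-size physical blocks, allocated on demand. The following is a minimal host-side sketch of that bookkeeping under stated assumptions: the names BlockAllocator, Sequence, append_token, and free_sequence are hypothetical, the block size of 16 tokens is an illustrative choice, and none of it is taken from the tiny-vllm source.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

constexpr int kBlockSize = 16;  // tokens per physical block (assumed value)

// Hands out physical block ids from a free list (hypothetical class).
class BlockAllocator {
public:
    explicit BlockAllocator(int num_blocks) {
        // Push ids in reverse so that block 0 is allocated first.
        for (int b = num_blocks - 1; b >= 0; --b) free_.push_back(b);
    }
    int allocate() {
        if (free_.empty()) throw std::runtime_error("out of KV cache blocks");
        const int b = free_.back();
        free_.pop_back();
        return b;
    }
    void release(int block) { free_.push_back(block); }
    std::size_t free_count() const { return free_.size(); }
private:
    std::vector<int> free_;
};

struct Sequence {
    std::vector<int> block_table;  // logical block index -> physical block id
    int num_tokens = 0;
};

// Reserve the KV slot for one more token and return its flat index,
// computed as physical_block * kBlockSize + offset_in_block.
inline int append_token(Sequence& seq, BlockAllocator& alloc) {
    if (seq.num_tokens % kBlockSize == 0) {
        seq.block_table.push_back(alloc.allocate());  // last block is full or none exists
    }
    const int offset = seq.num_tokens % kBlockSize;
    const int slot = seq.block_table.back() * kBlockSize + offset;
    ++seq.num_tokens;
    return slot;
}

// Return all blocks of a finished sequence to the free list.
inline void free_sequence(Sequence& seq, BlockAllocator& alloc) {
    for (int b : seq.block_table) alloc.release(b);
    seq.block_table.clear();
    seq.num_tokens = 0;
}
```

In this sketch a sequence holds at most one partially filled block, so unused memory per sequence stays below kBlockSize tokens. Blocks freed by a finished request can be reused immediately by another, which is what makes the layout a natural fit for continuous batching.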
