tiny-vllm: Build a High‑Performance LLM Inference Engine with C++ and CUDA
tiny-vllm is a lightweight LLM inference engine in C++/CUDA. It supports Llama 3.2 1B Instruct with static and continuous batching and PagedAttention, and was tested on Linux with CUDA 13.1.
To try it: clone the tiny-vllm repo, install nlohmann/json 3.12.0, build the server with CMake, and run ./test.sh to validate inference on your GPU. Fixes are welcome as pull requests.
Summary
tiny-vllm is a lightweight LLM inference engine written in C++ and CUDA, aimed at high-performance inference for small models. The repository ships the full source of an inference server together with a step-by-step course that builds the engine from scratch. It loads a Safetensors-formatted model such as Llama 3.2 1B Instruct, runs a full forward pass with prefill and decode stages, and includes CUDA kernels for embeddings, RMSNorm, RoPE, GQA, SiLU, and softmax.
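Loading a Safetensors model starts with its header: the file begins with an 8-byte little-endian unsigned integer giving the length of a JSON header, followed by the JSON itself and then the raw tensor bytes. The sketch below shows that first step on an in-memory buffer. It is not taken from the tiny-vllm source; the names `SafetensorsHeader` and `read_safetensors_header` are hypothetical, and the JSON text is returned unparsed so that a library such as nlohmann/json can take over from there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper type: not part of tiny-vllm.
struct SafetensorsHeader {
    std::uint64_t json_len;   // size of the JSON header in bytes
    std::string json;         // raw, still unparsed JSON text
    std::size_t data_offset;  // byte offset where tensor data begins
};

// Split a Safetensors byte buffer into its length prefix and JSON header.
SafetensorsHeader read_safetensors_header(const std::vector<std::uint8_t>& buf) {
    if (buf.size() < 8) {
        throw std::runtime_error("buffer too small for the length prefix");
    }
    // Assemble the 64-bit length byte by byte so host endianness is irrelevant.
    std::uint64_t n = 0;
    for (int i = 0; i < 8; ++i) {
        n |= static_cast<std::uint64_t>(buf[i]) << (8 * i);
    }
    if (n > buf.size() - 8) {
        throw std::runtime_error("header length exceeds buffer size");
    }
    const std::size_t offset = 8 + static_cast<std::size_t>(n);
    assert(offset <= buf.size());
    std::string json(buf.begin() + 8,
                     buf.begin() + 8 + static_cast<std::ptrdiff_t>(n));
    return {n, json, offset};
}
```

Each entry in the JSON header describes one tensor (dtype, shape, and byte offsets into the data region), so `data_offset` is the base to which those offsets are added before the weights are copied to the GPU.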
The engine implements static and continuous batching, an online softmax routine, and a FlashAttention-like PagedAttention whose paged KV cache keeps memory usage low. It was developed and tested on Linux kernel 6.19.8 with CUDA 13.1, GCC 15.2.1, and C++17; the only external dependency is nlohmann/json 3.12.0.
The course also covers the Safetensors file format: parsing the header, reading the tensor data, and loading weights into GPU memory. Users can build the project with CMake, run the provided ./test.sh script, and experiment with the inference pipeline on an NVIDIA RTX 5090. The author encourages contributors to fork the repo, adjust paths for their own setup, and submit pull requests to improve the engine.
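The online softmax routine mentioned above can be illustrated on the CPU. A conventional softmax makes one pass to find the maximum and a second to sum the exponentials; the online variant keeps a running maximum and a running normalizer and rescales the normalizer whenever the maximum grows, so both are available after a single pass. The following is a minimal C++ reference sketch, not the project's CUDA kernel, and the function name `online_softmax` is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for online softmax (hypothetical name, not tiny-vllm code).
std::vector<float> online_softmax(const std::vector<float>& x) {
    assert(!x.empty());
    float m = x[0];  // running maximum seen so far
    float d = 1.0f;  // running sum of exp(x[j] - m); exp(x[0] - x[0]) == 1
    for (std::size_t i = 1; i < x.size(); ++i) {
        const float m_new = std::max(m, x[i]);
        // Rescale the old sum to the new maximum, then add the new term.
        d = d * std::exp(m - m_new) + std::exp(x[i] - m_new);
        m = m_new;
    }
    // Final normalization uses the maximum and sum from the single pass.
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - m) / d;
    }
    return y;
}
```

The same rescaling step is what lets FlashAttention-style kernels process keys block by block without first materializing the full row of attention scores.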
Key changes
- Supports Llama 3.2 1B Instruct via Safetensors format
- Full forward pass with prefill and decode stages implemented in CUDA
- Static and continuous batching mechanisms for parallel request handling
- FlashAttention‑like PagedAttention with paged KV cache and online softmax
- CUDA kernels for embeddings, RMSNorm, RoPE, GQA, SiLU, and softmax
- Single‑token inference capability
- Uses nlohmann/json 3.12.0, the only external dependency, for Safetensors parsing
- Tested environment: Linux kernel 6.19.8, CUDA 13.1, GCC 15.2.1, C++17
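The paged KV cache listed above can be pictured as a per-sequence block table that maps logical token positions to fixed-size physical blocks, with blocks handed out only when a sequence actually needs them. The sketch below covers only that bookkeeping in plain C++; it is an illustration under assumptions rather than the repository's implementation, and `BlockAllocator`, `Sequence`, `Slot`, and `append_token` are hypothetical names.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hands out physical KV-cache block ids from a fixed pool (hypothetical type).
struct BlockAllocator {
    std::vector<int> free_blocks;
    explicit BlockAllocator(int num_blocks) {
        // Push in reverse so that block 0 is allocated first.
        for (int b = num_blocks - 1; b >= 0; --b) free_blocks.push_back(b);
    }
    int allocate() {
        if (free_blocks.empty()) throw std::runtime_error("KV cache is full");
        const int b = free_blocks.back();
        free_blocks.pop_back();
        return b;
    }
    void release(int b) { free_blocks.push_back(b); }
};

// One request: block_table[i] is the physical block backing logical block i.
struct Sequence {
    std::vector<int> block_table;
    int num_tokens = 0;
};

// Physical location of one token's K/V entries.
struct Slot {
    int block;
    int offset;
};

// Reserve a slot for the next token, allocating a new block only when the
// current one is full (or when the sequence has no block yet).
Slot append_token(Sequence& seq, BlockAllocator& alloc, int block_size) {
    assert(block_size > 0);
    if (seq.num_tokens % block_size == 0) {
        seq.block_table.push_back(alloc.allocate());
    }
    const Slot s{seq.block_table[seq.num_tokens / block_size],
                 seq.num_tokens % block_size};
    ++seq.num_tokens;
    return s;
}
```

Because a sequence holds at most one partially filled block, the memory wasted per request is bounded by the block size, and blocks released by a finished request can be reused immediately by newly admitted ones, which is what makes this layout a natural fit for continuous batching.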