Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models
Deploy: Load the Nemotron‑Labs Diffusion 8B model in SGLang and run inference in self‑speculation mode to achieve up to 6× higher throughput.
Summary
On May 23, 2026, NVIDIA released the Nemotron‑Labs Diffusion family, a set of diffusion language models that generate tokens in parallel and refine them over multiple steps. The collection includes models at the 3B, 8B, and 14B scales, each with an instruction‑tuned chat variant, and is available under the NVIDIA Nemotron Open Model License. Nemotron‑Labs Diffusion supports three inference modes: standard autoregressive (AR), diffusion, and self‑speculation, which combines diffusion drafting with autoregressive verification. In benchmark tests, the 8B model in diffusion mode achieves 2.6× higher tokens per forward pass (TPF) than its autoregressive counterpart, while self‑speculation raises this to 6× with comparable accuracy. The models were trained with a joint AR and diffusion objective on 1.3 trillion pretraining tokens and fine‑tuned on 45 billion supervised tokens, which preserves the original AR capabilities while adding parallel decoding. Deployment is supported via SGLang, vLLM, Hugging Face Inference Endpoints, and any OpenAI‑compatible provider, and switching modes requires no code changes. The diffusion objective uses a block‑wise attention mechanism that remains KV‑cache friendly, enabling efficient GPU utilization. This release gives developers a single checkpoint that serves both high‑accuracy autoregressive inference and fast parallel diffusion inference.
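To make the tokens-per-forward-pass (TPF) figures concrete, the following is a small arithmetic sketch. It assumes that standard autoregressive decoding emits exactly one token per forward pass and treats the quoted 2.6× and 6× as average TPF values. The 512-token response length and the helper name `forward_passes` are illustrative choices, not part of the release.

```python
import math


def forward_passes(num_tokens: int, tpf: float) -> int:
    """Forward passes needed to emit num_tokens at an average of tpf tokens per pass."""
    if tpf <= 0:
        raise ValueError("tpf must be positive")
    return math.ceil(num_tokens / tpf)


# Assumed average TPF per mode, taken from the figures quoted in the summary.
# Autoregressive decoding is the baseline: one token per forward pass.
MODES = {
    "autoregressive": 1.0,
    "diffusion": 2.6,
    "self-speculation": 6.0,
}

if __name__ == "__main__":
    response_length = 512  # illustrative response length, not from the source
    for mode, tpf in MODES.items():
        passes = forward_passes(response_length, tpf)
        print(f"{mode}: {passes} forward passes for {response_length} tokens")
```

Forward-pass count is only a proxy for latency: a pass that scores a whole block of tokens may cost more than a single-token AR pass, so wall-clock gains depend on hardware and batch size.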
Key changes
- Three generation modes: Autoregressive, Diffusion, Self‑speculation
- Diffusion mode reaches 2.6× higher tokens per forward pass (TPF) than AR; self‑speculation reaches 6×
- Models at 3B, 8B, 14B scales with instruction‑tuned chat variants
- Joint AR and diffusion objective preserves AR capabilities
- Trained on 1.3 trillion pretraining tokens and 45 billion supervised tokens
- Supports inference via SGLang, vLLM, Hugging Face Inference Endpoints, OpenAI‑compatible providers
- Self‑speculation drafts tokens with bidirectional diffusion, then verifies them autoregressively
- No code changes required to switch modes
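
The draft-then-verify idea behind self-speculation can be sketched with toy stand-ins for the model. This is a generic greedy speculative-decoding loop, not NVIDIA's implementation: `toy_draft_block` and `toy_ar_next` are hypothetical functions that play the roles of the diffusion drafter and the autoregressive verifier, and the acceptance rule (keep the longest prefix the verifier agrees with, then take the verifier's own token) is an assumption about how verification could work.

```python
from typing import Callable, List, Tuple

Token = int


def self_speculate(
    prompt: List[Token],
    draft_block: Callable[[List[Token], int], List[Token]],
    ar_next: Callable[[List[Token]], Token],
    block_size: int,
    max_new_tokens: int,
) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify loop. Returns (new tokens, number of rounds)."""
    out = list(prompt)
    rounds = 0  # each round stands for one draft pass plus one verify pass
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)
        rounds += 1
        accepted: List[Token] = []
        # A real verifier would score every drafted position in a single causal
        # forward pass; calling ar_next per position emulates that here.
        for tok in draft:
            expected = ar_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # keep the verifier's token, drop the rest
                break
        else:
            # Whole draft accepted: the verify pass also yields one extra token.
            accepted.append(ar_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + max_new_tokens], rounds


def toy_ar_next(prefix: List[Token]) -> Token:
    """Hypothetical verifier: the next token is the previous token plus one, mod 10."""
    return (prefix[-1] + 1) % 10


def toy_draft_block(prefix: List[Token], k: int) -> List[Token]:
    """Hypothetical drafter: matches the verifier except at the third position."""
    block, cur = [], prefix[-1]
    for i in range(k):
        cur = (cur + 1) % 10
        block.append(cur if i != 2 else (cur + 5) % 10)
    return block


if __name__ == "__main__":
    tokens, rounds = self_speculate([0], toy_draft_block, toy_ar_next, 4, 9)
    print(tokens, rounds)
```

Because every emitted token is either confirmed or supplied by the verifier, the output of this loop matches plain greedy autoregressive decoding; the drafter only changes how many rounds are needed.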