Back to Articles Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models

OpenAI

Deploy: Load Nemotron‑Labs Diffusion 8B model in SGLang and run inference in self‑speculation mode to achieve up to 6× speed.

What to do now

Deploy: Load Nemotron‑Labs Diffusion 8B model in SGLang and run inference in self‑speculation mode to achieve up to 6× speed.

Summary

On May 23 2026 NVIDIA released the Nemotron‑Labs Diffusion family, a set of diffusion language models that generate tokens in parallel and refine them over multiple steps. The collection includes 3B, 8B, and 14B scale models, each with instruction‑tuned chat variants, and is available under the NVIDIA Nemotron Open Model License. Nemotron‑Labs Diffusion supports three inference modes: standard autoregressive, diffusion drafting, and self‑speculation, the latter combining diffusion drafting with autoregressive verification. In benchmark tests, the 8B diffusion mode achieves 2.6× higher tokens‑per‑forward‑pass than its autoregressive counterpart, while self‑speculation boosts throughput to 6× with comparable accuracy. The models were jointly trained on 1.3 T pretraining tokens and fine‑tuned on 45 B supervised tokens, preserving the original AR capabilities while adding parallel decoding. Deployment is supported via SGLang, vLLM, Hugging Face Inference Endpoints, and any OpenAI‑compatible provider, all without code changes to switch modes. The diffusion objective uses a block‑wise attention mechanism that remains KV‑cache friendly, enabling efficient GPU utilization. This release gives developers a single checkpoint that can be used for both high‑accuracy autoregressive inference and ultra‑fast diffusion inference.

Key changes

Three generation modes: Autoregressive, Diffusion, Self‑speculation
Diffusion mode 2.6× higher TPF than AR, self‑speculation 6×
Models at 3B, 8B, 14B scales with instruction‑tuned chat variants
Joint AR and diffusion objective preserves AR capabilities
Training on 1.3 T pretraining tokens, 45 B supervised tokens
Supports inference via SGLang, vLLM, Hugging Face Inference Endpoints, OpenAI‑compatible providers
Self‑speculation drafts bidirectionally then verifies with AR
No code changes required to switch modes

Affects

enterprise

Story evolution

Customer impact

Analyzing matches…

Ask about this story

Impact on an agency? Which customers? Compare historically Risks of waiting