NAVA – 6.3B Joint Audio‑Video Generator with Align‑then‑Fuse MMDiT

by /u/AgeNo5351

Integrate the 6.3B NAVA model into your pipeline to generate synchronized audio and video from a single prompt. Its Align‑then‑Fuse MMDiT architecture targets tighter audio‑video alignment with fewer parameters than open‑source baselines.

What to do now

Download the NAVA checkpoint from Hugging Face or GitHub, place it in your models directory, and update your workflow to load it for synchronized audio‑video generation.
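As a quick check that the checkpoint landed where your workflow expects it, the sketch below lists checkpoint-like files in a models directory. The `find_checkpoints` helper, the `name_hint` default, and the file extensions are all assumptions for illustration; the NAVA release may ship its weights under different names or formats.

```python
from pathlib import Path

# Assumed extensions; the actual NAVA checkpoint format is not stated in the source.
CHECKPOINT_SUFFIXES = (".safetensors", ".ckpt", ".pt")


def find_checkpoints(models_dir, name_hint="nava"):
    """Return checkpoint-like files under models_dir whose name contains name_hint."""
    root = Path(models_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"models directory not found: {root}")
    matches = [
        path
        for path in root.rglob("*")  # search subfolders too
        if path.is_file()
        and path.suffix.lower() in CHECKPOINT_SUFFIXES
        and name_hint.lower() in path.name.lower()
    ]
    return sorted(matches)
```

An empty list from `find_checkpoints("models")` is a cheap signal that the download did not finish or that the path is wrong.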

Summary

NAVA is a 6.3‑billion‑parameter joint audio‑video generator that synthesizes synchronized video and audio from a single prompt. It supports multi‑speaker speech with reference‑timbre control and image‑conditioned continuations. Instead of post‑hoc alignment or a fully unified tri‑modal stack, NAVA uses an Align‑then‑Fuse MMDiT architecture: a dedicated alignment space first establishes audio‑video correspondence, then context (text, speaker embeddings) is fused in via cross‑attention.

On the Verse‑Bench benchmark, NAVA sets new state‑of‑the‑art results on Sync‑C, Sync‑D, video quality, and audio WER (word error rate) while using 2× to 5× fewer parameters than open‑source baselines. The smaller parameter count also means more efficient inference and a lower memory footprint than larger multimodal models.

The model is available on Hugging Face and GitHub, with a dedicated project page for documentation. The release includes pretrained checkpoints and guidance on integrating the model into existing pipelines.
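To make the align‑then‑fuse ordering concrete, here is a toy sketch in plain Python. It is not NAVA's implementation. It assumes that "alignment" can be approximated by joint self‑attention over the concatenated video and audio tokens, and it omits learned projections, multiple heads, normalization, and timestep conditioning. The function names and the residual connection are illustrative choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def align_then_fuse(video_tokens, audio_tokens, context_tokens):
    """Toy two-stage block: align audio and video first, then fuse in context."""
    # Stage 1 (align, assumed form): video and audio tokens attend to each
    # other in one shared sequence, so correspondence is set before context.
    joint = video_tokens + audio_tokens
    aligned = attend(joint, joint, joint)

    # Stage 2 (fuse): aligned tokens query the context tokens
    # (text and speaker embeddings) through cross-attention.
    context_update = attend(aligned, context_tokens, context_tokens)

    # Residual add (illustrative) so the aligned signal is kept after fusion.
    fused = [
        [a + c for a, c in zip(aligned_row, context_row)]
        for aligned_row, context_row in zip(aligned, context_update)
    ]
    n_video = len(video_tokens)
    return fused[:n_video], fused[n_video:]
```

The point of the ordering is that context tokens never take part in stage 1, so the audio‑video correspondence is established independently of the text and speaker conditioning.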

Users can leverage NAVA to generate high‑fidelity audio‑video content from textual prompts, benefiting from its efficient parameter usage and robust alignment mechanism. The model is particularly suited for applications requiring synchronized speech and video, such as virtual assistants, content creation, and interactive media.
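For a rough sense of what the parameter savings mean in practice, the sketch below estimates weight storage from parameter count. It covers weights only, not activations or any auxiliary encoders, and it reads "2× to 5× fewer parameters" as baselines being 2 to 5 times larger. The helper names and the dtype table are assumptions for illustration.

```python
NAVA_PARAMS = 6_300_000_000  # 6.3B, as stated in the summary

# Bytes per parameter for common storage dtypes.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def implied_baseline_params(num_params, factor):
    """Parameter count of a baseline that is `factor` times larger."""
    return num_params * factor


for dtype in ("fp32", "bf16", "int8"):
    print(f"NAVA weights in {dtype}: ~{weight_memory_gb(NAVA_PARAMS, dtype):.1f} GB")

for factor in (2, 5):
    baseline = implied_baseline_params(NAVA_PARAMS, factor)
    print(
        f"{factor}x baseline: {baseline / 1e9:.1f}B params, "
        f"~{weight_memory_gb(baseline, 'bf16'):.1f} GB in bf16"
    )
```

Under these assumptions the NAVA weights take about 12.6 GB in bf16, against roughly 25.2 GB to 63.0 GB for baselines 2× to 5× larger.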

The project encourages experimentation with reference‑timbre control and image‑conditioned continuations to further tailor output to specific use cases.

Key changes

6.3‑billion‑parameter joint audio‑video generator
Align‑then‑Fuse MMDiT architecture for alignment and cross‑attention fusion
Sets new state‑of‑the‑art on Sync‑C, Sync‑D, video quality, and audio WER
Uses 2× to 5× fewer parameters than open‑source baselines
Supports multi‑speaker speech with reference‑timbre control
Allows image‑conditioned continuations from a single prompt
Requires only a single prompt input
Evaluated on the Verse‑Bench benchmark
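Of the metrics listed above, audio WER (word error rate) is the easiest to reproduce by hand. The sketch below uses the standard definition: word‑level edit distance divided by the number of reference words. It is not the Verse‑Bench scoring pipeline, which would first need a transcript of the generated audio; that step and any text normalization are out of scope here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    # previous[j] is the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)
```

For example, scoring the hypothesis "the cat sat on mat" against the reference "the cat sat on the mat" gives one deletion over six reference words, a WER of about 0.167. Lower is better.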

Story evolution

Source angles · 2 perspectives

Black Forest Labs (Reddit)

Independent angle

Nava - A 6.3B audio-video model.

r/StableDiffusion

Independent angle

NAVA – 6.3B Joint Audio‑Video Generator with Align‑then‑Fuse MMDiT
