NAVA – 6.3B Joint Audio‑Video Generator with Align‑then‑Fuse MMDiT
Integrate the 6.3B NAVA model into your pipeline to generate synchronized audio‑video from a single prompt. Its Align‑then‑Fuse MMDiT aligns audio and video before fusing in text and speaker context, and it uses fewer parameters than comparable open‑source models.
Download the NAVA checkpoint, place it in your models directory, and update your workflow to use the Align‑then‑Fuse MMDiT node for synchronized audio‑video generation.
Summary
NAVA is a 6.3‑billion‑parameter joint audio‑video generator that synthesizes synchronized video and audio from a single prompt. It supports multi‑speaker speech with reference‑timbre control, as well as image‑conditioned continuations. Instead of post‑hoc alignment or a fully unified tri‑modal stack, NAVA uses an Align‑then‑Fuse MMDiT architecture: a dedicated alignment space first establishes audio‑video correspondence, and then context (text, speaker embeddings) is fused in via cross‑attention.
On the Verse‑Bench benchmark, NAVA sets new state‑of‑the‑art results on Sync‑C, Sync‑D, video quality, and audio WER with 2× to 5× fewer parameters than open‑source baselines. The smaller parameter count also means a lower memory footprint at inference than those larger multimodal models. The model is available on Hugging Face and GitHub, with a dedicated project page for documentation. The release includes pretrained checkpoints and guidance on integrating the model into existing pipelines.
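To make the Align‑then‑Fuse ordering described above concrete, here is a minimal NumPy sketch. It is not NAVA's implementation: it assumes the alignment step can be approximated by joint self‑attention over concatenated video and audio tokens, omits all learned projections, normalization, and timestep conditioning, and uses hypothetical names (`align_then_fuse`, `attention`) and toy shapes throughout.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention without learned projections (illustrative only).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def align_then_fuse(video, audio, context):
    # Step 1 (align): video and audio tokens attend to each other in one
    # shared sequence, so correspondence is set up before any context is seen.
    # ASSUMPTION: joint self-attention stands in for the "alignment space".
    av = np.concatenate([video, audio], axis=0)
    av = av + attention(av, av, av)

    # Step 2 (fuse): the aligned tokens query the context tokens
    # (e.g. text and speaker embeddings) via cross-attention.
    av = av + attention(av, context, context)

    # Split the shared sequence back into per-modality streams.
    n_video = video.shape[0]
    return av[:n_video], av[n_video:]

# Toy inputs: 8 video tokens, 12 audio tokens, 5 context tokens, width 16.
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 16))
audio = rng.standard_normal((12, 16))
context = rng.standard_normal((5, 16))

video_out, audio_out = align_then_fuse(video, audio, context)
print(video_out.shape, audio_out.shape)
```

The point of the sketch is the ordering: context enters only through the second, cross‑attention step, after the audio and video tokens have already exchanged information.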
NAVA is particularly suited to applications that need synchronized speech and video, such as virtual assistants, content creation, and interactive media.
Reference‑timbre control and image‑conditioned continuations let you tailor the output to a specific use case, and the project encourages experimenting with both.
Key changes
- 6.3‑billion‑parameter joint audio‑video generator
- Align‑then‑Fuse MMDiT architecture for alignment and cross‑attention fusion
- Sets new state‑of‑the‑art on Sync‑C, Sync‑D, video quality, and audio WER
- Uses 2× to 5× fewer parameters than open‑source baselines
- Supports multi‑speaker speech with reference‑timbre control
- Supports image‑conditioned continuations
- Requires only a single prompt as input
- Evaluated on the Verse‑Bench benchmark
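As a rough sanity check on the parameter figures in the list above, the arithmetic can be sketched as follows. It reads "2× to 5× fewer parameters" as meaning the baselines are 2 to 5 times larger than NAVA, and it assumes 16‑bit weights (2 bytes per parameter) for the memory estimate. Both are assumptions, and the estimate covers weights only, not activations or any auxiliary encoders. The function names are hypothetical.

```python
NAVA_PARAMS = 6.3e9  # parameter count stated in the release

def baseline_param_range(params, low_factor=2.0, high_factor=5.0):
    # Implied size range of the baselines if they are 2x to 5x larger.
    return params * low_factor, params * high_factor

def weight_memory_gib(params, bytes_per_param=2):
    # Memory for the weights alone, in GiB.
    # ASSUMPTION: 2 bytes per parameter (fp16/bf16 storage).
    return params * bytes_per_param / 2**30

low, high = baseline_param_range(NAVA_PARAMS)
nava_gib = weight_memory_gib(NAVA_PARAMS)

print(f"Implied baseline sizes: {low / 1e9:.1f}B to {high / 1e9:.1f}B parameters")
print(f"NAVA weights at 16-bit precision: about {nava_gib:.1f} GiB")
```

Under these assumptions the implied baselines fall between 12.6B and 31.5B parameters. Actual memory use depends on precision, resolution, clip length, and the other components loaded alongside the model.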