Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation
Abstract Overview
Baton is a joint video-audio generation framework that adds an explicit semantic planning stage before diffusion-based synthesis. The method uses a VA-Planner, built on a multimodal language model with dual semantic alignment towers, to generate semantically aligned planned tokens for video and audio that act as keyframe-level blueprints. These planned tokens are injected into a dual-branch diffusion transformer, while a Relative Semantic RoPE mechanism aligns them with the diffusion latents even though the two live on mismatched spatial-temporal grids. Across benchmark and ablation studies, the paper argues that this explicit planning improves stability, prompt following, and cross-modal synchronization, especially for prompts that require multi-step semantic reasoning.
Novelty
The paper presents Baton as the first joint video-audio generation framework to explicitly disentangle semantic planning from synthesis. Its distinctive components are the VA-Planner for producing modality-aware yet mutually aligned planned tokens and Relative Semantic RoPE for aligning those semantic plans with heterogeneous video and audio diffusion latents.
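To make the grid-mismatch problem concrete, here is a minimal sketch of one way a coarse set of planned tokens and a finer set of diffusion latents could share a rotary position axis. It is not the paper's formulation of Relative Semantic RoPE: the cell-centre placement, the single temporal axis, the grid sizes (4 keyframe tokens, 16 latent frames), and all function names are assumptions chosen for illustration.

```python
import math

def relative_positions(n, span):
    """Place n grid points at the cell centres of a shared axis [0, span)."""
    return [(i + 0.5) * span / n for i in range(n)]

def rope_rotate(vec, pos, base=10000.0):
    """Apply a standard 1D rotary embedding at a (possibly fractional) position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest_plan_index(lat_pos, sem_pos):
    """For each latent position, the index of the closest planned token."""
    return [min(range(len(sem_pos)), key=lambda j: abs(p - sem_pos[j]))
            for p in lat_pos]

# Hypothetical sizes: 4 planned keyframe tokens, 16 latent frames.
SPAN = 16.0
sem_pos = relative_positions(4, SPAN)    # coarse grid on the shared axis
lat_pos = relative_positions(16, SPAN)   # fine grid on the same axis
assignment = nearest_plan_index(lat_pos, sem_pos)

# Attention score between a latent query and a planned-token key.
q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]
score = dot(rope_rotate(q, lat_pos[5]), rope_rotate(k, sem_pos[1]))
# Shifting both positions by the same offset leaves the score unchanged,
# because rotary embeddings depend only on the position difference.
shifted = dot(rope_rotate(q, lat_pos[5] + 3.0), rope_rotate(k, sem_pos[1] + 3.0))
```

Under these assumptions, each planned token sits at the centre of the block of latent frames it covers, so a latent attends to its blueprint token at a small relative offset regardless of how many frames either grid has.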
Results
On Verse-Bench, which uses simpler prompts, Baton is reported to perform comparably to strong open-source baselines; on the more complex Sem100 benchmark it shows clearer gains. In particular, the paper reports improvements over LTX-2 on Sem100 of 32% in prompt accuracy, 76% in multi-speaker word error rate, and 30% in DeSync. Qualitative results and user studies also indicate more stable and better-synchronized outputs in complex scenes.
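The summary does not say how these percentages are computed. A common convention, assumed here, is relative change against the baseline, with the sign flipped for error-type metrics such as word error rate where lower is better. The helper name and the numbers in the example are illustrative placeholders, not values from the paper.

```python
def relative_improvement(baseline, candidate, higher_is_better):
    """Relative change versus the baseline, signed so that positive means better."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (candidate - baseline) / abs(baseline)
    return change if higher_is_better else -change

# Placeholder values only: an error rate dropping from 0.40 to 0.30
# and an accuracy rising from 0.50 to 0.60.
error_gain = relative_improvement(0.40, 0.30, higher_is_better=False)
accuracy_gain = relative_improvement(0.50, 0.60, higher_is_better=True)
```

Under this convention a reduction in an error metric and an increase in an accuracy metric both come out as positive improvements, which is how figures for prompt accuracy, word error rate, and DeSync can be listed side by side.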
Key Points
- Baton introduces explicit semantic blueprints for joint video-audio generation by planning aligned video and audio tokens before denoising.
- The method combines a multimodal VA-Planner with dual semantic alignment towers and Relative Semantic RoPE to connect planned semantics to diffusion generation.
- Empirical results suggest the approach is especially beneficial for semantically complex prompts involving sequential actions, human-object interactions, and multi-speaker dialogue.