FuguReport

Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation

Authors: Shuyuan Tu, Qi Tian, Zihan Yang, Yue Wu, Xintong Han, Weijie Kong, Jiangfeng Xiong, Jian-Wei Zhang, Zhao Zhong, Liefeng Bo, Zuxuan Wu, Yu-Gang Jiang
Affiliations: Tencent / Fudan University
Categories: Method / Multimodal Generation / Joint video-audio generation framework; Evaluation / Benchmarking / Qualitative and quantitative effectiveness; Application / Semantic Planning / Semantically rich multimodal token guidance
License: CC BY 4.0

Abstract Overview

Baton is a joint video-audio generation framework that adds an explicit semantic planning stage before diffusion-based synthesis. Its VA-Planner, built on a multimodal language model with dual semantic alignment towers, generates semantically aligned planned tokens for video and audio that act as keyframe-level blueprints. These planned tokens are injected into a dual-branch diffusion transformer, and a Relative Semantic RoPE mechanism aligns them with the diffusion latents even though the two use mismatched spatio-temporal grids. Across benchmark and ablation studies, the paper argues that explicit planning improves stability, prompt following, and cross-modal synchronization, especially for prompts that require multi-step semantic reasoning.
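To make the grid-mismatch problem concrete, the following is a minimal sketch of one way positions on a coarse token grid and a finer latent grid could be placed on a shared axis before a standard rotary embedding is applied. It is not the paper's formulation of Relative Semantic RoPE: the cell-center mapping, the function names (relative_positions, rope_rotate), the one-dimensional setting, and the grid sizes are all assumptions chosen for illustration.

```python
import math


def relative_positions(n, ref_len):
    """Place n grid cells on a shared axis of length ref_len, using cell centers.

    Assumption for illustration only: the summary does not say how Baton
    maps planned tokens onto the latent grid.
    """
    return [(i + 0.5) / n * ref_len for i in range(n)]


def rope_rotate(vec, pos, base=10000.0):
    """Apply a standard 1D rotary embedding to an even-length vector at position pos."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for j in range(0, d, 2):
        theta = pos * base ** (-j / d)  # rotation angle for this coordinate pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[j], vec[j + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical sizes: 4 planned (keyframe-level) tokens vs. 16 latent frames.
planned_pos = relative_positions(4, ref_len=16)
latent_pos = relative_positions(16, ref_len=16)

query = rope_rotate([1.0, 2.0, 3.0, 4.0], latent_pos[2])   # a latent-frame query
key = rope_rotate([0.5, -1.0, 2.0, 0.25], planned_pos[0])  # a planned-token key
score = dot(query, key)  # depends only on the offset between the two positions
```

Under this assumed mapping, a planned token and the latent frames covering the same relative time span receive nearby rotary phases, so the attention score between them reflects their relative offset rather than their raw indices on grids of different sizes.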

Novelty

The paper presents Baton as the first joint video-audio generation framework to explicitly disentangle semantic planning from synthesis. Its distinctive components are the VA-Planner for producing modality-aware yet mutually aligned planned tokens and Relative Semantic RoPE for aligning those semantic plans with heterogeneous video and audio diffusion latents.

Results

On Verse-Bench, Baton is reported to perform comparably to strong open-source baselines on simpler prompts, while on the more complex Sem100 benchmark it shows clearer gains. In particular, the paper reports improvements over LTX-2 on Sem100 of 32% in prompt accuracy, 76% in multi-speaker word error rate, and 30% in DeSync. Qualitative results and user studies also indicate more stable and better synchronized outputs in complex scenes.
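The summary does not state whether these percentages are relative or absolute changes, nor does it give the underlying scores. As a reading aid only, the sketch below shows how a relative improvement is commonly computed for higher-is-better metrics (such as accuracy) and lower-is-better metrics (such as word error rate). The function name and the input values are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline, new, lower_is_better=False):
    """Return the relative improvement of `new` over `baseline`, in percent.

    For lower-is-better metrics (e.g. word error rate) a decrease counts
    as an improvement; for higher-is-better metrics an increase does.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (baseline - new) if lower_is_better else (new - baseline)
    return change / abs(baseline) * 100.0


# Hypothetical scores, not values from the paper.
wer_gain = relative_improvement(0.40, 0.10, lower_is_better=True)  # about 75
acc_gain = relative_improvement(0.50, 0.60)                        # about 20
```

Read this way, a large percentage on an error metric means the error shrank by that fraction of the baseline's error, not that the error fell by that many percentage points.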

Key Points

  1. Baton introduces explicit semantic blueprints for joint video-audio generation by planning aligned video and audio tokens before denoising.
  2. The method combines a multimodal VA-Planner with dual semantic alignment towers and Relative Semantic RoPE to connect planned semantics to diffusion generation.
  3. Empirical results suggest the approach is especially beneficial for semantically complex prompts involving sequential actions, human-object interactions, and multi-speaker dialogue.

