Long-Video Audio Synthesis with Multi-Agent Collaboration
- URL: http://arxiv.org/abs/2503.10719v2
- Date: Mon, 17 Mar 2025 05:48:37 GMT
- Title: Long-Video Audio Synthesis with Multi-Agent Collaboration
- Authors: Yehang Zhang, Xinli Xu, Xiaojie Xu, Li Liu, Yingcong Chen
- Abstract summary: LVAS-Agent is a novel framework that emulates professional dubbing through collaborative role specialization. Our approach decomposes long-video synthesis into four steps: scene segmentation, script generation, sound design, and audio synthesis. Central innovations include a discussion-correction mechanism for scene/script refinement and a generation-retrieval loop for temporal-semantic alignment.
- Score: 20.332328741375363
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video-to-audio synthesis, which generates synchronized audio for visual content, critically enhances viewer immersion and narrative coherence in film and interactive media. However, video-to-audio dubbing for long-form content remains an unsolved challenge due to dynamic semantic shifts, temporal misalignment, and the absence of dedicated datasets. While existing methods excel on short videos, they falter in long scenarios (e.g., movies) due to fragmented synthesis and inadequate cross-scene consistency. We propose LVAS-Agent, a novel multi-agent framework that emulates professional dubbing workflows through collaborative role specialization. Our approach decomposes long-video synthesis into four steps: scene segmentation, script generation, sound design, and audio synthesis. Central innovations include a discussion-correction mechanism for scene/script refinement and a generation-retrieval loop for temporal-semantic alignment. To enable systematic evaluation, we introduce LVAS-Bench, the first benchmark with 207 professionally curated long videos spanning diverse scenarios. Experiments demonstrate superior audio-visual alignment over baseline methods. Project page: https://lvas-agent.github.io
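The four-step pipeline and its two loops map naturally onto a small orchestration sketch. The code below is a minimal illustration of that flow; every class and function is an illustrative stand-in (assumption), not the authors' implementation.

```python
# Minimal sketch of the four-step pipeline plus the two loops named in the
# abstract. Every class and function here is a hypothetical stand-in.
from dataclasses import dataclass

@dataclass
class Scene:
    start: float  # seconds
    end: float
    description: str

def segment_scenes(video: str) -> list[Scene]:           # step 1: scene segmentation
    return [Scene(0.0, 12.0, "city street, rain"), Scene(12.0, 30.0, "indoor cafe")]

def generate_script(scene: Scene) -> str:                # step 2: script generation
    return f"ambient sounds for: {scene.description}"

def discuss(scenes, scripts) -> list[str]:
    return []  # agents raise no objections in this toy run

def design_sound(script: str) -> dict:                   # step 3: sound design
    return {"events": script.split(": ")[1].split(", ")}

def generate_audio(design: dict, hint=None) -> dict:     # step 4: audio synthesis
    return {"events": design["events"], "hint": hint}

def aligned(audio: dict, scene: Scene) -> bool:
    # Generation-retrieval check: do the synthesized events match the scene?
    return all(e in scene.description for e in audio["events"])

def dub(video: str) -> list[dict]:
    scenes = segment_scenes(video)
    scripts = [generate_script(s) for s in scenes]
    for _ in range(3):                                   # discussion-correction rounds
        feedback = discuss(scenes, scripts)
        if not feedback:
            break
        scripts = [s + f" ({f})" for s, f in zip(scripts, feedback)]
    tracks = []
    for scene, script in zip(scenes, scripts):
        audio = generate_audio(design_sound(script))
        if not aligned(audio, scene):                    # retrieval-guided retry
            audio = generate_audio(design_sound(script), hint=scene.description)
        tracks.append(audio)
    return tracks

print(dub("movie.mp4"))
```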
Related papers
- Multimodal Long Video Modeling Based on Temporal Dynamic Context [13.979661295432964]
We propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC).
We segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders.
To handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments.
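The similarity-based segmentation step is easy to picture in isolation: split the frame sequence wherever adjacent-frame similarity drops. A minimal sketch, where the per-frame features and threshold are assumptions for illustration, not values from the paper:

```python
# Scene segmentation by inter-frame similarity: start a new scene whenever
# cosine similarity between consecutive frame embeddings falls below tau.
import numpy as np

def segment_by_similarity(frame_embs: np.ndarray, tau: float = 0.8) -> list[list[int]]:
    """frame_embs: (num_frames, dim) array of per-frame features."""
    # Normalize rows so the dot product is cosine similarity.
    normed = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sims = (normed[:-1] * normed[1:]).sum(axis=1)   # similarity of frames t and t+1
    scenes, current = [], [0]
    for t, sim in enumerate(sims, start=1):
        if sim < tau:            # similarity drop => scene boundary
            scenes.append(current)
            current = []
        current.append(t)
    scenes.append(current)
    return scenes

# Toy example: two visually distinct blocks of frames.
embs = np.vstack([np.tile([1.0, 0.0], (4, 1)), np.tile([0.0, 1.0], (3, 1))])
print(segment_by_similarity(embs))   # [[0, 1, 2, 3], [4, 5, 6]]
```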
arXiv Detail & Related papers (2025-04-14T17:34:06Z)
- WikiVideo: Article Generation from Multiple Videos [67.59430517160065]
We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple videos about real-world events.
We introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims.
We propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos.
arXiv Detail & Related papers (2025-04-01T16:22:15Z)
- ReelWave: A Multi-Agent Framework Toward Professional Movie Sound Generation [72.22243595269389]
Film production is an important application for generative audio, where richer context is provided through multiple scenes. We propose a multi-agent framework for audio generation inspired by the professional movie production process. Our framework can capture a richer context for audio generation, conditioned on video clips extracted from movies.
arXiv Detail & Related papers (2025-03-10T11:57:55Z)
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio.
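The division of labor this summary describes (an LLM planning, a pre-trained TTA diffusion model rendering) follows a simple agent pattern. A hedged sketch with hypothetical stand-in functions; `call_llm` and `tta_generate` are illustrative, not real APIs:

```python
# Agent pattern: an LLM decomposes a complex audio request into simple cues,
# and a text-to-audio (TTA) diffusion model renders each cue. Both helpers
# below are fake stand-ins so the sketch runs end to end.
def call_llm(instruction: str) -> list[str]:
    # In practice this would query GPT-4; here we hard-code a decomposition.
    return ["rain on a window", "distant thunder", "soft piano"]

def tta_generate(prompt: str) -> bytes:
    # Stand-in for a diffusion TTA model returning raw audio bytes.
    return f"<audio for: {prompt}>".encode()

def audio_agent(request: str) -> list[bytes]:
    cues = call_llm(f"Break this audio request into simple cues: {request}")
    return [tta_generate(c) for c in cues]

clips = audio_agent("a rainy evening scene with music")
print(len(clips), "clips generated")
```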
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound [6.638504164134713]
Foley sound synthesis is crucial for multimedia production, enhancing the user experience by synchronizing audio and video. Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges. We propose Video-Foley, a video-to-sound system that uses Root Mean Square (RMS) energy as an intuitive temporal condition together with semantic timbre prompts.
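RMS is a fully computable condition: the frame-wise root-mean-square energy of the waveform. A short sketch with assumed window and hop sizes (not the paper's settings):

```python
# Frame-wise RMS envelope of a mono waveform; values track loudness over time.
import numpy as np

def rms_envelope(wav: np.ndarray, frame: int = 1024, hop: int = 512) -> np.ndarray:
    n = 1 + max(0, len(wav) - frame) // hop
    return np.array([
        np.sqrt(np.mean(wav[i * hop : i * hop + frame] ** 2)) for i in range(n)
    ])

t = np.linspace(0, 1, 24000)
wav = np.sin(2 * np.pi * 440 * t) * np.linspace(0, 1, t.size)  # fade-in tone
print(rms_envelope(wav)[:5])  # envelope rises with the fade-in
```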
arXiv Detail & Related papers (2024-08-21T18:06:15Z)
- MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation [43.35578187209748]
Foley audio faces significant challenges in the AI-generated content (AIGC) landscape.
Current text-to-audio technology relies on detailed and acoustically relevant textual descriptions.
We introduce the Multi-modal Image and Narrative Text Dubbing (MINT) dataset.
MINT is designed to enhance mainstream dubbing tasks such as literary-story audiobook dubbing and image/silent-video dubbing.
arXiv Detail & Related papers (2024-06-15T10:47:36Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
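One plausible reading of "optimization-based" latent alignment is gradient guidance: at each denoising step, nudge a modality's latent toward a higher cross-modal alignment score. The sketch below illustrates only that generic pattern; the aligner, scales, and shapes are assumptions, not the paper's procedure.

```python
# Schematic gradient-guided denoising step: denoise the latent, then add a
# small step along the gradient of a cross-modal alignment score.
import torch

def guided_denoise_step(latent, denoiser, decode, align_score, other_modality,
                        guidance_scale: float = 0.1):
    latent = latent.detach().requires_grad_(True)
    score = align_score(decode(latent), other_modality)   # cross-modal similarity
    grad, = torch.autograd.grad(score, latent)
    with torch.no_grad():
        latent = denoiser(latent) + guidance_scale * grad  # denoise + align nudge
    return latent

# Toy stand-ins so the sketch runs end to end.
denoiser = lambda z: 0.9 * z
decode = lambda z: z.mean(dim=-1)
align_score = lambda a, v: -(a - v).pow(2).sum()           # higher = better aligned
z = torch.randn(1, 8, 16)
video_feat = torch.zeros(1, 8)
z = guided_denoise_step(z, denoiser, decode, align_score, video_feat)
print(z.shape)
```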
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
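The summary does not spell out the dual-contrastive losses, but the cross-modal half of such an objective typically looks like InfoNCE over matched transcript/frame pairs. A generic sketch under that assumption, not the paper's exact formulation:

```python
# InfoNCE-style cross-modal contrastive loss over paired text/frame embeddings.
import torch
import torch.nn.functional as F

def cross_modal_contrastive(text_emb, frame_emb, temperature: float = 0.07):
    """text_emb, frame_emb: (batch, dim); row i of each comes from the same segment."""
    text = F.normalize(text_emb, dim=1)
    frames = F.normalize(frame_emb, dim=1)
    logits = text @ frames.T / temperature        # pairwise similarities
    targets = torch.arange(text.size(0))          # matching rows are positives
    # Symmetric: text->frame and frame->text directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

loss = cross_modal_contrastive(torch.randn(4, 32), torch.randn(4, 32))
print(float(loss))
```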
arXiv Detail & Related papers (2023-11-30T21:59:05Z)
- STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment [61.83340833859382]
Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks.
This is a nontrivial problem that poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs, and multimodal correlation overwriting that forgets audio-video relations.
We propose a continual audio-video pre-training method with two novel ideas.
arXiv Detail & Related papers (2023-10-12T10:50:21Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
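The transfer recipe in this summary (pre-train an audio encoder-decoder, then reuse the decoder) reduces to a weight-initialization step. A toy sketch with assumed architectures, not the paper's models:

```python
# Pre-train an audio autoencoder, then copy its decoder weights into the
# video-to-speech model's audio decoder as initialization.
import torch.nn as nn

audio_autoencoder = nn.Sequential(
    nn.Linear(128, 64),   # encoder (toy)
    nn.Linear(64, 128),   # decoder (toy)
)
# ... pre-train audio_autoencoder on large unlabeled audio here ...

class VideoToSpeech(nn.Module):
    def __init__(self):
        super().__init__()
        self.video_encoder = nn.Linear(256, 64)
        self.audio_decoder = nn.Linear(64, 128)

    def forward(self, video_feats):
        return self.audio_decoder(self.video_encoder(video_feats))

model = VideoToSpeech()
# Initialize the decoder from the pre-trained audio decoder's weights.
model.audio_decoder.load_state_dict(audio_autoencoder[1].state_dict())
print(model.audio_decoder.weight.shape)
```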
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- AudioLM: a Language Modeling Approach to Audio Generation [59.19364975706805]
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.
We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure.
We demonstrate how our approach extends beyond speech by generating coherent piano music continuations.
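The core recipe this summary points at is language modeling over discrete audio tokens: tokenize the waveform, then train a next-token predictor. A toy setup below; the tokenizer and model are stand-ins, not AudioLM's actual coarse-to-fine stages.

```python
# Next-token language modeling over a codebook of discrete audio tokens.
import torch
import torch.nn as nn

VOCAB = 1024                                  # size of the audio-token codebook
tokens = torch.randint(0, VOCAB, (2, 50))     # pretend these came from a tokenizer

class TinyAudioLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 64)
        self.rnn = nn.GRU(64, 64, batch_first=True)
        self.head = nn.Linear(64, VOCAB)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)                   # logits over the next audio token

model = TinyAudioLM()
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1)
)
print(float(loss))
```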
arXiv Detail & Related papers (2022-09-07T13:40:08Z)
- Sound2Sight: Generating Visual Dynamics from Sound and Context [36.38300120482868]
We present Sound2Sight, a deep variational framework trained to learn a per-frame prior conditioned on a joint embedding of audio and past frames.
To improve the quality and coherence of the generated frames, we propose a multimodal discriminator.
Our experiments demonstrate that Sound2Sight significantly outperforms the state of the art in the generated video quality.
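The conditioning structure described here (a per-frame prior over a latent, conditioned on a joint audio/past-frame embedding) can be sketched as a small reparameterized module. All layers and dimensions below are toy assumptions, not the paper's architecture.

```python
# Per-frame Gaussian prior conditioned on a joint audio/past-frame embedding.
import torch
import torch.nn as nn

class FramePrior(nn.Module):
    def __init__(self, audio_dim=32, frame_dim=64, z_dim=16):
        super().__init__()
        self.joint = nn.Linear(audio_dim + frame_dim, 64)  # joint a/v embedding
        self.mu = nn.Linear(64, z_dim)
        self.logvar = nn.Linear(64, z_dim)

    def forward(self, audio_feat, past_frames_feat):
        h = torch.relu(self.joint(torch.cat([audio_feat, past_frames_feat], -1)))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return z  # latent that a frame decoder would render into the next frame

prior = FramePrior()
z = prior(torch.randn(1, 32), torch.randn(1, 64))
print(z.shape)   # torch.Size([1, 16])
```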
arXiv Detail & Related papers (2020-07-23T16:57:44Z)