MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
- URL: http://arxiv.org/abs/2509.19999v2
- Date: Tue, 04 Nov 2025 09:35:36 GMT
- Title: MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
- Authors: Jianxuan Yang, Xiaoran Yang, Lipan Zhang, Xinyue Guo, Zhao Wang, Gongping Huang
- Abstract summary: Current video-to-audio (V2A) methods struggle in complex multi-event scenarios. This study proposes a novel V2A framework: MultiSoundGen. It introduces direct preference optimization (DPO) into the V2A domain.
- Score: 10.717164013707693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current video-to-audio (V2A) methods struggle in complex multi-event scenarios (video scenarios involving multiple sound sources, sound events, or transitions) due to two critical limitations. First, existing methods face challenges in precisely aligning intricate semantic information with rapid dynamic features. Second, foundational training lacks quantitative preference optimization for semantic-temporal alignment and audio quality, so existing methods fail to improve overall generation quality in cluttered multi-event scenes. To address these core limitations, this study proposes a novel V2A framework: MultiSoundGen. It introduces direct preference optimization (DPO) into the V2A domain, leveraging audio-visual pretraining (AVP) to enhance performance in complex multi-event scenarios. Our contributions include two key innovations. The first is SlowFast Contrastive AVP (SF-CAVP), a pioneering AVP model with a unified dual-stream architecture; it explicitly aligns the core semantic representations and rapid dynamic features of audio-visual data to handle multi-event complexity. The second is AVP-Ranked Preference Optimization (AVP-RPO), which integrates DPO into the V2A task and uses SF-CAVP as a reward model to quantify and prioritize critical semantic-temporal matches while enhancing audio quality. Experiments demonstrate that MultiSoundGen achieves state-of-the-art (SOTA) performance in multi-event scenarios, delivering comprehensive gains across distribution matching, audio quality, semantic alignment, and temporal synchronization. Demos are available at https://v2aresearch.github.io/MultiSoundGen/.
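The abstract gives no implementation details, but the core AVP-RPO idea, using a pretrained audio-visual alignment model to rank candidate audios and then applying a DPO-style preference loss, can be illustrated with a minimal PyTorch sketch. Everything below (the `av_reward` cosine score, the placeholder log-likelihoods, the tensor shapes) is a hypothetical stand-in, not the authors' released code.

```python
# Minimal illustrative sketch of reward-ranked preference optimization,
# assuming (hypothetically) that an SF-CAVP-style model scores audio-video
# alignment via cosine similarity and that the generator exposes per-sample
# log-likelihoods. This is NOT the authors' implementation.
import torch
import torch.nn.functional as F

def av_reward(video_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    """Alignment reward: cosine similarity between L2-normalized embeddings."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    return (v * a).sum(dim=-1)                      # shape: (batch,)

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: raise the policy's log-likelihood margin for the
    preferred ("winner") sample above the frozen reference model's margin."""
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage: two candidate audios per video, ranked by the AV reward.
B, D = 4, 512
video = torch.randn(B, D)
audio_a, audio_b = torch.randn(B, D), torch.randn(B, D)
a_wins = av_reward(video, audio_a) >= av_reward(video, audio_b)

# Placeholder log-likelihoods the policy / reference assign to candidates A, B.
policy_logp = torch.randn(B, 2)
ref_logp = torch.randn(B, 2)
idx_w = torch.where(a_wins, 0, 1).unsqueeze(1)      # column index of the winner
idx_l = 1 - idx_w
loss = dpo_loss(
    policy_logp.gather(1, idx_w).squeeze(1), policy_logp.gather(1, idx_l).squeeze(1),
    ref_logp.gather(1, idx_w).squeeze(1), ref_logp.gather(1, idx_l).squeeze(1),
)
```

In the paper's actual AVP-RPO, the reward presumably combines semantic and temporal alignment from both SF-CAVP streams; the sketch collapses this to a single cosine score for brevity.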
Related papers
- JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation [112.614973927778]
Joint audio-video generation (JAVG) produces synchronized and semantically aligned sound and vision from textual descriptions. This paper presents JavisDiT++, a framework for unified modeling and optimization of JAVG. Our model achieves state-of-the-art performance with only around 1M public training samples.
arXiv Detail & Related papers (2026-02-22T12:44:28Z) - GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining [64.72014392166625]
GMS-CAVP is a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, we go beyond traditional contrastive learning by incorporating a diffusion-based generative objective, enabling modality translation and synthesis between video and audio.
arXiv Detail & Related papers (2026-01-27T13:43:32Z) - Omni2Sound: Towards Unified Video-Text-to-Audio Generation [56.11583645408007]
Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility. SoundAtlas is a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. We propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities.
arXiv Detail & Related papers (2026-01-06T05:49:41Z) - PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation [57.864929968616586]
Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning.
arXiv Detail & Related papers (2025-11-24T07:11:12Z) - Training-Free Multimodal Guidance for Video to Audio Generation [22.64037676707457]
Video-to-audio (V2A) generation aims to synthesize realistic and semantically aligned audio from silent videos. Existing approaches either require costly joint training on large-scale paired datasets or rely on pairwise similarities. We propose a novel training-free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment (see the sketch after this list).
arXiv Detail & Related papers (2025-09-29T10:00:36Z) - SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation [50.03810359300705]
SpA2V decomposes the generation process into two stages: audio-guided video planning and layout-grounded video generation. We show that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audio.
arXiv Detail & Related papers (2025-08-01T17:05:04Z) - PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling [78.61911985138795]
We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. We propose the Predictive Future Modeling (PreFM) framework, which features predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues. Experiments show PreFM outperforms state-of-the-art methods by a large margin with significantly fewer parameters.
arXiv Detail & Related papers (2025-05-29T06:46:19Z) - YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls [10.429203168607147]
YingSound is a foundation model designed for video-guided sound generation. It supports high-quality audio generation in few-shot settings. We show that YingSound effectively generates high-quality synchronized sounds through automated evaluations and human studies.
arXiv Detail & Related papers (2024-12-12T10:55:57Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network that devotes the main training parameters to multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models [12.898486592791604]
We present Diff-Foley, a synchronized video-to-audio synthesis method based on a latent diffusion model (LDM).
We show Diff-Foley achieves state-of-the-art V2A performance on the current large-scale V2A dataset.
arXiv Detail & Related papers (2023-06-29T12:39:58Z) - MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process.
Experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z) - AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization [14.103742565510387]
We introduce AVE-CLIP, a novel framework that integrates AudioCLIP, pre-trained on large-scale audio-visual data, with a multi-window temporal transformer.
Our method achieves state-of-the-art performance on the publicly available AVE dataset with a 5.9% mean accuracy improvement.
arXiv Detail & Related papers (2022-10-11T00:15:45Z) - Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR [35.7018440502825]
In a multi-stream paradigm, improving robustness requires handling a variety of unseen single-stream conditions and inter-stream dynamics.
We introduce a two-stage augmentation scheme focusing on mismatch scenarios.
Compared with the previous training strategy, substantial improvements are reported with relative word error rate reductions of 29.7-59.3%.
arXiv Detail & Related papers (2021-02-05T08:36:58Z)
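As a side note on the Training-Free Multimodal Guidance entry above: one natural reading of "the volume spanned by the modality embeddings" is the Gram-determinant volume of the stacked unit-norm video, text, and audio embeddings. The sketch below implements that reading as an assumption; the paper's actual guidance term may be defined differently.

```python
# Hypothetical sketch: "volume spanned by modality embeddings" read as the
# Gram determinant of unit-norm embeddings. Smaller volume means the vectors
# are closer to linearly dependent, i.e. better aligned across modalities.
import torch
import torch.nn.functional as F

def embedding_volume(*embeddings: torch.Tensor) -> torch.Tensor:
    """det(E E^T) for the (k, d) matrix E of k stacked unit-norm embeddings:
    the squared volume of the parallelepiped they span."""
    e = torch.stack([F.normalize(x, dim=-1) for x in embeddings])
    gram = e @ e.T                      # (k, k) matrix of cosine similarities
    return torch.linalg.det(gram)

video_emb, text_emb, audio_emb = (torch.randn(512) for _ in range(3))
print(embedding_volume(video_emb, text_emb, audio_emb))  # near 1.0 for random vectors
```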
This list is automatically generated from the titles and abstracts of the papers on this site.