SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos
- URL: http://arxiv.org/abs/2510.02916v1
- Date: Fri, 03 Oct 2025 11:37:50 GMT
- Title: SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos
- Authors: Amir Dellali, Luca A. Lanzendörfer, Florian Grötschla, Roger Wattenhofer,
- Abstract summary: We propose SALSA-V, a multimodal video-to-audio generation model capable of synthesizing highly synchronized, high-fidelity long-form audio from silent video content.<n>Our approach introduces a masked diffusion objective, enabling audio-conditioned generation and the seamless synthesis of audio sequences of unconstrained length.<n>We demonstrate that SALSA-V significantly outperforms existing state-of-the-art methods in both audiovisual alignment and synchronization with video content in quantitative evaluation and a human listening study.
- Score: 38.40457780873775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose SALSA-V, a multimodal video-to-audio generation model capable of synthesizing highly synchronized, high-fidelity long-form audio from silent video content. Our approach introduces a masked diffusion objective, enabling audio-conditioned generation and the seamless synthesis of audio sequences of unconstrained length. Additionally, by integrating a shortcut loss into our training process, we achieve rapid generation of high-quality audio samples in as few as eight sampling steps, paving the way for near-real-time applications without requiring dedicated fine-tuning or retraining. We demonstrate that SALSA-V significantly outperforms existing state-of-the-art methods in both audiovisual alignment and synchronization with video content in quantitative evaluation and a human listening study. Furthermore, our use of random masking during training enables our model to match spectral characteristics of reference audio samples, broadening its applicability to professional audio synthesis tasks such as Foley generation and sound design.
Related papers
- Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models [42.75068463173552]
We tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing.<n>Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation.<n>We show in our experiments that our proposed method could achieve remarkable results on long-video to audio benchmarks, beating prior works in video-to-audio tasks.
arXiv Detail & Related papers (2026-02-24T15:01:39Z) - ALIVE: Animate Your World with Lifelike Audio-Video Generation [50.693986608051716]
ALIVE is a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation.<n>To support the audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch.<n>ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions.
arXiv Detail & Related papers (2026-02-09T14:06:03Z) - ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation [55.76423101183408]
ViSAudio is an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture.<n>It generates high-quality audio with spatial immersion that adapts to viewpoint changes, sound-source motion, and diverse acoustic environments.
arXiv Detail & Related papers (2025-12-02T18:56:12Z) - AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation [24.799628787198397]
AudioGen- Omni generates high-fidelity audio, speech, and song coherently synchronized with the input video.<n>Joint training paradigm integrates large-scale video-text-audio corpora.<n>Dense frame-level representations are fused using an AdaLN-based joint attention mechanism.<n>With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.
arXiv Detail & Related papers (2025-08-01T16:03:57Z) - Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance [33.1393328136321]
We propose a step-by-step video-to-audio (V2A) generation method for finer controllability over the generation process and more realistic audio synthesis.<n>Inspired by traditional Foley, our approach aims to capture all sound events induced by a video through the incremental generation of missing sound events.
arXiv Detail & Related papers (2025-06-26T04:20:08Z) - Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation [27.20097004987987]
We propose a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content.<n>Our experiments show that Kling-Foley trained with the flow matching objective achieves new audio-visual SOTA performance.
arXiv Detail & Related papers (2025-06-24T16:39:39Z) - MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis [56.01110988816489]
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio.<n> MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples.<n> MMAudio achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance.
arXiv Detail & Related papers (2024-12-19T18:59:55Z) - SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z) - Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.<n>We propose Frieren, a V2A model based on rectified flow matching.<n>Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z) - FoleyGAN: Visually Guided Generative Adversarial Network-Based
Synchronous Sound Generation in Silent Videos [0.0]
We introduce a novel task of guiding a class conditioned generative adversarial network with the temporal visual information of a video input for visual to sound generation task.
Our proposed FoleyGAN model is capable of conditioning action sequences of visual events leading towards generating visually aligned realistic sound tracks.
arXiv Detail & Related papers (2021-07-20T04:59:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.