AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
- URL: http://arxiv.org/abs/2412.15191v1
- Date: Thu, 19 Dec 2024 18:57:21 GMT
- Title: AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
- Authors: Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov
- Abstract summary: AV-Link is a unified framework for Video-to-Audio and Audio-to-Video generation.
We propose a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models.
We evaluate our design choices and demonstrate the ability of our method to achieve synchronized and high-quality audiovisual content.
- Score: 49.6922496382879
- License:
- Abstract: We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally-aligned self-attention operation. Unlike prior work that uses feature extractors pretrained for other tasks for the conditioning signal, AV-Link can directly leverage features obtained by the complementary modality in a single framework, i.e., video features to generate audio, or audio features to generate video. We extensively evaluate our design choices and demonstrate the ability of our method to achieve synchronized and high-quality audiovisual content, showcasing its potential for applications in immersive media generation. Project Page: snap-research.github.io/AVLink/
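The abstract describes the Fusion Block as a temporally-aligned self-attention over activations drawn from the two frozen backbones. The PyTorch sketch below illustrates one plausible reading of that idea; the module names, feature dimensions, and the sinusoidal timestamp embedding are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    """Sketch of a temporally-aligned fusion block (hypothetical shapes and names)."""

    def __init__(self, video_dim=1024, audio_dim=768, dim=512, num_heads=8):
        super().__init__()
        self.video_in = nn.Linear(video_dim, dim)
        self.audio_in = nn.Linear(audio_dim, dim)
        self.modality_emb = nn.Embedding(2, dim)  # 0 = video token, 1 = audio token
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_out = nn.Linear(dim, video_dim)
        self.audio_out = nn.Linear(dim, audio_dim)

    @staticmethod
    def time_emb(t, dim):
        # Sinusoidal embedding of timestamps normalized to [0, 1]; placing both
        # modalities on one shared clock is what makes the attention "temporally aligned".
        half = dim // 2
        freqs = torch.exp(-4.0 * torch.arange(half, dtype=torch.float32) / half)
        ang = t[..., None] * freqs                          # (B, T, dim // 2)
        return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

    def forward(self, video_feats, audio_feats, video_times, audio_times):
        # Project features from each frozen backbone to a shared width and tag them.
        v = self.video_in(video_feats) + self.modality_emb.weight[0]
        a = self.audio_in(audio_feats) + self.modality_emb.weight[1]
        v = v + self.time_emb(video_times, v.shape[-1])
        a = a + self.time_emb(audio_times, a.shape[-1])
        x = torch.cat([v, a], dim=1)                        # joint video+audio token sequence
        h = self.norm(x)
        h, _ = self.attn(h, h, h)                           # bidirectional information exchange
        x = x + h                                           # residual connection
        v_out, a_out = x.split([v.shape[1], a.shape[1]], dim=1)
        # Conditioning signals to feed back into the respective diffusion backbones.
        return self.video_out(v_out), self.audio_out(a_out)


if __name__ == "__main__":
    B, Tv, Ta = 2, 16, 64
    block = FusionBlock()
    video_feats, audio_feats = torch.randn(B, Tv, 1024), torch.randn(B, Ta, 768)
    # Shared normalized timeline: 16 video frames and 64 audio latent steps per clip.
    video_times = torch.linspace(0, 1, Tv).expand(B, Tv)
    audio_times = torch.linspace(0, 1, Ta).expand(B, Ta)
    v_cond, a_cond = block(video_feats, audio_feats, video_times, audio_times)
    print(v_cond.shape, a_cond.shape)  # torch.Size([2, 16, 1024]) torch.Size([2, 64, 768])
```

In this reading, attention runs over the concatenated video and audio token sequences, so the exchange is bidirectional, and the shared normalized timeline is what aligns tokens across modalities; the outputs would serve as conditioning signals for the complementary backbone.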
Related papers
- UniForm: A Unified Diffusion Transformer for Audio-Video Generation [46.1185397912308]
UniForm is a unified diffusion transformer designed to enhance cross-modal consistency.
By concatenating auditory and visual information, UniForm learns to generate audio and video simultaneously.
Experiments demonstrate the superior performance of our method in joint audio-video generation, audio-guided video generation, and video-guided audio generation tasks.
arXiv Detail & Related papers (2025-02-06T09:18:30Z)
- AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation [62.682428307810525]
We introduce AVS-Mamba, a selective state space model to address the audio-visual segmentation task.
Our framework incorporates two key components for video understanding and cross-modal learning.
Our approach achieves new state-of-the-art results on the AVSBench-object and AVS-semantic datasets.
arXiv Detail & Related papers (2025-01-14T03:20:20Z)
- TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation [4.019144083959918]
We present TANGO, a framework for generating co-speech body-gesture videos.
Given a few-minute, single-speaker reference video, TANGO produces high-fidelity videos with synchronized body gestures.
arXiv Detail & Related papers (2024-10-05T16:30:46Z)
- From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation [17.95017332858846]
We introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation.
VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively.
Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features.
arXiv Detail & Related papers (2024-09-27T20:26:34Z)
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching (a generic sketch of this objective appears after this entry).
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
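The Frieren summary names rectified flow matching without spelling it out. The sketch below shows the generic rectified-flow training objective (straight-line interpolation between noise and data, with the network regressing the constant velocity); it is a generic illustration, not Frieren's code, and the `model(xt, t, cond)` signature and the conditioning argument (video features in the V2A setting) are assumptions.

```python
import torch
import torch.nn as nn


def rectified_flow_loss(model: nn.Module, x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """x1: clean target latents; cond: conditioning features (e.g. video); model predicts velocity."""
    x0 = torch.randn_like(x1)                      # noise endpoint of the straight path
    t = torch.rand(x1.shape[0], device=x1.device)  # per-sample time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # reshape for broadcasting over feature dims
    xt = (1.0 - t_) * x0 + t_ * x1                 # linear interpolation x_t
    v_target = x1 - x0                             # constant velocity of the straight path
    v_pred = model(xt, t, cond)                    # assumed signature: (x_t, t, cond) -> velocity
    return (v_pred - v_target).pow(2).mean()       # flow-matching regression loss
```

At sampling time one would integrate dx/dt = v(x, t, cond) from Gaussian noise toward data with a small number of Euler steps, which is what makes rectified-flow models comparatively fast at inference.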
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models [12.898486592791604]
We present Diff-Foley, a synchronized Video-to-Audio synthesis method based on a latent diffusion model (LDM).
We show that Diff-Foley achieves state-of-the-art V2A performance on a current large-scale V2A dataset.
arXiv Detail & Related papers (2023-06-29T12:39:58Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.