UniForm: A Unified Diffusion Transformer for Audio-Video Generation
- URL: http://arxiv.org/abs/2502.03897v2
- Date: Sat, 08 Feb 2025 09:37:13 GMT
- Title: UniForm: A Unified Diffusion Transformer for Audio-Video Generation
- Authors: Lei Zhao, Linfeng Feng, Dongxu Ge, Fangqiu Yi, Chi Zhang, Xiao-Lei Zhang, Xuelong Li
- Abstract summary: UniForm is a unified diffusion transformer designed to enhance cross-modal consistency.
By concatenating auditory and visual information, UniForm learns to generate audio and video simultaneously.
Experiments demonstrate the superior performance of our method in joint audio-video generation, audio-guided video generation, and video-guided audio generation tasks.
- Score: 46.1185397912308
- License:
- Abstract: As naturally multimodal content, audible video delivers an immersive sensory experience. Consequently, audio-video generation systems have substantial potential. However, existing diffusion-based studies mainly employ relatively independent modules for generating each modality, leaving shared-weight generative modules largely unexplored. This approach may underexploit the intrinsic correlations between the audio and visual modalities, potentially resulting in sub-optimal generation quality. To address this, we propose UniForm, a unified diffusion transformer designed to enhance cross-modal consistency. By concatenating auditory and visual information, UniForm learns to generate audio and video simultaneously within a unified latent space, facilitating the creation of high-quality and well-aligned audio-visual pairs. Extensive experiments demonstrate the superior performance of our method in joint audio-video generation, audio-guided video generation, and video-guided audio generation tasks. Our demos are available at https://uniform-t2av.github.io/.
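The abstract gives no implementation details, so the following is a minimal PyTorch sketch of the stated idea only: audio and video latent tokens are concatenated into one sequence and denoised by a single shared transformer. The module, its dimensions, and the learned modality embeddings are illustrative assumptions, not UniForm's actual architecture.

```python
import torch
import torch.nn as nn

class UnifiedAVDenoiser(nn.Module):
    """Illustrative sketch: one shared transformer denoises audio and video
    latent tokens jointly by operating on their concatenation. Names and
    sizes are assumptions, not the paper's configuration."""

    def __init__(self, dim=512, depth=8, heads=8):
        super().__init__()
        # Learned embeddings marking which tokens belong to which modality
        # after concatenation into a single sequence.
        self.audio_mod = nn.Parameter(torch.zeros(1, 1, dim))
        self.video_mod = nn.Parameter(torch.zeros(1, 1, dim))
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, dim)  # per-token noise prediction

    def forward(self, audio_tokens, video_tokens, t):
        # audio_tokens: (B, Na, dim), video_tokens: (B, Nv, dim), t: (B,)
        a = audio_tokens + self.audio_mod
        v = video_tokens + self.video_mod
        cond = self.time_mlp(t[:, None].float())[:, None, :]   # (B, 1, dim)
        x = torch.cat([a, v], dim=1) + cond                    # one joint sequence
        eps = self.head(self.backbone(x))                      # joint denoising pass
        # Split the prediction back into per-modality noise estimates.
        return eps[:, : a.size(1)], eps[:, a.size(1):]

# Toy usage: predict noise for a batch of audio and video latent tokens.
model = UnifiedAVDenoiser()
eps_a, eps_v = model(torch.randn(2, 16, 512), torch.randn(2, 64, 512), torch.tensor([10, 250]))
```

Because the backbone weights are shared across the concatenated sequence, every attention layer mixes audio and video tokens, which is the cross-modal coupling the abstract attributes to the unified latent space.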
Related papers
- Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis [56.01110988816489]
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework, MMAudio.
MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples.
MMAudio achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance.
arXiv Detail & Related papers (2024-12-19T18:59:55Z)
- AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation [49.6922496382879]
AV-Link is a unified framework for Video-to-Audio and Audio-to-Video generation.
We propose a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models.
We evaluate our design choices and demonstrate the ability of our method to achieve synchronized and high-quality audiovisual content.
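The summary names a Fusion Block for bidirectional information exchange but gives no internals; below is a minimal sketch of one plausible realization using cross-attention in both directions between temporally aligned video and audio features. The class name, shapes, and residual design are assumptions, not AV-Link's published module.

```python
import torch
import torch.nn as nn

class BidirectionalFusionBlock(nn.Module):
    """Hypothetical fusion block: video features attend to audio features
    and vice versa, so intermediate activations from each diffusion
    backbone condition the other. An assumed design, not the paper's."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.v_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video_feats, audio_feats):
        # video_feats: (B, Tv, dim) from the video diffusion backbone
        # audio_feats: (B, Ta, dim) from the audio diffusion backbone
        v_ctx, _ = self.v_from_a(self.norm_v(video_feats), audio_feats, audio_feats)
        a_ctx, _ = self.a_from_v(self.norm_a(audio_feats), video_feats, video_feats)
        # Residual exchange in both directions.
        return video_feats + v_ctx, audio_feats + a_ctx

fusion = BidirectionalFusionBlock()
v, a = fusion(torch.randn(2, 64, 512), torch.randn(2, 128, 512))
```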
arXiv Detail & Related papers (2024-12-19T18:57:21Z)
- Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation [32.24603883810094]
Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models.
We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions, including moving and multiple sound sources.
By leveraging spatial guidance, our unified model achieves the objective of generating immersive and controllable spatial audio from text and image.
arXiv Detail & Related papers (2024-10-14T16:18:29Z)
- Read, Watch and Scream! Sound Generation from Text and Video [23.990569918960315]
Video serves as a conditional control for a text-to-audio generation model.
We employ a well-performing text-to-audio model to consolidate the video control.
Our method shows superiority in terms of quality, controllability, and training efficiency.
arXiv Detail & Related papers (2024-07-08T01:59:17Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation is a core technique for the film industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of these techniques from academia to industry.
In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets, demonstrating significant semantic diversity of the generated audio-video samples.
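The summary describes reusing a text-conditioned video generator together with a pre-trained audio encoder; a hedged sketch of one way such components could be connected is shown below, where a small learned adapter maps pooled audio embeddings into pseudo prompt tokens for the frozen video model. All names and dimensions here are hypothetical, not the paper's actual adaptation scheme.

```python
import torch
import torch.nn as nn

class AudioToTextCondAdapter(nn.Module):
    """Hypothetical adapter: maps embeddings from a frozen pre-trained audio
    encoder into the conditioning space of a frozen text-conditioned video
    generator, so audio can stand in for the text prompt. Dimensions and the
    two-layer design are assumptions."""

    def __init__(self, audio_dim=768, cond_dim=1024, n_tokens=8):
        super().__init__()
        self.n_tokens = n_tokens
        self.cond_dim = cond_dim
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim * n_tokens),
        )

    def forward(self, audio_emb):
        # audio_emb: (B, audio_dim) pooled embedding of the conditioning audio.
        # Returns (B, n_tokens, cond_dim) pseudo prompt tokens for the video model.
        return self.proj(audio_emb).view(-1, self.n_tokens, self.cond_dim)

adapter = AudioToTextCondAdapter()
prompt_tokens = adapter(torch.randn(4, 768))  # fed in place of text-encoder output
```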
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion is built around a sequential multi-modal U-Net designed for a joint denoising process.
Experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)