UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation
- URL: http://arxiv.org/abs/2502.03897v4
- Date: Tue, 15 Apr 2025 06:53:12 GMT
- Title: UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation
- Authors: Lei Zhao, Linfeng Feng, Dongxu Ge, Rujin Chen, Fangqiu Yi, Chi Zhang, Xiao-Lei Zhang, Xuelong Li
- Abstract summary: UniForm is a unified multi-task diffusion transformer that jointly generates audio and visual modalities in a shared latent space. A single diffusion process models both audio and video, capturing the inherent correlations between sound and vision. By leveraging large language models and a large-scale text-audio-video combined dataset, UniForm achieves greater generative diversity than prior approaches.
- Score: 44.21422404659117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rise of diffusion models, audio-video generation has been revolutionized. However, most existing methods rely on separate modules for each modality, with limited exploration of unified generative architectures. In addition, many are confined to a single task and small-scale datasets. To address these limitations, we first propose UniForm, a unified multi-task diffusion transformer that jointly generates audio and visual modalities in a shared latent space. A single diffusion process models both audio and video, capturing the inherent correlations between sound and vision. Second, we introduce task-specific noise schemes and task tokens, enabling a single model to support multiple tasks, including text-to-audio-video, audio-to-video, and video-to-audio generation. Furthermore, by leveraging large language models and a large-scale text-audio-video combined dataset, UniForm achieves greater generative diversity than prior approaches. Extensive experiments show that UniForm achieves state-of-the-art performance across audio-video generation tasks, producing content that is both well-aligned and close to real-world data distributions. Our demos are available at https://uniform-t2av.github.io/.
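To picture the two mechanisms named in the abstract, a single diffusion process over audio and video latents steered by task tokens and task-specific noise schemes, here is a minimal PyTorch sketch. All module names, shapes, and the three task labels below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code) of a unified
# audio-video diffusion transformer: one backbone denoises concatenated
# audio and video latent tokens, a task token selects the task, and a
# task-specific noise scheme keeps the conditioning modality clean.
import torch
import torch.nn as nn

TASKS = {"t2av": 0, "a2v": 1, "v2a": 2}  # text->audio+video, audio->video, video->audio

class UnifiedAVDiT(nn.Module):
    def __init__(self, dim=512, depth=8, heads=8):
        super().__init__()
        self.task_emb = nn.Embedding(len(TASKS), dim)   # learned task tokens
        self.time_emb = nn.Linear(1, dim)               # diffusion timestep embedding
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, dim)                 # predicts per-token noise

    def forward(self, audio_lat, video_lat, text_ctx, t, task):
        # audio_lat: (B, Na, dim), video_lat: (B, Nv, dim), text_ctx: (B, Nt, dim), t: (B,)
        b = audio_lat.size(0)
        task_tok = self.task_emb(torch.full((b, 1), TASKS[task], dtype=torch.long,
                                            device=audio_lat.device))
        time_tok = self.time_emb(t.float().view(b, 1, 1))
        seq = torch.cat([task_tok, time_tok, text_ctx, audio_lat, video_lat], dim=1)
        return self.head(self.backbone(seq))            # joint denoising over both modalities

def task_specific_noising(audio_lat, video_lat, alpha_t, task):
    """Noise only the modalities being generated; the conditioning modality stays clean.
    alpha_t: cumulative signal level at step t, tensor of shape (B, 1, 1)."""
    noise_a, noise_v = torch.randn_like(audio_lat), torch.randn_like(video_lat)
    noisy_a = alpha_t.sqrt() * audio_lat + (1 - alpha_t).sqrt() * noise_a
    noisy_v = alpha_t.sqrt() * video_lat + (1 - alpha_t).sqrt() * noise_v
    if task == "a2v":   # audio is the condition
        noisy_a = audio_lat
    if task == "v2a":   # video is the condition
        noisy_v = video_lat
    return noisy_a, noisy_v, noise_a, noise_v
```

Under these assumptions, the training loss would be computed only on the modalities that were actually noised, and at inference the clean conditioning latents would be re-injected at every denoising step.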
Related papers
- AudioX: Diffusion Transformer for Anything-to-Audio Generation [72.84633243365093]
AudioX is a unified Diffusion Transformer model for Anything-to-Audio and Music Generation.
It can generate both general audio and music with high quality, while offering flexible natural language control.
To address data scarcity, we curate two datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset.
arXiv Detail & Related papers (2025-03-13T16:30:59Z)
- AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation [49.6922496382879]
AV-Link is a unified framework for Video-to-Audio and Audio-to-Video generation. We propose a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models. We evaluate our design choices and demonstrate the ability of our method to achieve synchronized and high-quality audiovisual content.
arXiv Detail & Related papers (2024-12-19T18:57:21Z)
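The Fusion Block is only named in the abstract above; one plausible reading of bidirectional information exchange between temporally aligned video and audio diffusion features is a pair of symmetric cross-attention layers. The sketch below is a hedged PyTorch illustration, not the AV-Link implementation.

```python
# Hypothetical sketch of a bidirectional fusion block: video features attend
# to audio features and vice versa, then each stream is updated residually.
import torch.nn as nn

class BidirectionalFusionBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.v_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video_feats, audio_feats):
        # video_feats: (B, Tv, dim) from the video diffusion backbone
        # audio_feats: (B, Ta, dim) from the audio diffusion backbone
        v_upd, _ = self.v_from_a(self.norm_v(video_feats), audio_feats, audio_feats)
        a_upd, _ = self.a_from_v(self.norm_a(audio_feats), video_feats, video_feats)
        return video_feats + v_upd, audio_feats + a_upd  # exchange in both directions
```

A block like this would typically be inserted at matching temporal resolutions in both backbones so that the features are aligned before the exchange.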
- YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls [10.429203168607147]
YingSound is a foundation model designed for video-guided sound generation.
It supports high-quality audio generation in few-shot settings.
We show that YingSound effectively generates high-quality synchronized sounds through automated evaluations and human studies.
arXiv Detail & Related papers (2024-12-12T10:55:57Z)
- Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios.
arXiv Detail & Related papers (2024-09-27T12:06:53Z)
- Read, Watch and Scream! Sound Generation from Text and Video [23.990569918960315]
Video serves as a conditional control for a text-to-audio generation model. We employ a well-performing text-to-audio model to consolidate the video control. Our method shows superiority in terms of quality, controllability, and training efficiency.
arXiv Detail & Related papers (2024-07-08T01:59:17Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
By design, MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process.
Experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
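MM-Diffusion's joint denoising is only described at a high level above. The sketch below, with assumed module names and shapes, shows the general idea of two coupled branches that denoise audio and video together, each conditioning on the other modality's noisy latent at every step; it is an illustration rather than the MM-Diffusion architecture.

```python
# Illustrative joint denoising step for coupled audio and video branches.
# Each branch sees its own noisy latent plus the other modality's noisy
# latent, so audio and video are denoised together rather than separately.
import torch
import torch.nn as nn

class JointAVDenoiser(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Stand-ins for modality-specific denoising networks (e.g. U-Nets).
        self.audio_net = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.video_net = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, noisy_audio, noisy_video):
        # noisy_audio, noisy_video: (B, T, dim), pooled to the same shape for simplicity
        audio_in = torch.cat([noisy_audio, noisy_video], dim=-1)   # audio branch sees video
        video_in = torch.cat([noisy_video, noisy_audio], dim=-1)   # video branch sees audio
        eps_audio = self.audio_net(audio_in)    # predicted noise for the audio latent
        eps_video = self.video_net(video_in)    # predicted noise for the video latent
        return eps_audio, eps_video
```

Coupling the two branches at every step is what lets a joint model capture audio-video correlations that two independently trained diffusion models would miss.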
This list is automatically generated from the titles and abstracts of the papers in this site.