MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and
Video Generation
- URL: http://arxiv.org/abs/2212.09478v2
- Date: Fri, 24 Mar 2023 12:13:52 GMT
- Title: MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and
Video Generation
- Authors: Ludan Ruan and Yiyang Ma and Huan Yang and Huiguo He and Bei Liu and
Jianlong Fu and Nicholas Jing Yuan and Qin Jin and Baining Guo
- Abstract summary: We propose the first joint audio-video generation framework that delivers engaging watching and listening experiences simultaneously.
MM-Diffusion is built around a sequential multi-modal U-Net designed for a joint denoising process.
Experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks.
- Score: 70.74377373885645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose the first joint audio-video generation framework that delivers
engaging watching and listening experiences simultaneously, toward high-quality,
realistic videos. To generate joint audio-video pairs, we propose a novel
Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising
autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion
consists of a sequential multi-modal U-Net designed for a joint denoising process.
Two subnets for audio and video learn to gradually generate aligned audio-video
pairs from Gaussian noise. To ensure semantic consistency across modalities, we
propose a novel random-shift-based attention block that bridges the two subnets,
enabling efficient cross-modal alignment so that the audio and video reinforce
each other's fidelity. Extensive experiments show superior results in unconditional
audio-video generation and zero-shot conditional tasks (e.g., video-to-audio). In
particular, we achieve the best FVD and FAD on the Landscape and AIST++ dancing
datasets. Turing tests with 10k votes further demonstrate dominant preferences for
our model. The code and pre-trained models can be downloaded at
https://github.com/researchmm/MM-Diffusion.
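To make the coupled-subnet design concrete, the following is a minimal PyTorch sketch of a random-shift cross-modal attention block in the spirit of the abstract's description. The class name, tensor shapes, and windowing details are illustrative assumptions, not the released implementation (which is available in the repository linked above).

```python
# Hypothetical sketch of a random-shift cross-modal attention block.
import torch
import torch.nn as nn


class RandomShiftCrossAttention(nn.Module):
    """Cross-attention from video tokens to a randomly shifted window of
    audio tokens, approximating full cross-modal attention at lower cost."""

    def __init__(self, dim: int, heads: int = 4, max_shift: int = 4):
        super().__init__()
        self.max_shift = max_shift
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, Nv, C); audio_tokens: (B, Na, C)
        num_audio = audio_tokens.shape[1]
        num_video = video_tokens.shape[1]
        # Draw a fresh random shift so successive denoising steps attend to
        # different audio positions.
        shift = int(torch.randint(0, self.max_shift, (1,)))
        idx = (torch.arange(num_video, device=audio_tokens.device) + shift) % num_audio
        audio_window = audio_tokens[:, idx]                   # (B, Nv, C)
        fused, _ = self.attn(video_tokens, audio_window, audio_window)
        return video_tokens + fused                           # residual update


# Toy usage: 2 clips, 128 video tokens and 256 audio tokens of width 64.
block = RandomShiftCrossAttention(dim=64)
video = torch.randn(2, 128, 64)
audio = torch.randn(2, 256, 64)
print(block(video, audio).shape)  # torch.Size([2, 128, 64])
```

Drawing a fresh shift at every forward pass means different local audio windows are attended to across denoising steps, which is one way such a block can approximate full cross-modal attention at a fraction of its cost.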
Related papers
- MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation [14.28357169715152]
We introduce a novel multi-modal latent diffusion model (MM-LDM) for sounding video generation.
We first unify the representation of audio and video data by converting them into a single or a couple of images.
Then, we introduce a hierarchical multi-modal autoencoder that constructs a low-level perceptual latent space for each modality and a shared high-level semantic feature space.
arXiv Detail & Related papers (2024-10-02T14:32:24Z)
- A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation [16.7129644489202]
Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to make the model jointly generate audio and video.
To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model.
arXiv Detail & Related papers (2024-09-26T05:39:52Z)
- SF-V: Single Forward Video Generation Model [57.292575082410785]
We propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained models.
Experiments demonstrate that our method achieves competitive quality for synthesized videos with significantly reduced computational overhead.
arXiv Detail & Related papers (2024-06-06T17:58:27Z)
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching (a minimal sketch of this objective appears after this list).
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of these techniques from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes [6.375996974877916]
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models.
arXiv Detail & Related papers (2023-08-14T15:47:25Z)
- Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models [12.898486592791604]
We present Diff-Foley, a synchronized video-to-audio synthesis method built on a latent diffusion model (LDM).
We show Diff-Foley achieves state-of-the-art V2A performance on a current large-scale V2A dataset.
arXiv Detail & Related papers (2023-06-29T12:39:58Z)
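As referenced in the Frieren entry above, below is a minimal, self-contained sketch of a rectified flow matching objective and a few-step Euler sampler for video-conditioned audio latents. The velocity network's signature, the conditioning interface, and all shapes are assumptions made for illustration; this is not Frieren's actual code.

```python
# Hypothetical sketch of rectified flow matching for video-conditioned audio latents.
import torch
import torch.nn as nn


def rectified_flow_loss(model: nn.Module,
                        audio_latents: torch.Tensor,
                        video_cond: torch.Tensor) -> torch.Tensor:
    """Train `model` to predict the constant velocity that moves noise to data
    along the straight path x_t = (1 - t) * noise + t * data."""
    noise = torch.randn_like(audio_latents)
    t = torch.rand(audio_latents.shape[0], device=audio_latents.device)
    t_b = t.view(-1, *([1] * (audio_latents.dim() - 1)))      # broadcastable t
    x_t = (1.0 - t_b) * noise + t_b * audio_latents
    target_velocity = audio_latents - noise                   # straight-line velocity
    pred_velocity = model(x_t, t, video_cond)                 # assumed signature
    return torch.mean((pred_velocity - target_velocity) ** 2)


@torch.no_grad()
def sample(model: nn.Module, video_cond: torch.Tensor, shape, steps: int = 25) -> torch.Tensor:
    """Integrate the learned ODE from noise to audio latents with a few Euler steps."""
    x = torch.randn(shape, device=video_cond.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=x.device)
        x = x + dt * model(x, t, video_cond)
    return x
```

The straight-line interpolation makes the target velocity constant along each path, which is what allows sampling with only a handful of Euler steps.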
This list is automatically generated from the titles and abstracts of the papers on this site.