A Simple but Strong Baseline for Sounding Video Generation: Effective
Adaptation of Audio and Video Diffusion Models for Joint Generation
- URL: http://arxiv.org/abs/2409.17550v1
- Date: Thu, 26 Sep 2024 05:39:52 GMT
- Title: A Simple but Strong Baseline for Sounding Video Generation: Effective
Adaptation of Audio and Video Diffusion Models for Joint Generation
- Authors: Masato Ishii and Akio Hayakawa and Takashi Shibuya and Yuki Mitsufuji
- Abstract summary: Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to jointly generate audio and video.
To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model.
- Score: 16.7129644489202
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we build a simple but strong baseline for sounding video
generation. Given base diffusion models for audio and video, we integrate them
with additional modules into a single model and train it to jointly generate
audio and video. To enhance alignment between audio-video
pairs, we introduce two novel mechanisms in our model. The first one is
timestep adjustment, which provides different timestep information to each base
model. It is designed to align how samples are generated over the timesteps
across the two modalities. The second one is a new design of the additional modules,
termed Cross-Modal Conditioning as Positional Encoding (CMC-PE). In CMC-PE,
cross-modal information is embedded as if it represents temporal position
information, and the embeddings are fed into the model like positional
encoding. Compared with the popular cross-attention mechanism, CMC-PE provides
a better inductive bias for temporal alignment in the generated data.
Experimental results validate the effectiveness of the two newly introduced
mechanisms and also demonstrate that our method outperforms existing methods.
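To make the two mechanisms above concrete, the following is a minimal, illustrative sketch in PyTorch. It is not the authors' implementation; all names (adjust_timestep, CMCPE, the scale parameter, the tensor shapes) are hypothetical. The sketch only shows the general idea: each base model receives its own adjusted timestep, and per-frame cross-modal features are added to the target modality's tokens the way a temporal positional encoding would be, rather than injected through cross-attention.

# Minimal, illustrative sketch (not the authors' code); names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


def adjust_timestep(t_video: torch.Tensor, scale: float = 0.9) -> torch.Tensor:
    # Timestep adjustment (illustrative): hand the audio branch a rescaled timestep
    # so that both modalities traverse comparable noise levels during generation.
    return (t_video * scale).clamp(min=0.0)


class CMCPE(nn.Module):
    # Cross-Modal Conditioning as Positional Encoding (illustrative): per-frame
    # cross-modal features are projected and added to the target modality's token
    # sequence, like a temporal positional encoding, instead of being attended to
    # via cross-attention.
    def __init__(self, cond_dim: int, model_dim: int):
        super().__init__()
        self.proj = nn.Linear(cond_dim, model_dim)

    def forward(self, tokens: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, frames_tok, model_dim), e.g. audio latent tokens
        # cond:   (batch, frames_cond, cond_dim), e.g. per-frame video features
        if cond.shape[1] != tokens.shape[1]:
            # Resample the conditioning onto the token timeline (nearest frame).
            cond = F.interpolate(cond.transpose(1, 2), size=tokens.shape[1],
                                 mode="nearest").transpose(1, 2)
        return tokens + self.proj(cond)  # additive, position-encoding-style injection


# Usage example with dummy tensors.
audio_tokens = torch.randn(2, 16, 256)          # 16 audio frames
video_feats = torch.randn(2, 8, 512)            # 8 video frames
t_video = torch.randint(0, 1000, (2,)).float()

cmc_pe = CMCPE(cond_dim=512, model_dim=256)
conditioned = cmc_pe(audio_tokens, video_feats)  # (2, 16, 256)
t_audio = adjust_timestep(t_video)               # per-modality timestep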
Related papers
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching (a generic flow-matching sketch follows this list).
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling [21.380988939240844]
We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio.
We propose a joint contrastive training loss to improve the synchronization between visual and auditory occurrences.
arXiv Detail & Related papers (2023-12-08T23:55:19Z)
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
- An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization [58.88327181933151]
In this paper, we propose an efficient query-based membership inference attack (MIA).
Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models.
To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the text-to-speech task.
arXiv Detail & Related papers (2023-05-26T16:38:48Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech from a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the audio given the video.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z)
- TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z)
- Audio-Visual Decision Fusion for WFST-based and seq2seq Models [3.2771898634434997]
Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER).
We propose novel methods to fuse information from audio and visual modalities at inference time.
We show that our methods give significant improvements over acoustic-only WER.
arXiv Detail & Related papers (2020-01-29T13:45:08Z)
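For context on the Frieren entry above, here is a generic rectified flow matching training loss in its textbook form with linear interpolation paths. This is not taken from the Frieren paper; v_theta is a hypothetical velocity network regressed toward the straight-line direction from a noise sample x0 to a data sample x1.

# Generic rectified-flow-matching loss (textbook form, not Frieren's implementation).
import torch

def rectified_flow_matching_loss(v_theta, x1: torch.Tensor) -> torch.Tensor:
    x0 = torch.randn_like(x1)                             # noise endpoint
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))  # uniform t in [0, 1), broadcastable
    xt = (1.0 - t) * x0 + t * x1                          # straight-line interpolation
    target = x1 - x0                                      # constant velocity along the line
    return ((v_theta(xt, t) - target) ** 2).mean()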