Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition
- URL: http://arxiv.org/abs/2503.06984v1
- Date: Mon, 10 Mar 2025 07:04:03 GMT
- Title: Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition
- Authors: Juncheng Wang, Chao Xu, Cheng Yu, Lei Shang, Zhe Hu, Shujun Wang, Liefeng Bo
- Abstract summary: We decompose the mel-spectrogram into three distinct types of signals, applying quantization or continuity to each. These signals can be effectively predicted from video by a devised video-to-all (V2X) predictor. The predicted signals are then recomposed and fed into a ControlNet, along with a textual inversion design, to control the audio generation process.
- Score: 31.25956665297592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control mature text-to-audio generative diffusion models, this paper shows how to balance the representation of mel-spectrograms in terms of completeness and complexity through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). We decompose the mel-spectrogram into three distinct types of signals, applying quantization or continuity to each, so that they can be effectively predicted from video by a devised video-to-all (V2X) predictor. The predicted signals are then recomposed and fed into a ControlNet, along with a textual inversion design, to control the audio generation process. Our proposed Mel-QCD method demonstrates state-of-the-art performance across eight metrics, evaluating dimensions such as quality, synchronization, and semantic consistency. Our code and demos will be released at \href{https://wjc2830.github.io/MelQCD/}{Website}.
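To make the decomposition concrete, here is a minimal sketch of one plausible quantization-continuum split of a mel-spectrogram. The abstract does not spell out the three signal types, so the choice below of a continuous per-frame energy envelope, quantized level codes, and a continuous residual is an illustrative assumption, as are all names and the number of levels.

```python
import numpy as np

def mel_qcd_decompose(mel, n_levels=8):
    # mel: (T, M) log-mel spectrogram.
    energy = mel.mean(axis=1, keepdims=True)        # continuous per-frame envelope
    centered = mel - energy                         # loudness-normalized detail
    lo, hi = centered.min(), centered.max()
    step = (hi - lo) / n_levels + 1e-8
    codes = np.clip(((centered - lo) / step).astype(int), 0, n_levels - 1)  # quantized signal
    residual = centered - (lo + (codes + 0.5) * step)                       # continuous residual
    return energy, codes, residual, (lo, step)

def mel_qcd_recompose(energy, codes, residual, lo, step):
    # Exact inverse: the residual keeps whatever the codes discard.
    return energy + lo + (codes + 0.5) * step + residual
```

Because recomposition is exact, nothing is lost by the split; the premise is that the quantized and continuous parts are each easier to predict from video than the raw mel-spectrogram.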
Related papers
- Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound [6.638504164134713]
Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video. Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges. We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as an intuitive condition with semantic timbre prompts.
arXiv Detail & Related papers (2024-08-21T18:06:15Z)
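Video-Foley's intermediate condition is a frame-wise RMS intensity envelope. A minimal sketch of computing one from a waveform, with illustrative frame and hop sizes rather than the paper's settings:

```python
import numpy as np

def frame_rms(wav, frame_len=1024, hop=256):
    # Root-mean-square energy per frame of a 1-D waveform.
    n_frames = 1 + max(0, len(wav) - frame_len) // hop
    frames = np.stack([wav[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return np.sqrt((frames ** 2).mean(axis=1))
```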
- Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity [12.848371604063168]
We propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model.
Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results.
arXiv Detail & Related papers (2024-07-15T01:49:59Z)
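For intuition about sequence-to-sequence masked generation, below is a generic MaskGIT-style parallel decoding step over discrete audio-codec tokens; `model`, `video_feats`, and the unmasking fraction are placeholders, not MaskVAT's actual interfaces or schedule.

```python
import torch

@torch.no_grad()
def masked_decode_step(tokens, mask, model, video_feats, frac=0.5):
    # tokens: (T,) codec ids; mask: (T,) bool, True = still undecided.
    if not mask.any():
        return tokens, mask
    logits = model(tokens, video_feats)          # (T, vocab) predictions
    conf, pred = logits.softmax(-1).max(-1)      # confidence and argmax per slot
    conf[~mask] = float("-inf")                  # never re-decide fixed tokens
    k = max(1, int(mask.sum().item() * frac))    # commit the most confident slots
    idx = conf.topk(k).indices
    tokens[idx] = pred[idx]
    mask[idx] = False
    return tokens, mask
```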
- Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves state-of-the-art results on well-established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z)
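Since audio and video arrive at much higher rates than text but are roughly aligned in time, one natural move is to partition both into the same number of time-aligned chunks before autoregressive modeling. A toy sketch of that idea, not Mirasol3B's actual partitioner:

```python
import numpy as np

def time_aligned_chunks(video_feats, audio_feats, n_chunks):
    # Split (T_v, D_v) video and (T_a, D_a) audio features into n_chunks
    # segments spanning the same time intervals, pairing them by index.
    return (np.array_split(video_feats, n_chunks, axis=0),
            np.array_split(audio_feats, n_chunks, axis=0))
```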
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
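A common way to drive a text-conditioned video generator with audio is to project pre-trained audio-encoder embeddings into the generator's text-token space. The adapter below is a hypothetical sketch with made-up dimensions, not the paper's architecture:

```python
import torch.nn as nn

class AudioToTextTokens(nn.Module):
    # Maps a pooled audio embedding to a few pseudo text tokens that a
    # text-conditioned video model can consume (dimensions are assumptions).
    def __init__(self, audio_dim=768, text_dim=1024, n_tokens=8):
        super().__init__()
        self.proj = nn.Linear(audio_dim, n_tokens * text_dim)
        self.n_tokens, self.text_dim = n_tokens, text_dim

    def forward(self, audio_emb):    # (B, audio_dim)
        return self.proj(audio_emb).view(-1, self.n_tokens, self.text_dim)
```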
- Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation [24.403772976932487]
Sounding Video Generator (SVG) is a unified framework for generating realistic videos along with audio signals.
VQGAN transforms visual frames and audio mel-spectrograms into discrete tokens.
A Transformer-based decoder is used to model associations between texts, visual frames, and audio signals.
arXiv Detail & Related papers (2023-03-29T09:07:31Z)
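Once VQGAN has turned frames and mel-spectrograms into discrete tokens, a decoder-only transformer can model all three streams as one sequence over a merged vocabulary. A schematic of that layout, with id offsets and names assumed rather than taken from the paper:

```python
import torch

def build_multimodal_sequence(text_ids, frame_ids, mel_ids, text_vocab, frame_vocab):
    # Shift each modality into a disjoint id range, then concatenate so a
    # single decoder-only transformer sees one flat token sequence.
    frame_ids = frame_ids + text_vocab
    mel_ids = mel_ids + text_vocab + frame_vocab
    return torch.cat([text_ids, frame_ids, mel_ids], dim=-1)
```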
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
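The joint denoising idea amounts to one network predicting noise for both modalities at every diffusion step, so each stream is denoised conditioned on the other. A schematic clean-sample estimate in DDPM notation; `unet` and `alpha_bar` are placeholders:

```python
import torch

@torch.no_grad()
def joint_x0_estimate(video_x, audio_x, t, unet, alpha_bar):
    # A single multi-modal U-Net returns a noise estimate for each stream.
    eps_v, eps_a = unet(video_x, audio_x, t)
    a = alpha_bar[t]
    video_x0 = (video_x - (1 - a).sqrt() * eps_v) / a.sqrt()
    audio_x0 = (audio_x - (1 - a).sqrt() * eps_a) / a.sqrt()
    return video_x0, audio_x0
```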
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
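The two-component framework reads as a two-line pipeline; both callables stand in for SVTS's trained modules:

```python
def video_to_speech(video, spectrogram_predictor, vocoder):
    # Stage 1: silent video -> mel-spectrogram; Stage 2: a pre-trained
    # neural vocoder inverts the spectrogram to a waveform.
    mel = spectrogram_predictor(video)   # (T, n_mels)
    return vocoder(mel)                  # 1-D waveform
```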
- End to End Lip Synchronization with a Temporal AutoEncoder [95.94432031144716]
We study the problem of syncing the lip movement in a video with the audio stream.
Our solution finds an optimal alignment using a dual-domain recurrent neural network.
As an application, we demonstrate our ability to robustly align text-to-speech generated audio with an existing video stream.
arXiv Detail & Related papers (2022-03-30T12:00:18Z)
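For intuition about optimal audio-video alignment, a plain dynamic-programming (DTW-style) cost over per-frame embeddings is sketched below; the paper instead learns alignment with a dual-domain recurrent network.

```python
import numpy as np

def alignment_cost(video_emb, audio_emb):
    # video_emb: (V, D), audio_emb: (A, D). Classic DTW accumulated cost.
    dist = np.linalg.norm(video_emb[:, None] - audio_emb[None, :], axis=-1)
    V, A = dist.shape
    D = np.full((V + 1, A + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, V + 1):
        for j in range(1, A + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[V, A]
```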
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
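A bare-bones encoder-decoder generator of the kind used in GAN-based video-to-speech might look as follows; layer sizes and the per-frame feature input are illustrative assumptions, not the paper's design.

```python
import torch.nn as nn

class V2SGenerator(nn.Module):
    # Recurrent encoder over per-frame video features; each step is decoded
    # into a block of waveform samples (schematic, not the paper's model).
    def __init__(self, frame_dim=512, hidden=256, samples_per_frame=640):
        super().__init__()
        self.encoder = nn.GRU(frame_dim, hidden, batch_first=True)
        self.decoder = nn.Sequential(nn.Linear(hidden, samples_per_frame), nn.Tanh())

    def forward(self, frames):              # (B, T, frame_dim)
        h, _ = self.encoder(frames)         # (B, T, hidden)
        return self.decoder(h).flatten(1)   # (B, T * samples_per_frame)
```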