Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
- URL: http://arxiv.org/abs/2407.10387v1
- Date: Mon, 15 Jul 2024 01:49:59 GMT
- Title: Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
- Authors: Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan SerrĂ ,
- Abstract summary: We propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio with a sequence-to-sequence masked generative model.
Our results show that, by combining a high-quality with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results.
- Score: 12.848371604063168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-to-audio (V2A) generation leverages visual-only video features to render plausible sounds that match the scene. Importantly, the generated sound onsets should match the visual actions that are aligned with them, otherwise unnatural synchronization artifacts arise. Recent works have explored the progression of conditioning sound generators on still images and then video features, focusing on quality and semantic matching while ignoring synchronization, or by sacrificing some amount of quality to focus on improving synchronization only. In this work, we propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model. This combination allows modeling both high audio quality, semantic matching, and temporal synchronicity at the same time. Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results on one hand, whilst being competitive with the state of the art of non-codec generative audio models. Sample videos and generated audios are available at https://maskvat.github.io .
Related papers
- Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis [28.172213291270868]
Foley is a term commonly used in filmmaking, referring to the addition of daily sound effects to silent films or videos to enhance the auditory experience.
Video-to-Audio (V2A) presents inherent challenges related to audio-visual synchronization.
We construct a controllable video-to-audio model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals.
arXiv Detail & Related papers (2024-09-10T01:07:20Z) - FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds [14.636030346325578]
We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience.
We propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation.
One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents.
arXiv Detail & Related papers (2024-07-01T17:35:56Z) - Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z) - Audio-Synchronized Visual Animation [20.587868119296395]
We introduce Audio Synchronized Visual Animation (ASVA), a task animating a static image to demonstrate motion dynamics.
We present AVSync15, a dataset curated from VGGSound with videos featuring synchronized audio visual events across 15 categories.
We also present a diffusion model, AVSyncD, capable of generating dynamic animations guided by audios.
arXiv Detail & Related papers (2024-03-08T20:17:34Z) - Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model, and training that decouples extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z) - Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model
Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z) - Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion
Models [12.898486592791604]
We present Diff-Foley, a synchronized Video-to-Audio synthesis method with a latent diffusion model (LDM)
We show Diff-Foley achieves state-of-the-art V2A performance on current large scale V2A dataset.
arXiv Detail & Related papers (2023-06-29T12:39:58Z) - MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and
Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z) - VarietySound: Timbre-Controllable Video to Sound Generation via
Unsupervised Information Disentanglement [68.42632589736881]
We pose the task of generating sound with a specific timbre given a video input and a reference audio sample.
To solve this task, we disentangle each target sound audio into three components: temporal information, acoustic information, and background information.
Our method can generate high-quality audio samples with good synchronization with events in video and high timbre similarity with the reference audio.
arXiv Detail & Related papers (2022-11-19T11:12:01Z) - Sparse in Space and Time: Audio-visual Synchronisation with Trainable
Selectors [103.21152156339484]
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'
We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs'selectors'
We identify artefacts that can arise from the compression codecs used for audio and video and can be used by audio-visual models in training to artificially solve the synchronisation task.
arXiv Detail & Related papers (2022-10-13T14:25:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.