FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
- URL: http://arxiv.org/abs/2407.01494v1
- Date: Mon, 1 Jul 2024 17:35:56 GMT
- Title: FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
- Authors: Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Kai Chen
- Abstract summary: We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience.
We propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation.
One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents.
- Score: 14.636030346325578
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience. Despite its wide range of applications, existing approaches encounter limitations when it comes to simultaneously synthesizing high-quality and video-aligned (i.e., semantically relevant and temporally synchronized) sounds. To overcome these limitations, we propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation. FoleyCrafter comprises two key components: the semantic adapter for semantic alignment and the temporal controller for precise audio-video synchronization. The semantic adapter utilizes parallel cross-attention layers to condition audio generation on video features, producing realistic sound effects that are semantically relevant to the visual content. Meanwhile, the temporal controller incorporates an onset detector and a timestamp-based adapter to achieve precise audio-video alignment. One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents. We conduct extensive quantitative and qualitative experiments on standard benchmarks to verify the effectiveness of FoleyCrafter. Models and codes are available at https://github.com/open-mmlab/FoleyCrafter.
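The semantic adapter's parallel cross-attention can be illustrated with a minimal NumPy sketch: a video-conditioned attention branch runs alongside the base text cross-attention over the same audio latents, and the two outputs are summed. All names, dimensions, and the summation scale here are hypothetical illustrations of the general adapter-style conditioning pattern, not FoleyCrafter's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # scaled dot-product attention: queries attend over key/value features
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def parallel_adapter(audio_latents, text_feats, video_feats, scale=1.0):
    # base text cross-attention branch (the frozen text-to-audio pathway)
    base = cross_attention(audio_latents, text_feats, text_feats)
    # parallel branch attending over video features (the adapter pathway);
    # `scale` weights how strongly video conditioning is mixed in
    video = cross_attention(audio_latents, video_feats, video_feats)
    return base + scale * video

# hypothetical shapes: 8 audio latent tokens, 4 text tokens, 6 video tokens
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64))
t = rng.standard_normal((4, 64))
v = rng.standard_normal((6, 64))
out = parallel_adapter(q, t, v, scale=1.0)
```

With `scale=0.0` the adapter branch vanishes and the output reduces to the base text cross-attention, which is why this parallel design preserves the pre-trained model's behavior when video conditioning is disabled.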
Related papers
- Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity [12.848371604063168]
We propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model.
Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results.
arXiv Detail & Related papers (2024-07-15T01:49:59Z)
- Text-to-Audio Generation Synchronized with Videos [44.848393652233796]
We introduce a groundbreaking benchmark for Text-to-Audio generation that aligns with Videos, named T2AV-Bench.
We also present a simple yet effective video-aligned TTA generation model, namely T2AV.
It employs a temporal multi-head attention transformer to extract and understand temporal nuances from video data, a feat amplified by our Audio-Visual ControlNet.
arXiv Detail & Related papers (2024-03-08T22:27:38Z)
- SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis [9.118448725265669]
One of the most time-consuming steps when designing sound is synchronizing audio with video.
In video games and animations, no reference audio exists, requiring manual annotation of event timings from the video.
We propose a system to extract repetitive actions onsets from a video, which are then used to condition a diffusion model trained to generate a new synchronized sound effects audio track.
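SyncFusion conditions a diffusion model on onset timings extracted from video. As a rough, purely illustrative analogue of turning onsets into a conditioning track, the sketch below computes an energy-based onset envelope and thresholds it into binary events; this is a generic audio-domain stand-in, not SyncFusion's video-based onset extractor:

```python
import numpy as np

def onset_envelope(signal, frame_len=256, hop=128):
    # frame-wise energy of the signal
    n_frames = 1 + (len(signal) - frame_len) // hop
    energy = np.array([
        np.sum(signal[i * hop : i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    # half-wave rectified first difference: rises in energy mark onsets
    flux = np.maximum(np.diff(energy, prepend=energy[0]), 0.0)
    return flux / (flux.max() + 1e-9)

def onset_events(envelope, threshold=0.5):
    # binary onset track, usable as a per-frame conditioning signal
    return (envelope >= threshold).astype(np.float32)

# toy signal: silence with a single burst starting at sample 2048
sig = np.zeros(4096)
sig[2048:2048 + 256] = 1.0
env = onset_envelope(sig)
events = onset_events(env)
```

In an onset-conditioned generator, a track like `events` would be fed alongside the noisy latents so the model places sound effects at the marked frames.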
arXiv Detail & Related papers (2023-10-23T18:01:36Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation [43.562848631392384]
Audio-visual video segmentation aims to generate pixel-level maps of sound-producing objects within image frames.
We propose a decoupled audio-video dependence combining audio and video features from their respective temporal and spatial dimensions.
arXiv Detail & Related papers (2023-09-18T12:24:02Z)
- The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion [23.398304611826642]
We propose The Power of Sound model to incorporate audio input that includes both changeable temporal semantics and magnitude.
To generate video frames, TPoS utilizes a latent stable diffusion model with semantic information, which is then guided by the sequential audio embedding.
We demonstrate the effectiveness of TPoS across various tasks and compare its results with current state-of-the-art techniques in the field of audio-to-video generation.
arXiv Detail & Related papers (2023-09-08T12:21:01Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has achieved milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement [68.42632589736881]
We pose the task of generating sound with a specific timbre given a video input and a reference audio sample.
To solve this task, we disentangle each target sound audio into three components: temporal information, acoustic information, and background information.
Our method can generate high-quality audio samples with good synchronization with events in video and high timbre similarity with the reference audio.
arXiv Detail & Related papers (2022-11-19T11:12:01Z)
- Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors [103.21152156339484]
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'.
We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors'.
We identify artefacts that can arise from the compression codecs used for audio and video, and that can be exploited by audio-visual models during training to artificially solve the synchronisation task.
arXiv Detail & Related papers (2022-10-13T14:25:37Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- End to End Lip Synchronization with a Temporal AutoEncoder [95.94432031144716]
We study the problem of syncing the lip movement in a video with the audio stream.
Our solution finds an optimal alignment using a dual-domain recurrent neural network.
As an application, we demonstrate our ability to robustly align text-to-speech generated audio with an existing video stream.
arXiv Detail & Related papers (2022-03-30T12:00:18Z)
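The lip-synchronization paper above finds an optimal alignment with a dual-domain recurrent network. As a classical stand-in for the notion of "optimal alignment" between two feature tracks (not the paper's method), dynamic time warping can be sketched as follows:

```python
import numpy as np

def dtw_align(a, b):
    # dynamic time warping between two 1-D feature tracks: returns the
    # minimal cumulative alignment cost and the warping path
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # each cell extends the cheapest of match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack the warping path from the end to the start
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

track = np.array([0.0, 1.0, 2.0, 3.0])
cost, path = dtw_align(track, track)
```

Aligning a track with itself yields zero cost and a diagonal path; in audio-visual synchronisation, `a` and `b` would instead be audio and video feature sequences, and a learned model replaces the hand-crafted per-frame cost.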
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.