FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
- URL: http://arxiv.org/abs/2407.01494v1
- Date: Mon, 1 Jul 2024 17:35:56 GMT
- Title: FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
- Authors: Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Kai Chen
- Abstract summary: We study Neural Foley, the automatic generation of high-quality sound effects synchronized with videos, enabling an immersive audio-visual experience.
We propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation.
One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents.
- Score: 14.636030346325578
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study Neural Foley, the automatic generation of high-quality sound effects synchronized with videos, enabling an immersive audio-visual experience. Despite its wide range of applications, existing approaches encounter limitations when it comes to simultaneously synthesizing high-quality and video-aligned (i.e., semantically relevant and temporally synchronized) sounds. To overcome these limitations, we propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation. FoleyCrafter comprises two key components: a semantic adapter for semantic alignment and a temporal controller for precise audio-video synchronization. The semantic adapter utilizes parallel cross-attention layers to condition audio generation on video features, producing realistic sound effects that are semantically relevant to the visual content. Meanwhile, the temporal controller incorporates an onset detector and a timestamp-based adapter to achieve precise audio-video alignment. One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents. We conduct extensive quantitative and qualitative experiments on standard benchmarks to verify the effectiveness of FoleyCrafter. Models and code are available at https://github.com/open-mmlab/FoleyCrafter.
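The semantic adapter described above conditions a pre-trained text-to-audio model on video features through cross-attention layers that run in parallel with the existing text cross-attention. Below is a minimal, hypothetical PyTorch sketch of that pattern; the class name, the residual blending, and all dimensions are illustrative assumptions, not FoleyCrafter's actual implementation.

```python
import torch
import torch.nn as nn

class ParallelCrossAttentionAdapter(nn.Module):
    """Hypothetical sketch of a parallel cross-attention adapter: a trainable
    video-conditioned attention branch added next to a frozen text branch."""

    def __init__(self, dim: int, video_dim: int, num_heads: int = 8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)   # frozen, from the T2A model
        self.video_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # trainable adapter branch
        self.video_proj = nn.Linear(video_dim, dim)  # map video features into the audio latent width
        self.scale = 1.0  # blending weight for the video branch

    def forward(self, x, text_tokens, video_feats):
        # x: (B, L, dim) audio latents; text_tokens: (B, Lt, dim); video_feats: (B, Lv, video_dim)
        t, _ = self.text_attn(x, text_tokens, text_tokens)   # original text-conditioning path
        v = self.video_proj(video_feats)
        a, _ = self.video_attn(x, v, v)                      # parallel video-conditioning path
        return x + t + self.scale * a
```

In a setup like this, only the video branch is trained while the text-to-audio backbone stays frozen, so setting `scale` to 0 recovers plain text-conditioned generation and text prompts remain usable alongside the video condition.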
Related papers
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
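As a concrete (if simplistic) stand-in for such a timestamp detector, the sketch below extracts event timestamps from an audio track with a plain NumPy energy-based onset rule; the frame sizes and threshold are arbitrary assumptions, and the detectors these papers train are learned models operating on video, not hand-tuned audio heuristics.

```python
import numpy as np

def energy_onsets(audio: np.ndarray, sr: int, frame: int = 1024,
                  hop: int = 512, ratio: float = 1.5) -> np.ndarray:
    """Toy onset detector: flag frames whose short-time energy jumps by
    `ratio` over the previous frame; return onset times in seconds."""
    n = 1 + max(0, len(audio) - frame) // hop
    energy = np.array([np.mean(audio[i*hop : i*hop+frame] ** 2) for i in range(n)])
    jumps = [i for i in range(1, n) if energy[i] > ratio * energy[i - 1] + 1e-10]
    return np.asarray(jumps) * hop / sr
```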
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis [28.172213291270868]
Foley is a term commonly used in filmmaking, referring to the addition of everyday sound effects to silent films or videos to enhance the auditory experience.
Video-to-Audio (V2A) presents inherent challenges related to audio-visual synchronization.
We construct a controllable video-to-audio model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals.
arXiv Detail & Related papers (2024-09-10T01:07:20Z)
- Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound [6.638504164134713]
Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically.
Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges.
We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as a temporal event condition with semantic timbre prompts.
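To make the RMS condition concrete, here is a rough NumPy sketch of a frame-wise RMS envelope of the kind such a system could consume; the frame and hop lengths are illustrative assumptions, and Video-Foley's actual feature pipeline may differ (librosa.feature.rms computes an equivalent curve off the shelf).

```python
import numpy as np

def rms_envelope(audio: np.ndarray, frame: int = 2048, hop: int = 512) -> np.ndarray:
    """Frame-wise root-mean-square energy of a mono waveform, usable as a
    per-frame loudness/event curve for conditioning a sound generator."""
    n = 1 + max(0, len(audio) - frame) // hop
    return np.sqrt(np.array([np.mean(audio[i*hop : i*hop+frame] ** 2)
                             for i in range(n)]))
```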
arXiv Detail & Related papers (2024-08-21T18:06:15Z)
- Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity [12.848371604063168]
We propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model.
Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results.
arXiv Detail & Related papers (2024-07-15T01:49:59Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation [43.562848631392384]
Audio-visual video segmentation aims to generate pixel-level maps of sound-producing objects within image frames.
We propose decoupled audio-video dependence modeling that combines audio and video features along their respective temporal and spatial dimensions.
arXiv Detail & Related papers (2023-09-18T12:24:02Z)
- VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement [68.42632589736881]
We pose the task of generating sound with a specific timbre given a video input and a reference audio sample.
To solve this task, we disentangle each target sound audio into three components: temporal information, acoustic information, and background information.
Our method can generate high-quality audio samples with good synchronization with events in video and high timbre similarity with the reference audio.
arXiv Detail & Related papers (2022-11-19T11:12:01Z)
- Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors [103.21152156339484]
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'.
We make four contributions: (i) in order to handle the longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors'.
We identify artefacts, introduced by the compression codecs used for audio and video, that audio-visual models can exploit during training to artificially solve the synchronisation task.
arXiv Detail & Related papers (2022-10-13T14:25:37Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- End to End Lip Synchronization with a Temporal AutoEncoder [95.94432031144716]
We study the problem of syncing the lip movement in a video with the audio stream.
Our solution finds an optimal alignment using a dual-domain recurrent neural network.
As an application, we demonstrate our ability to robustly align text-to-speech generated audio with an existing video stream.
arXiv Detail & Related papers (2022-03-30T12:00:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.