Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound
- URL: http://arxiv.org/abs/2408.11915v1
- Date: Wed, 21 Aug 2024 18:06:15 GMT
- Title: Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound
- Authors: Junwon Lee, Jaekwon Im, Dabin Kim, Juhan Nam,
- Abstract summary: Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically.
Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges.
We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as a temporal event condition with semantic timbre prompts.
- Score: 6.638504164134713
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically. Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges. Systems lacking explicit temporal features suffer from poor controllability and alignment, while timestamp-based models require costly and subjective human annotation. We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as a temporal event condition with semantic timbre prompts (audio or text). RMS, a frame-level intensity envelope feature closely related to audio semantics, ensures high controllability and synchronization. The annotation-free self-supervised learning framework consists of two stages, Video2RMS and RMS2Sound, incorporating novel ideas including RMS discretization and RMS-ControlNet with a pretrained text-to-audio model. Our extensive evaluation shows that Video-Foley achieves state-of-the-art performance in audio-visual alignment and controllability for sound timing, intensity, timbre, and nuance. Code, model weights, and demonstrations are available on the accompanying website. (https://jnwnlee.github.io/video-foley-demo)
Related papers
- Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance [20.673800900456467]
We propose Smooth-Foley, a V2A generative model taking semantic guidance from the textual label across the generation to enhance both semantic and temporal alignment in audio.
A frame adapter integrates high-resolution frame-wise video features while a temporal adapter integrates temporal conditions obtained from similarities of visual frames and textual labels.
Results show that Smooth-Foley performs better than existing models on both continuous sound scenarios and general scenarios.
arXiv Detail & Related papers (2024-12-24T04:29:46Z) - Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis [56.01110988816489]
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio.
MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples.
MMAudio achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance.
arXiv Detail & Related papers (2024-12-19T18:59:55Z) - Stable-V2A: Synthesis of Synchronized Sound Effects with Temporal and Semantic Controls [11.796771978828403]
Sound designers and Foley artists usually sonorize a scene, such as from a movie or video game, by manually annotating and sonorizing each action of interest in the video.
We present a two-stage model consisting of: an RMS-Mapper that estimates an envelope representative of the audio characteristics associated with the input video; and Stable-Foley, a diffusion model that generates audio semantically and temporally aligned with the target video.
arXiv Detail & Related papers (2024-12-19T16:37:19Z) - YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls [10.429203168607147]
YingSound is a foundation model designed for video-guided sound generation.
It supports high-quality audio generation in few-shot settings.
We show that YingSound effectively generates high-quality synchronized sounds through automated evaluations and human studies.
arXiv Detail & Related papers (2024-12-12T10:55:57Z) - Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z) - Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis [28.172213291270868]
Foley is a term commonly used in filmmaking, referring to the addition of daily sound effects to silent films or videos to enhance the auditory experience.
Video-to-Audio (V2A) presents inherent challenges related to audio-visual synchronization.
We construct a controllable video-to-audio model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals.
arXiv Detail & Related papers (2024-09-10T01:07:20Z) - FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds [14.636030346325578]
We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience.
We propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation.
One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents.
arXiv Detail & Related papers (2024-07-01T17:35:56Z) - Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z) - Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model
Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z) - MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and
Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z) - Generating Visually Aligned Sound from Videos [83.89485254543888]
We focus on the task of generating sound from natural videos.
The sound should be both temporally and content-wise aligned with visual signals.
Some sounds generated outside of a camera can not be inferred from video content.
arXiv Detail & Related papers (2020-07-14T07:51:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.