T-FOLEY: A Controllable Waveform-Domain Diffusion Model for
Temporal-Event-Guided Foley Sound Synthesis
- URL: http://arxiv.org/abs/2401.09294v1
- Date: Wed, 17 Jan 2024 15:54:36 GMT
- Title: T-FOLEY: A Controllable Waveform-Domain Diffusion Model for
Temporal-Event-Guided Foley Sound Synthesis
- Authors: Yoonjin Chung, Junwon Lee, Juhan Nam
- Abstract summary: We present T-Foley, a Temporal-event-guided waveform generation model for Foley sound synthesis.
T-Foley generates high-quality audio using two conditions: the sound class and temporal event feature.
T-Foley achieves superior performance in both objective and subjective evaluation metrics.
- Score: 7.529080653700932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foley sound, audio content inserted synchronously with videos, plays a
critical role in the user experience of multimedia content. Recently, there has
been active research in Foley sound synthesis, leveraging the advancements in
deep generative models. However, such works mainly focus on replicating a
single sound class or a textual sound description, neglecting temporal
information, which is crucial in the practical applications of Foley sound. We
present T-Foley, a Temporal-event-guided waveform generation model for Foley
sound synthesis. T-Foley generates high-quality audio using two conditions: the
sound class and temporal event feature. For temporal conditioning, we devise a
temporal event feature and a novel conditioning technique named Block-FiLM.
T-Foley achieves superior performance in both objective and subjective
evaluation metrics and generates Foley sound well-synchronized with the
temporal events. Additionally, we showcase T-Foley's practical applications,
particularly in scenarios involving vocal mimicry for temporal event control.
We show the demo on our companion website.
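The abstract names two ingredients, a temporal event feature and a conditioning technique called Block-FiLM, without detailing their form. The sketch below is a hypothetical PyTorch illustration, assuming Block-FiLM applies FiLM-style affine modulation (a scale and shift predicted from a condition embedding) once per network block of a waveform model; the class name, shapes, and sizes are assumptions for illustration, not taken from the paper.

```python
# Hypothetical sketch of FiLM-style block-wise conditioning (not the authors' code).
# Assumption: each block's activations h are modulated as h <- (1 + gamma(c)) * h + beta(c),
# where c fuses the sound-class embedding and a temporal event feature.
import torch
import torch.nn as nn


class BlockFiLM(nn.Module):
    """Predicts one affine modulation (scale, shift) per block from a condition vector."""

    def __init__(self, cond_dim: int, channels: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * channels)

    def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # h:    (batch, channels, time)  activations of one block
        # cond: (batch, cond_dim)        fused class / temporal-event condition
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return h * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)


# Minimal usage example with made-up sizes.
film = BlockFiLM(cond_dim=128, channels=64)
h = torch.randn(2, 64, 16000)   # block activations over a 1 s waveform at 16 kHz
cond = torch.randn(2, 128)
out = film(h, cond)             # same shape as h
```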
Related papers
- Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance [20.673800900456467]
We propose Smooth-Foley, a V2A generative model that takes semantic guidance from the textual label throughout generation to enhance both semantic and temporal alignment of the audio.
A frame adapter integrates high-resolution frame-wise video features while a temporal adapter integrates temporal conditions obtained from similarities of visual frames and textual labels.
Results show that Smooth-Foley performs better than existing models on both continuous sound scenarios and general scenarios.
arXiv Detail & Related papers (2024-12-24T04:29:46Z) - AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation [49.6922496382879]
AV-Link is a unified framework for Video-to-Audio and Audio-to-Video generation.
We propose a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models.
We evaluate our design choices and demonstrate the ability of our method to achieve synchronized and high-quality audiovisual content.
arXiv Detail & Related papers (2024-12-19T18:57:21Z) - Stable-V2A: Synthesis of Synchronized Sound Effects with Temporal and Semantic Controls [11.796771978828403]
Sound designers and Foley artists usually sonorize a scene, such as from a movie or video game, by manually annotating and sonorizing each action of interest in the video.
We present a two-stage model consisting of: an RMS-Mapper that estimates an envelope representative of the audio characteristics associated with the input video; and Stable-Foley, a diffusion model that generates audio semantically and temporally aligned with the target video.
arXiv Detail & Related papers (2024-12-19T16:37:19Z) - Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound [6.638504164134713]
Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically.
Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges.
We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as a temporal event condition with semantic timbre prompts.
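RMS here denotes a frame-wise root-mean-square energy envelope of the audio, which acts as a frame-level proxy for when sound events occur. The snippet below is an illustrative sketch of computing such an envelope, not code from the paper; the frame and hop sizes are arbitrary assumptions.

```python
# Minimal sketch: frame-wise RMS envelope of a mono waveform as a temporal event condition.
# Frame/hop sizes are illustrative assumptions, not values taken from Video-Foley.
import numpy as np


def rms_envelope(wav: np.ndarray, frame: int = 1024, hop: int = 256) -> np.ndarray:
    """Return one RMS value per hop-spaced frame of a mono waveform."""
    n_frames = 1 + max(0, len(wav) - frame) // hop
    rms = np.empty(n_frames)
    for i in range(n_frames):
        seg = wav[i * hop : i * hop + frame]
        rms[i] = np.sqrt(np.mean(seg ** 2))
    return rms


# Usage: a 1 s, 440 Hz tone at 16 kHz yields a roughly constant envelope.
sr = 16000
t = np.arange(sr) / sr
envelope = rms_envelope(0.5 * np.sin(2 * np.pi * 440 * t))
```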
arXiv Detail & Related papers (2024-08-21T18:06:15Z) - FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds [14.636030346325578]
We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience.
We propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation.
One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents.
arXiv Detail & Related papers (2024-07-01T17:35:56Z) - Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important part of content creation in the music and film industries.
We hypothesize that focusing on the sound events present in the generated audio and their temporal ordering could improve audio generation performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z) - DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z) - From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z) - Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
arXiv Detail & Related papers (2023-02-15T15:00:55Z) - FoleyGAN: Visually Guided Generative Adversarial Network-Based
Synchronous Sound Generation in Silent Videos [0.0]
We introduce the novel task of guiding a class-conditioned generative adversarial network with the temporal visual information of a video input for visual-to-sound generation.
Our proposed FoleyGAN model can condition on action sequences of visual events to generate visually aligned, realistic soundtracks.
arXiv Detail & Related papers (2021-07-20T04:59:26Z) - Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)