A Proposal for Foley Sound Synthesis Challenge
- URL: http://arxiv.org/abs/2207.10760v1
- Date: Thu, 21 Jul 2022 21:19:07 GMT
- Title: A Proposal for Foley Sound Synthesis Challenge
- Authors: Keunwoo Choi, Sangshin Oh, Minsung Kang, Brian McFee
- Abstract summary: "Foley" refers to sound effects that are added to multimedia during post-production to enhance its perceived acoustic properties.
We propose a challenge for automatic foley synthesis.
- Score: 7.469200949273274
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: "Foley" refers to sound effects that are added to multimedia during
post-production to enhance its perceived acoustic properties, e.g., by
simulating the sounds of footsteps, ambient environmental sounds, or visible
objects on the screen. While foley is traditionally produced by foley artists,
there is increasing interest in automatic or machine-assisted techniques
building upon recent advances in sound synthesis and generative models. To
foster more participation in this growing research area, we propose a challenge
for automatic foley synthesis. Through case studies on successful previous
challenges in audio and machine learning, we set the goals of the proposed
challenge: rigorous, unified, and efficient evaluation of different foley
synthesis systems, with an overarching goal of drawing active participation
from the research community. We outline the details and design considerations
of a foley sound synthesis challenge, including task definition, dataset
requirements, and evaluation criteria.
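The proposal itself does not fix an evaluation implementation, but the related DCASE sound scene synthesis challenges listed below pair subjective listening tests with the Fréchet Audio Distance (FAD) as the objective metric. The following is a minimal, hypothetical sketch of such an objective evaluation step, assuming clip embeddings have already been extracted with some pretrained audio model (e.g., VGGish); the function names and the per-class aggregation are illustrative and not part of the proposal.

```python
# Hypothetical sketch of an objective evaluation step for a foley synthesis
# challenge: Frechet Audio Distance (FAD) between reference and generated
# clips, computed on precomputed audio embeddings of shape (n_clips, dim).
import numpy as np
from scipy import linalg


def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """FAD between two embedding sets, each of shape (n_clips, dim)."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)

    # Matrix square root of the product of covariances; drop the tiny
    # imaginary components introduced by numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))


def evaluate_per_class(embeddings_by_class: dict) -> dict:
    """Per-class FAD, given {sound_class: (ref_emb, gen_emb)}."""
    return {
        cls: frechet_audio_distance(ref, gen)
        for cls, (ref, gen) in embeddings_by_class.items()
    }
```

Lower FAD indicates that the distribution of generated clips sits closer to the reference distribution; a challenge along these lines could report per-class scores and their mean.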
Related papers
- Sound Scene Synthesis at the DCASE 2024 Challenge [8.170174172545831]
This paper presents Task 7 at the DCASE 2024 Challenge: sound scene synthesis.
Recent advances in sound synthesis and generative models have enabled the creation of realistic and diverse audio content.
We introduce a standardized evaluation framework for comparing different sound scene synthesis systems, incorporating both objective and subjective metrics.
arXiv Detail & Related papers (2025-01-15T05:15:54Z)
- Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation [8.170174172545831]
This paper addresses the evaluation of text-to-audio generation through the Sound Scene Synthesis challenge held as part of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge.
We present an evaluation protocol that combines an objective metric, the Fréchet Audio Distance, with perceptual assessments, using a structured prompt format to enable diverse captions and effective evaluation.
arXiv Detail & Related papers (2024-10-23T06:35:41Z)
- Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information.
These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z)
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It aims to provide a comprehensive evaluation framework for distinguishing cutting-edge AI-synthesized audio content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
- T-FOLEY: A Controllable Waveform-Domain Diffusion Model for Temporal-Event-Guided Foley Sound Synthesis [7.529080653700932]
We present T-Foley, a Temporal-event-guided waveform generation model for Foley sound synthesis.
T-Foley generates high-quality audio using two conditions: the sound class and a temporal event feature; one plausible form of the latter is sketched after this list.
T-Foley achieves superior performance in both objective and subjective evaluation metrics.
arXiv Detail & Related papers (2024-01-17T15:54:36Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Novel-View Acoustic Synthesis [140.1107768313269]
We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint?
We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space.
arXiv Detail & Related papers (2023-01-20T18:49:58Z)
- FoleyGAN: Visually Guided Generative Adversarial Network-Based Synchronous Sound Generation in Silent Videos [0.0]
We introduce a novel task of guiding a class-conditioned generative adversarial network with the temporal visual information of a video input for visual-to-sound generation.
The proposed FoleyGAN model conditions on action sequences of visual events to generate visually aligned, realistic soundtracks.
arXiv Detail & Related papers (2021-07-20T04:59:26Z)
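As referenced in the T-FOLEY entry above, that model conditions waveform generation on a sound class and a "temporal event feature". The abstract does not define the feature here, so the sketch below shows one plausible, purely illustrative choice: a frame-level RMS energy envelope of a reference clip, with arbitrary frame and hop sizes that are not taken from the paper.

```python
# Hypothetical illustration of a "temporal event feature" of the kind T-Foley
# conditions on: a frame-level RMS energy envelope of a reference clip.
import numpy as np


def rms_envelope(audio: np.ndarray, frame_length: int = 1024,
                 hop_length: int = 512) -> np.ndarray:
    """Frame-wise RMS energy of a mono waveform with samples in [-1, 1]."""
    n_frames = 1 + max(0, len(audio) - frame_length) // hop_length
    env = np.empty(n_frames)
    for i in range(n_frames):
        frame = audio[i * hop_length: i * hop_length + frame_length]
        env[i] = np.sqrt(np.mean(frame ** 2))
    return env


# Example: a 1-second noise burst at 16 kHz gives an envelope that stays near
# zero outside the event and rises inside it.
sr = 16_000
audio = np.zeros(2 * sr)
audio[sr // 2: sr // 2 + sr] = 0.1 * np.random.randn(sr)
envelope = rms_envelope(audio)
```

A conditional waveform model could then be trained so that the envelope of its output follows this condition for the requested sound class.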