FALL-E: A Foley Sound Synthesis Model and Strategies
- URL: http://arxiv.org/abs/2306.09807v2
- Date: Thu, 10 Aug 2023 05:13:33 GMT
- Title: FALL-E: A Foley Sound Synthesis Model and Strategies
- Authors: Minsung Kang, Sangshin Oh, Hyeongi Moon, Kyungyun Lee, Ben Sangbae Chon
- Abstract summary: The FALL-E model employs a cascaded approach comprising low-resolution spectrogram generation, spectrogram super-resolution, and a vocoder.
We conditioned the model with dataset-specific texts, enabling it to learn sound quality and recording environment based on text input.
- Score: 0.5599792629509229
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces FALL-E, a foley synthesis system and its
training/inference strategies. The FALL-E model employs a cascaded approach
comprising low-resolution spectrogram generation, spectrogram super-resolution,
and a vocoder. We trained every sound-related model from scratch using our
extensive datasets, and utilized a pre-trained language model. We conditioned
the model with dataset-specific texts, enabling it to learn sound quality and
recording environment based on text input. Moreover, we leveraged external
language models to improve text descriptions of our datasets and performed
prompt engineering for quality, coherence, and diversity. FALL-E was evaluated
by an objective measure as well as listening tests in the DCASE 2023 challenge
Task 7. The submission achieved second place on average, with the best score
for diversity, second place for audio quality, and third place for class
fitness.
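
The cascaded pipeline described in the abstract can be pictured with a minimal sketch. The module names, layer choices, and dimensions below are illustrative assumptions, not the authors' implementation; only the three-stage structure (text-conditioned low-resolution spectrogram generation, spectrogram super-resolution, vocoder) comes from the paper.

```python
# Hypothetical sketch of FALL-E's cascaded inference flow:
# text embedding -> low-res spectrogram -> spectrogram super-resolution -> vocoder.
# All internals are toy placeholders, not the authors' code.
import torch
import torch.nn as nn

class LowResSpectrogramGenerator(nn.Module):
    """Placeholder: maps a text embedding to a coarse mel spectrogram."""
    def __init__(self, text_dim=512, n_mels=20, frames=64):
        super().__init__()
        self.proj = nn.Linear(text_dim, n_mels * frames)
        self.n_mels, self.frames = n_mels, frames

    def forward(self, text_emb):
        return self.proj(text_emb).view(-1, self.n_mels, self.frames)

class SpectrogramSuperResolution(nn.Module):
    """Placeholder: upsamples the coarse spectrogram along mel and time axes."""
    def __init__(self, scale=4):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)
        self.refine = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, low_res):
        x = self.up(low_res.unsqueeze(1))
        return self.refine(x).squeeze(1)

class Vocoder(nn.Module):
    """Placeholder: converts the high-resolution spectrogram to a waveform."""
    def __init__(self, hop=256):
        super().__init__()
        self.hop = hop

    def forward(self, spec):
        # Toy synthesis: one block of samples per frame (a real vocoder is learned).
        return spec.mean(dim=1, keepdim=True).repeat_interleave(self.hop, dim=-1)

def synthesize(text_emb):
    """Run the cascade stage by stage."""
    low_res = LowResSpectrogramGenerator()(text_emb)   # stage 1
    high_res = SpectrogramSuperResolution()(low_res)   # stage 2
    waveform = Vocoder()(high_res)                     # stage 3
    return waveform

if __name__ == "__main__":
    # A 512-dim embedding standing in for the pre-trained language model's output
    # on a dataset-specific prompt, e.g. "dog barking, studio-quality recording".
    dummy_text_embedding = torch.randn(1, 512)
    audio = synthesize(dummy_text_embedding)
    print(audio.shape)  # torch.Size([1, 1, 65536]) with the toy dimensions above
```

In the described system, the text embedding would carry the dataset-specific conditioning (sound quality, recording environment) that the abstract refers to; here it is simply a random tensor.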
Related papers
- Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [50.23246260804145]
We introduce Nexus-O, an industry-level omni-perceptive and -interactive model capable of efficiently processing Audio, Image, Video, and Text data.
We address three key research questions: First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding and reasoning capabilities across multiple modalities?
Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios?
Third, what strategies can be employed to curate and obtain high-quality, real-life scenario
arXiv Detail & Related papers (2025-02-26T17:26:36Z)
- ETTA: Elucidating the Design Space of Text-to-Audio Models [33.831803213869605]
We study the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks.
We propose our best model dubbed Elucidated Text-To-Audio (ETTA)
ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data.
arXiv Detail & Related papers (2024-12-26T21:13:12Z)
- Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization.
To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z)
- Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs [3.8300818830608345]
Multi-modal contrastive learning strategies for audio and text have rapidly gained interest.
The ability of these models to understand natural language and temporal relations is still a largely unexplored and open field for research.
We propose to equip multi-modal ALMs with temporal understanding, without losing their inherent prior capabilities on audio-language tasks, via a temporal instillation method, TeminAL.
arXiv Detail & Related papers (2024-08-17T18:53:17Z)
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together.
C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities.
Our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation together into one unified model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Text-Driven Foley Sound Generation With Latent Diffusion Model [33.4636070590045]
Foley sound generation aims to synthesise the background sound for multimedia content.
We propose a diffusion model based system for Foley sound generation with text conditions.
arXiv Detail & Related papers (2023-06-17T14:16:24Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z)
- Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval [11.161404854726348]
We present an analysis of large-scale pretrained deep learning models used for cross-modal (text-to-audio) retrieval.
We use embeddings extracted by these models in a metric learning framework to connect matching pairs of audio and text.
arXiv Detail & Related papers (2022-10-06T11:45:14Z)
- Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data [24.608764078208953]
Subgraphs are constructed by sampling the entire pool of available training data to exploit the relationship between labeled and unlabeled audio samples.
We evaluate our model on three benchmark audio databases, and two tasks: acoustic event detection and speech emotion recognition.
Our model is compact (240k parameters), and can produce generalized audio representations that are robust to different types of signal noise.
arXiv Detail & Related papers (2022-01-31T21:32:22Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.