Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?
- URL: http://arxiv.org/abs/2406.08800v2
- Date: Thu, 29 Aug 2024 09:27:45 GMT
- Title: Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?
- Authors: Tiantian Feng, Dimitrios Dimitriadis, Shrikanth Narayanan
- Abstract summary: We aim to evaluate the quality of audio generation by examining the effectiveness of using the generated audio as training data.
Specifically, we conduct studies to explore the use of synthetic audio for audio recognition.
We also investigate whether synthetic audio can serve as a resource for data augmentation in speech-related modeling.
- Score: 40.3708221702947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in foundation models have enabled audio-generative models that produce high-fidelity sounds associated with music, events, and human actions. Despite the success achieved by modern audio-generative models, the conventional approach to assessing the quality of the audio generation relies heavily on distance metrics like Fréchet Audio Distance. In contrast, we aim to evaluate the quality of audio generation by examining the effectiveness of using the generated audio as training data. Specifically, we conduct studies to explore the use of synthetic audio for audio recognition. Moreover, we investigate whether synthetic audio can serve as a resource for data augmentation in speech-related modeling. Our comprehensive experiments demonstrate the potential of using synthetic audio for audio recognition and speech-related modeling. Our code is available at https://github.com/usc-sail/SynthAudio.
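For context on the distance metric the abstract contrasts against: Fréchet Audio Distance fits a Gaussian to embeddings of real audio and another to embeddings of generated audio, then measures the Fréchet distance between the two Gaussians. The sketch below is a minimal illustration of that distance only, not the paper's evaluation code. It assumes the embeddings have already been extracted by some pretrained audio model, and the function name `frechet_distance` is hypothetical.

```python
import numpy as np


def frechet_distance(real_emb, synth_emb):
    """Frechet distance between Gaussians fitted to two embedding sets.

    Each input is an array of shape (num_clips, embedding_dim).
    Embedding extraction is assumed to have happened elsewhere.
    """
    real_emb = np.asarray(real_emb, dtype=np.float64)
    synth_emb = np.asarray(synth_emb, dtype=np.float64)

    # Fit a Gaussian (mean and covariance) to each set of embeddings.
    mu_r, mu_s = real_emb.mean(axis=0), synth_emb.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(real_emb, rowvar=False))
    cov_s = np.atleast_2d(np.cov(synth_emb, rowvar=False))

    # Tr((cov_r cov_s)^(1/2)) is computed as the sum of square roots of the
    # eigenvalues of cov_r @ cov_s; tiny negative values from floating-point
    # error are clipped to zero.
    eigvals = np.linalg.eigvals(cov_r @ cov_s)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = np.sum((mu_r - mu_s) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_s) - 2.0 * trace_sqrt)
```

A lower value means the two embedding distributions are closer. The abstract's point is that such a score does not by itself show whether the generated audio is useful as training data.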
Related papers
- EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos [3.6078215038168473]
We introduce EgoSonics, a method to generate semantically meaningful and synchronized audio tracks conditioned on silent egocentric videos.
Generating audio for silent egocentric videos could open new applications in virtual reality, assistive technologies, or for augmenting existing datasets.
arXiv Detail & Related papers (2024-07-30T06:57:00Z)
- Audiobox: Unified Audio Generation with Natural Language Prompts [37.39834044113061]
This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities.
We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms.
Audiobox sets new benchmarks on speech and sound generation and unlocks new methods for generating audio with novel vocal and acoustic styles.
arXiv Detail & Related papers (2023-12-25T22:24:49Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models [25.966328901566815]
We propose HiddenSinger, a high-quality singing voice synthesis system using a neural audio codec and latent diffusion models.
In addition, our proposed model is extended to an unsupervised singing voice learning framework, HiddenSinger-U, to train the model.
Experimental results demonstrate that our model outperforms previous models in terms of audio quality.
arXiv Detail & Related papers (2023-06-12T01:21:41Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data [9.072124914105325]
We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings.
Experiments on the large scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model.
arXiv Detail & Related papers (2020-05-29T01:30:14Z)