Learning Audio Concepts from Counterfactual Natural Language
- URL: http://arxiv.org/abs/2401.04935v1
- Date: Wed, 10 Jan 2024 05:15:09 GMT
- Title: Learning Audio Concepts from Counterfactual Natural Language
- Authors: Ali Vosoughi, Luca Bondi, Ho-Hsiang Wu, Chenliang Xu
- Abstract summary: This study introduces causal reasoning and counterfactual analysis in the audio domain.
Our model considers acoustic characteristics and sound source information from human-annotated reference texts.
Specifically, the top-1 accuracy on the open-ended language-based audio retrieval task increased by more than 43%.
- Score: 34.118579918018725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional audio classification has relied on predefined classes,
lacking the ability to learn from free-form text. Recent methods unlock
learning joint audio-text embeddings from raw audio-text pairs that describe
audio in natural language. Despite these advances, there has been little
exploration of systematic methods for training models to recognize sound
events and sources in alternative scenarios, such as distinguishing fireworks
from gunshots at otherwise similar outdoor events. This study introduces
causal reasoning and counterfactual analysis to the audio domain. We construct
counterfactual instances and incorporate them into our model across several
aspects. Our model considers acoustic characteristics and sound source
information from human-annotated reference texts. To validate its
effectiveness, we pre-trained on multiple audio captioning datasets and then
evaluated on several common downstream tasks, demonstrating the merits of the
proposed method as one of the first works to leverage counterfactual
information in the audio domain. Specifically, the top-1 accuracy on the
open-ended language-based audio retrieval task increased by more than 43%.
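The abstract describes the approach only at a high level; as one hedged illustration of how counterfactual captions could enter contrastive audio-text training, the sketch below builds an alternative-scenario caption (e.g., swapping "fireworks" for "gunshots") and scores it as a hard negative in an InfoNCE-style loss. All names are illustrative assumptions, not the paper's code.

```python
# Hedged sketch: build an alternative-scenario ("counterfactual") caption and
# use it as a hard negative in an InfoNCE-style audio-text contrastive loss.
# All names are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def counterfactual_caption(caption: str, swaps: dict) -> str:
    """Swap the sound-source word to build an alternative scenario, e.g.
    'fireworks at an outdoor event' -> 'gunshots at an outdoor event'."""
    for source, alternative in swaps.items():
        if source in caption:
            return caption.replace(source, alternative)
    return caption

def contrastive_loss(audio_emb, text_emb, cf_text_emb, temperature=0.07):
    """InfoNCE over a batch; each clip's counterfactual caption is appended
    as one extra hard-negative column."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    cf_text_emb = F.normalize(cf_text_emb, dim=-1)
    logits = audio_emb @ text_emb.T / temperature            # (B, B)
    cf_logits = (audio_emb * cf_text_emb).sum(-1, keepdim=True) / temperature
    logits = torch.cat([logits, cf_logits], dim=1)           # (B, B + 1)
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    return F.cross_entropy(logits, targets)
```

Appending the counterfactual column pushes the audio embedding away from captions that differ only in the stated sound source, which is the behavior the abstract's fireworks-versus-gunshots example calls for.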
Related papers
- Multimodal Input Aids a Bayesian Model of Phonetic Learning [0.6827423171182154]
We introduce a method for creating high-quality synthetic videos of speakers' faces for an existing audio corpus.
Our learning model, when both trained and tested on audiovisual inputs, achieves up to an 8.1% relative improvement on a phoneme discrimination battery.
Visual information is especially beneficial in noisy audio environments.
arXiv Detail & Related papers (2024-07-22T19:00:11Z)
- Natural language guidance of high-fidelity text-to-speech with synthetic annotations [13.642358232817342]
We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions.
We then apply this method to a 45k hour dataset, which we use to train a speech language model.
Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions.
arXiv Detail & Related papers (2024-02-02T21:29:34Z)
- AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining [46.22290575167155]
This paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation.
Our framework introduces a general representation of audio, called the "language of audio" (LOA).
arXiv Detail & Related papers (2023-08-10T17:55:13Z)
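As a schematic reading of the AudioLDM 2 summary above, the sketch below shows the general two-stage shape: a language model maps conditioning into "language of audio" (LOA) tokens, and a diffusion decoder generates audio from the LOA. Both modules are hypothetical stand-ins, not the released implementation.

```python
# Schematic sketch of the two-stage shape described above: a language model
# maps conditioning into LOA tokens; a diffusion decoder generates audio from
# the LOA. Both modules are hypothetical stand-ins.
import torch.nn as nn

class LOAPipelineSketch(nn.Module):
    def __init__(self, lm: nn.Module, diffusion: nn.Module):
        super().__init__()
        self.lm = lm                # GPT-style model predicting LOA tokens
        self.diffusion = diffusion  # diffusion decoder conditioned on LOA

    def forward(self, conditioning):
        loa = self.lm(conditioning)  # text/audio conditioning -> LOA sequence
        return self.diffusion(loa)   # LOA -> generated audio
```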
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
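To make the CLIPSonic bridging idea above concrete: because CLIP embeds images and text in a shared space, a diffusion model trained to condition on frame embeddings can plausibly be conditioned on text embeddings at inference. The CLIP calls below use the Hugging Face transformers API; the diffusion model itself is an assumed placeholder.

```python
# Minimal sketch of CLIPSonic's modality bridge: train the audio diffusion
# model on CLIP image embeddings of video frames, then swap in CLIP text
# embeddings at inference, since CLIP places both in one space.
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_condition(frame):
    """Training-time conditioning: CLIP embedding of a video frame."""
    inputs = processor(images=frame, return_tensors="pt")
    return clip.get_image_features(**inputs)

def text_condition(prompt: str):
    """Inference-time conditioning: CLIP text embedding in the same space."""
    inputs = processor(text=[prompt], return_tensors="pt")
    return clip.get_text_features(**inputs)

# Training (schematic):  cond = frame_condition(video_frame)
# Inference (schematic): cond = text_condition("dog barking in the rain")
# audio = diffusion_model.sample(cond)   # assumed placeholder model
```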
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object with its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z)
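The joint ASR and audio-captioning summary above suggests a shared encoder with task-specific decoders; the sketch below shows one plausible weighted multi-task loss. All module interfaces are assumed for illustration; the paper studies several variants.

```python
# One plausible joint ASR + audio-captioning setup: a shared acoustic encoder
# feeding two decoders, combined with a weighted multi-task loss.
import torch.nn as nn

class JointASRCaptionerSketch(nn.Module):
    def __init__(self, encoder, asr_decoder, caption_decoder, alpha=0.5):
        super().__init__()
        self.encoder = encoder                  # shared acoustic encoder
        self.asr_decoder = asr_decoder          # transcribes the speech content
        self.caption_decoder = caption_decoder  # describes the acoustic scene
        self.alpha = alpha                      # balances the two objectives

    def loss(self, audio, transcript, caption):
        h = self.encoder(audio)
        l_asr = self.asr_decoder.loss(h, transcript)   # assumed interface
        l_aac = self.caption_decoder.loss(h, caption)  # assumed interface
        return self.alpha * l_asr + (1.0 - self.alpha) * l_aac
```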
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
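As a rough sketch of the Wav-BERT wiring described above (wav2vec 2.0 plus BERT, trained end to end), the code below projects acoustic features into BERT's embedding space via inputs_embeds; the paper's actual fusion and alignment mechanisms are more elaborate.

```python
# Rough sketch of the Wav-BERT bridge: project wav2vec 2.0 acoustic features
# into BERT's embedding space and fine-tune end to end.
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel

class WavBertSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.language = BertModel.from_pretrained("bert-base-uncased")
        self.bridge = nn.Linear(self.acoustic.config.hidden_size,
                                self.language.config.hidden_size)

    def forward(self, waveform):
        # (batch, samples) -> (batch, frames, hidden); long inputs would need
        # downsampling or truncation to fit BERT's 512-position limit
        feats = self.acoustic(waveform).last_hidden_state
        return self.language(inputs_embeds=self.bridge(feats)).last_hidden_state
```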
- Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval [28.57294189207084]
The goal of audio captioning is to translate input audio into its description using natural language.
The proposed method succeeds in using a pre-trained language model for audio captioning.
The oracle performance of the pre-trained model-based caption generator was clearly better than that of the conventional method trained from scratch.
arXiv Detail & Related papers (2020-12-14T08:27:36Z)
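The retrieval-guided captioning idea above can be sketched as: embed the query audio, fetch captions of the most similar training clips, and prompt a pre-trained language model with them. The audio embedder and caption bank below are hypothetical stand-ins.

```python
# Sketch of retrieval-guided captioning: embed the query audio, fetch captions
# of the most similar training clips, and prompt a pre-trained language model.
import torch
import torch.nn.functional as F

def retrieve_similar_captions(query_emb: torch.Tensor,
                              bank_embs: torch.Tensor,
                              bank_captions: list,
                              k: int = 3) -> list:
    """Return captions of the k training clips most similar to the query."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), bank_embs)  # (N,)
    return [bank_captions[i] for i in sims.topk(k).indices]

# The retrieved captions would then seed a pre-trained LM (e.g. GPT-2):
#   prompt = " ; ".join(retrieve_similar_captions(emb, bank_embs, bank_caps))
#   caption = language_model.generate(prompt)  # assumed generation interface
```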
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance on both tasks.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.