Killing two birds with one stone: Can an audio captioning system also be
used for audio-text retrieval?
- URL: http://arxiv.org/abs/2308.15090v1
- Date: Tue, 29 Aug 2023 07:53:17 GMT
- Title: Killing two birds with one stone: Can an audio captioning system also be
used for audio-text retrieval?
- Authors: Etienne Labbé (IRIT-SAMoVA), Thomas Pellegrini (IRIT-SAMoVA), Julien Pinquier (IRIT-SAMoVA)
- Abstract summary: This work investigates the relationship between Audio-Text Retrieval (ATR) and Automated Audio Captioning (AAC).
For ATR, we propose ranking audio/caption pairs by the standard Cross-Entropy loss values obtained for each pair.
Experimental results on the Clotho and AudioCaps datasets demonstrate decent recall values using this simple approach.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated Audio Captioning (AAC) aims to develop systems capable of
describing an audio recording using a textual sentence. In contrast, Audio-Text
Retrieval (ATR) systems seek to find the best matching audio recording(s) for a
given textual query (Text-to-Audio) or vice versa (Audio-to-Text). These tasks
require different types of systems: AAC employs a sequence-to-sequence model,
while ATR utilizes a ranking model that compares audio and text representations
within a shared projection subspace. However, this work investigates the
relationship between AAC and ATR by exploring the ATR capabilities of an
unmodified AAC system, without fine-tuning for the new task. Our AAC system
consists of an audio encoder (ConvNeXt-Tiny) trained on AudioSet for audio
tagging, and a transformer decoder responsible for generating sentences. For
AAC, it achieves a high SPIDEr-FL score of 0.298 on Clotho and 0.472 on
AudioCaps on average. For ATR, we propose using the standard Cross-Entropy loss
values obtained for any audio/caption pair. Experimental results on the Clotho
and AudioCaps datasets demonstrate decent recall values using this simple
approach. For instance, we obtained a Text-to-Audio R@1 value of 0.382 for
AudioCaps, which is above the current state-of-the-art method without external
data. Interestingly, we observe that normalizing the loss values was necessary
for Audio-to-Text retrieval.
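
As a concrete illustration of the abstract's recipe, the sketch below (not the authors' code) shows how an unmodified AAC encoder-decoder could be reused for retrieval in PyTorch: each audio/caption pair is scored by the caption's mean token Cross-Entropy loss under teacher forcing, Text-to-Audio ranks clips by ascending loss, and Audio-to-Text first normalizes losses per caption across the audio pool. The model interface and the per-caption standardization are assumptions; the paper only states that normalizing the loss values was necessary for Audio-to-Text.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def caption_loss(model, audio, caption_tokens):
    # Mean token cross-entropy of the caption given the audio (lower = better match).
    # Assumes a generic seq2seq interface: model(audio_batch, decoder_inputs) -> (B, T, vocab) logits.
    logits = model(audio.unsqueeze(0), caption_tokens[:-1].unsqueeze(0))
    return F.cross_entropy(logits.squeeze(0), caption_tokens[1:], reduction="mean").item()

def loss_matrix(model, audios, captions):
    # Cross-Entropy loss for every audio/caption pair, shape (n_audios, n_captions).
    return torch.tensor([[caption_loss(model, a, c) for c in captions] for a in audios])

def text_to_audio_ranking(losses, caption_idx):
    # Text-to-Audio: rank audio clips for one caption, smallest loss first.
    return torch.argsort(losses[:, caption_idx])

def audio_to_text_ranking(losses, audio_idx):
    # Audio-to-Text: standardize each caption's losses across the audio pool before ranking
    # (an illustrative normalization; the paper does not specify a particular scheme).
    mean = losses.mean(dim=0, keepdim=True)
    std = losses.std(dim=0, keepdim=True)
    return torch.argsort(((losses - mean) / (std + 1e-8))[audio_idx])
```

Ranking by ascending loss treats caption likelihood as a matching score; the per-caption normalization compensates for captions that are intrinsically easier or harder to generate regardless of the audio, which is why it matters more for Audio-to-Text than for Text-to-Audio.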
Related papers
- Retrieval-Augmented Text-to-Audio Generation [36.328134891428085]
We show that the state-of-the-art models, such as AudioLDM, are biased in their generation performance.
We propose a simple retrieval-augmented approach for TTA models.
We show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types.
arXiv Detail & Related papers (2023-09-14T22:35:39Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z) - Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges' Award at Task 6A of the DCASE Challenge 2022.
arXiv Detail & Related papers (2023-04-06T07:58:27Z) - Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion
Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - Interactive Audio-text Representation for Automated Audio Captioning
with Contrastive Learning [25.06635361326706]
We propose a novel AAC system called CLIP-AAC to learn interactive cross-modality representation.
The proposed CLIP-AAC introduces an audio-head and a text-head in the pre-trained encoder to extract audio-text information.
We also apply contrastive learning to narrow the domain difference by learning the correspondence between the audio signal and its paired captions.
arXiv Detail & Related papers (2022-03-29T13:06:46Z) - Connecting the Dots between Audio and Text without Parallel Data through
Visual Knowledge Transfer [40.85506152074302]
VIP-ANT induces Audio-Text alignment without using parallel audio-text data.
Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.
arXiv Detail & Related papers (2021-12-16T16:22:10Z) - CL4AC: A Contrastive Loss for Audio Captioning [43.83939284740561]
We propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC)
In CL4AC, the self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and texts.
Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.
arXiv Detail & Related papers (2021-07-21T10:13:02Z) - VGGSound: A Large-scale Audio-Visual Dataset [160.1604237188594]
We propose a scalable pipeline to create an audio dataset from open-source media.
We use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes.
The resulting dataset can be used for training and evaluating audio recognition models.
arXiv Detail & Related papers (2020-04-29T17:46:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.