Audio-text Retrieval in Context
- URL: http://arxiv.org/abs/2203.13645v2
- Date: Tue, 29 Mar 2022 04:32:47 GMT
- Title: Audio-text Retrieval in Context
- Authors: Siyu Lou, Xuenan Xu, Mengyue Wu, Kai Yu
- Abstract summary: In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment.
We build our contextual audio-text retrieval system using pre-trained audio features and a descriptor-based aggregation method.
With our proposed system, a significant improvement has been achieved on bidirectional audio-text retrieval on all metrics, including recall, median rank, and mean rank.
- Score: 24.38055340045366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-text retrieval based on natural language descriptions is a challenging
task. It involves learning cross-modality alignments between long sequences
under inadequate data conditions. In this work, we investigate several audio
features as well as sequence aggregation methods for better audio-text
alignment. Moreover, through a qualitative analysis we observe that semantic
mapping is more important than temporal relations in contextual retrieval.
Using pre-trained audio features and a descriptor-based aggregation method, we
build our contextual audio-text retrieval system. Specifically, we utilize
PANNs features pre-trained on a large sound event dataset and NetRVLAD pooling,
which directly works with averaged descriptors. Experiments are conducted on
the AudioCaps and CLOTHO datasets, and results are compared with the previous
state-of-the-art system. With our proposed system, a significant improvement
has been achieved on bidirectional audio-text retrieval on all metrics,
including recall, median rank, and mean rank.
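To make the aggregation step concrete, here is a minimal NumPy sketch of NetRVLAD-style pooling: frame-level descriptors are softly assigned to clusters and aggregated directly, without the cluster-center residuals of NetVLAD. The function and parameter names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def netrvlad(descriptors, cluster_weights, cluster_biases):
    """NetRVLAD-style pooling: soft-assign frame descriptors to K clusters
    and aggregate the weighted descriptors directly (no residuals).

    descriptors:     (T, D) frame-level audio features (e.g. from PANNs)
    cluster_weights: (D, K) learnable projection for soft assignment
    cluster_biases:  (K,)   learnable assignment biases
    Returns a fixed-length (K * D,) clip embedding.
    """
    logits = descriptors @ cluster_weights + cluster_biases      # (T, K)
    logits -= logits.max(axis=1, keepdims=True)                  # stable softmax
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)                  # (T, K)

    # Unlike NetVLAD, cluster centers are not subtracted: the descriptors
    # themselves are aggregated per cluster, weighted by the assignments.
    vlad = assign.T @ descriptors                                # (K, D)

    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12  # intra-norm
    flat = vlad.reshape(-1)
    return flat / (np.linalg.norm(flat) + 1e-12)                 # global L2

# Pool 300 frames of 2048-dim features into one clip-level vector.
rng = np.random.default_rng(0)
clip_emb = netrvlad(rng.normal(size=(300, 2048)),
                    0.01 * rng.normal(size=(2048, 8)),
                    np.zeros(8))
print(clip_emb.shape)  # (16384,)
```

The reported metrics can be sketched as well. The snippet below assumes one ground-truth caption per clip (the diagonal of a similarity matrix), which simplifies the actual AudioCaps/Clotho protocol, where a clip may have several reference captions.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Text-to-audio retrieval metrics from an (n, n) similarity matrix
    whose diagonal holds the ground-truth audio-caption pairs.
    Transpose `sim` for the audio-to-text direction.
    """
    n = sim.shape[0]
    order = np.argsort(-sim, axis=0)  # audios sorted per caption, best first
    ranks = np.array([np.where(order[:, j] == j)[0][0] + 1 for j in range(n)])
    metrics = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["medR"] = float(np.median(ranks))   # median rank (lower is better)
    metrics["meanR"] = float(np.mean(ranks))    # mean rank (lower is better)
    return metrics

sim = np.eye(8) + 0.01 * np.random.default_rng(0).random((8, 8))
print(retrieval_metrics(sim))  # perfect retrieval on this toy matrix
```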
Related papers
- Audio Captioning via Generative Pair-to-Pair Retrieval with Refined Knowledge Base [0.0]
Retrieval-Augmented Generation (RAG) retrieves audio-text pairs from a knowledge base and augments them with query audio to generate accurate textual responses.
We propose generative pair-to-pair retrieval, which uses the generated caption as a text query to accurately find relevant audio-text pairs.
Our approach achieves state-of-the-art results on benchmarks including AudioCaps, Clotho, and Auto-ACD.
arXiv Detail & Related papers (2024-10-14T04:57:32Z)
- Dissecting Temporal Understanding in Text-to-Audio Retrieval [22.17493527005141]
We analyse the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval.
In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps and Clotho datasets.
We present a loss function that encourages text-audio models to focus on the temporal ordering of events.
arXiv Detail & Related papers (2024-09-01T22:01:21Z)
- Distance Sampling-based Paraphraser Leveraging ChatGPT for Text Data Manipulation [15.765495448426904]
We propose a novel approach to tackle the data imbalance problem in audio-language retrieval task.
A distance sampling-based paraphraser leveraging ChatGPT generates a controllable distribution of manipulated text data.
The proposed approach is shown to significantly enhance performance in audio-text retrieval, outperforming conventional text augmentation techniques.
arXiv Detail & Related papers (2024-05-01T07:44:28Z)
- Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets [6.617487928813374]
We present a text-to-audio-retrieval system based on pre-trained text and spectrogram transformers.
Our system ranked first in the 2023 DCASE Challenge, and it outperforms the current state of the art on the ClothoV2 benchmark by 5.6 percentage points in mAP@10.
arXiv Detail & Related papers (2023-08-08T13:46:55Z)
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
They often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z)
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
arXiv Detail & Related papers (2023-03-30T14:07:47Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- Separate What You Describe: Language-Queried Audio Source Separation [53.65665794338574]
We introduce the task of language-queried audio source separation (LASS).
LASS aims to separate a target source from an audio mixture based on a natural language query of the target source.
We propose LASS-Net, an end-to-end neural network trained to jointly process acoustic and linguistic information.
arXiv Detail & Related papers (2022-03-28T23:47:57Z)
- Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness (a generic sketch of the triplet objective appears after this list).
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
arXiv Detail & Related papers (2020-03-27T07:37:15Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
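As a concrete illustration of the triplet objective mentioned in the cross-modal representation learning entry above, the following is a minimal sketch of the standard margin-based triplet loss; this is the textbook formulation, not necessarily the exact loss used in that paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on embedding vectors: pull the anchor
    (e.g. an audio embedding) towards a semantically related positive
    (e.g. a text embedding) and push it away from an unrelated negative,
    up to a fixed margin."""
    d_pos = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
a, p, n = (rng.normal(size=128) for _ in range(3))
print(triplet_loss(a, p, n))
```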
This list is automatically generated from the titles and abstracts of the papers on this site.