Advancing Natural-Language Based Audio Retrieval with PaSST and Large
Audio-Caption Data Sets
- URL: http://arxiv.org/abs/2308.04258v1
- Date: Tue, 8 Aug 2023 13:46:55 GMT
- Title: Advancing Natural-Language Based Audio Retrieval with PaSST and Large
Audio-Caption Data Sets
- Authors: Paul Primus, Khaled Koutini, Gerhard Widmer
- Abstract summary: We present a text-to-audio-retrieval system based on pre-trained text and spectrogram transformers.
Our system ranked first in the 2023 DCASE Challenge, and it outperforms the current state of the art on the ClothoV2 benchmark by 5.6 pp. mAP@10.
- Score: 6.617487928813374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work presents a text-to-audio-retrieval system based on pre-trained text
and spectrogram transformers. Our method projects recordings and textual
descriptions into a shared audio-caption space in which related examples from
different modalities are close. Through a systematic analysis, we examine how
each component of the system influences retrieval performance. As a result, we
identify two key components that play a crucial role in driving performance:
the self-attention-based audio encoder for audio embedding and the utilization
of additional human-generated and synthetic data sets during pre-training. We
further experimented with augmenting ClothoV2 captions with available keywords
to increase their variety; however, this only led to marginal improvements. Our
system ranked first in the 2023 DCASE Challenge, and it outperforms the
current state of the art on the ClothoV2 benchmark by 5.6 pp. mAP@10.
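
The retrieval step and the mAP@10 metric mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional toy embeddings, the function names, and the choice of cosine similarity are assumptions made for illustration, and the normalization by min(number of relevant items, k) is one common convention for average precision at k. In a real system the vectors would come from the trained text and audio encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_audio(caption_vec, audio_vecs):
    """Return audio indices sorted from most to least similar to the caption."""
    scores = [cosine(caption_vec, a) for a in audio_vecs]
    return sorted(range(len(audio_vecs)), key=lambda i: scores[i], reverse=True)

def average_precision_at_k(ranking, relevant, k=10):
    """Average precision over the top-k ranked items for one query."""
    hits, total = 0, 0.0
    for rank, idx in enumerate(ranking[:k], start=1):
        if idx in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0

def map_at_k(caption_vecs, audio_vecs, relevant_sets, k=10):
    """Mean of the per-caption average precision at k."""
    aps = [
        average_precision_at_k(rank_audio(c, audio_vecs), r, k)
        for c, r in zip(caption_vecs, relevant_sets)
    ]
    return sum(aps) / len(aps)

# Hypothetical embeddings in a shared 2-D audio-caption space.
audio = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
captions = [[0.9, 0.1], [0.1, 0.9]]
relevant = [{0}, {2}]  # index of the matching recording for each caption

score = map_at_k(captions, audio, relevant)
print(score)  # 0.75
```

When every caption has exactly one matching recording, as in the toy data above, the average precision for a query reduces to the reciprocal rank of that recording if it appears in the top k, and to zero otherwise.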
Related papers
- End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding [4.604877755214193]
Existing end-to-end piano A2S systems have been trained and evaluated with only synthetic data.
We propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that aligns with the hierarchical structure of musical scores.
We propose a two-stage training scheme, which involves pre-training the model using an expressive performance rendering system on synthetic audio, followed by fine-tuning the model using recordings of human performance.
arXiv Detail & Related papers (2024-05-22T10:52:04Z) - MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - Text-Driven Foley Sound Generation With Latent Diffusion Model [33.4636070590045]
Foley sound generation aims to synthesise the background sound for multimedia content.
We propose a diffusion model based system for Foley sound generation with text conditions.
arXiv Detail & Related papers (2023-06-17T14:16:24Z) - Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges Award at the Task6A of DCASE Challenge 2022.
arXiv Detail & Related papers (2023-04-06T07:58:27Z) - UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost the performance even when predicting spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z) - Audio-text Retrieval in Context [24.38055340045366]
In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment.
We build our contextual audio-text retrieval system using pre-trained audio features and a descriptor-based aggregation method.
With our proposed system, a significant improvement has been achieved on bidirectional audio-text retrieval, on all metrics including recall, median and mean rank.
arXiv Detail & Related papers (2022-03-25T13:41:17Z) - Evaluating Off-the-Shelf Machine Listening and Natural Language Models
for Automated Audio Captioning [16.977616651315234]
A captioning system has to identify various information from the input signal and express it with natural language.
We evaluate the performance of off-the-shelf models with a Transformer-based captioning approach.
arXiv Detail & Related papers (2021-10-14T14:42:38Z) - Effects of Word-frequency based Pre- and Post- Processings for Audio
Captioning [49.41766997393417]
The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely, data augmentation, multi-task learning, and post-processing, for audio captioning.
The system received the highest evaluation scores, but which of the individual elements most fully contributed to its performance has not yet been clarified.
arXiv Detail & Related papers (2020-09-24T01:07:33Z) - Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information, to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)