Improving Natural-Language-based Audio Retrieval with Transfer Learning
and Audio & Text Augmentations
- URL: http://arxiv.org/abs/2208.11460v1
- Date: Wed, 24 Aug 2022 11:54:42 GMT
- Title: Improving Natural-Language-based Audio Retrieval with Transfer Learning
and Audio & Text Augmentations
- Authors: Paul Primus and Gerhard Widmer
- Abstract summary: We propose a system to project recordings and textual descriptions into a shared audio-caption space.
Our results show that the augmentation strategies used reduce overfitting and improve retrieval performance.
We further show that pre-training the system on the AudioCaps dataset leads to additional improvements.
- Score: 7.817685358710508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The absence of large labeled datasets remains a significant challenge in many
application areas of deep learning. Researchers and practitioners typically
resort to transfer learning and data augmentation to alleviate this issue. We
study these strategies in the context of audio retrieval with natural language
queries (Task 6b of the DCASE 2022 Challenge). Our proposed system uses
pre-trained embedding models to project recordings and textual descriptions
into a shared audio-caption space in which related examples from different
modalities are close. We employ various data augmentation techniques on audio
and text inputs and systematically tune their corresponding hyperparameters
with sequential model-based optimization. Our results show that the
augmentation strategies used reduce overfitting and improve retrieval performance.
We further show that pre-training the system on the AudioCaps dataset leads to
additional improvements.
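The abstract describes a system that projects recordings and captions into a shared audio-caption space in which matching pairs are close. The following PyTorch sketch illustrates that general idea with pre-trained encoders, linear projections, and a symmetric contrastive loss. It is a minimal illustration under assumptions, not the authors' implementation: the encoder modules, dimensions, and temperature are placeholders.

```python
# Minimal sketch of a shared audio-caption embedding space trained with a
# symmetric contrastive loss. Module names, dimensions, and the temperature
# are illustrative placeholders, not the authors' configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedSpaceModel(nn.Module):
    def __init__(self, audio_encoder, text_encoder,
                 audio_dim, text_dim, shared_dim=512):
        super().__init__()
        self.audio_encoder = audio_encoder  # e.g. a pre-trained spectrogram model
        self.text_encoder = text_encoder    # e.g. a pre-trained sentence encoder
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, audio, text):
        # Project both modalities into the shared space and L2-normalize.
        a = F.normalize(self.audio_proj(self.audio_encoder(audio)), dim=-1)
        t = F.normalize(self.text_proj(self.text_encoder(text)), dim=-1)
        return a, t


def contrastive_loss(a, t, temperature=0.05):
    # Pairwise similarities between all audio and caption embeddings in the
    # batch; matching pairs sit on the diagonal.
    logits = a @ t.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy over audio-to-text and text-to-audio directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```

At retrieval time, candidate captions (or recordings) would be ranked by cosine similarity to the query's embedding in the shared space.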
Related papers
- AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations [1.2101820447447276]
Multi-modal learning in the audio-language domain has seen significant advancements in recent years.
However, audio-language learning faces challenges due to limited and lower-quality data compared to image-language tasks.
Our method systematically generates audio-caption pairs by augmenting audio clips with natural language labels and corresponding audio signal processing operations (a toy version of this pairing is sketched after this entry).
This scalable method produces AudioSetMix, a high-quality training dataset for text-and-audio related models.
arXiv Detail & Related papers (2024-05-17T21:08:58Z)
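To make the pairing idea above concrete, here is a toy sketch: each audio signal processing operation is coupled with a natural-language label appended to the caption, so audio and text stay consistent. The operations and label phrasings are illustrative assumptions, not the AudioSetMix recipe.

```python
# Toy pairing of audio transformations with caption edits. The operations and
# label phrasings are hypothetical examples, not the AudioSetMix recipe.
import random
import torch


def change_volume(wav: torch.Tensor, gain: float) -> torch.Tensor:
    return wav * gain


# Each entry couples a waveform transform with the natural-language label
# that keeps the caption consistent with the augmented clip.
AUGMENTATIONS = [
    (lambda wav: change_volume(wav, 0.3), "at a low volume"),
    (lambda wav: change_volume(wav, 2.0), "loudly"),
    (lambda wav: torch.flip(wav, dims=[-1]), "played in reverse"),
]


def augment_pair(wav: torch.Tensor, caption: str):
    op, label = random.choice(AUGMENTATIONS)
    return op(wav), f"{caption}, {label}"
```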
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework for data augmentation based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
They often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z)
- A Whisper transformer for audio captioning trained with synthetic captions and transfer learning [0.0]
We present our approach to audio captioning, focusing on the use of a pretrained speech-to-text Whisper model and pretraining on synthetic captions.
Our findings demonstrate the impact of different training strategies on the performance of the audio captioning model.
arXiv Detail & Related papers (2023-05-15T22:20:07Z)
- Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting (a generic Patchout variant is sketched after this entry).
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges' Award in Task 6A of the DCASE 2022 Challenge.
arXiv Detail & Related papers (2023-04-06T07:58:27Z)
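As a rough illustration of the Patchout idea referenced in the entry above, the following sketch drops a random subset of spectrogram patch tokens before a transformer encoder, shortening the sequence and acting as a regularizer. Shapes and the drop ratio are assumptions, and this is a generic variant rather than the paper's exact procedure.

```python
# Generic Patchout-style regularization: keep a random subset of spectrogram
# patch tokens before the transformer encoder. Shapes and the drop ratio are
# assumptions for illustration.
import torch


def patchout(tokens: torch.Tensor, drop_ratio: float = 0.5) -> torch.Tensor:
    # tokens: (batch, num_patches, embed_dim)
    batch, num_patches, embed_dim = tokens.shape
    keep = max(1, int(num_patches * (1.0 - drop_ratio)))
    # Draw a random permutation of patch indices per example, keep the first
    # `keep` of them, and gather the surviving tokens.
    idx = torch.rand(batch, num_patches, device=tokens.device).argsort(dim=1)[:, :keep]
    idx = idx.unsqueeze(-1).expand(-1, -1, embed_dim)
    return tokens.gather(1, idx)
```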
- ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z)
- Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval [11.161404854726348]
We present an analysis of large-scale pretrained deep learning models used for cross-modal (text-to-audio) retrieval.
We use embeddings extracted by these models in a metric learning framework to connect matching pairs of audio and text.
arXiv Detail & Related papers (2022-10-06T11:45:14Z)
- Automated Audio Captioning: an Overview of Recent Progress and New Challenges [56.98522404673527]
Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips.
We present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets.
arXiv Detail & Related papers (2022-05-12T08:36:35Z)
- Audio-text Retrieval in Context [24.38055340045366]
In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment.
We build our contextual audio-text retrieval system using pre-trained audio features and a descriptor-based aggregation method (generic pooling alternatives are sketched after this entry).
Our system achieves significant improvements on bidirectional audio-text retrieval across all metrics, including recall, median rank, and mean rank.
arXiv Detail & Related papers (2022-03-25T13:41:17Z)
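The pooling alternatives referenced in the entry above, in a generic form: mean pooling and a learned-query attention pooling over frame-level audio embeddings. This is an illustrative sketch; the paper's descriptor-based aggregation is not reproduced here.

```python
# Two generic ways to aggregate frame-level audio embeddings into a single
# clip-level vector; the attention variant uses a learned query and is not
# the paper's descriptor-based method.
import torch
import torch.nn as nn


def mean_pool(frames: torch.Tensor) -> torch.Tensor:
    # frames: (batch, time, dim) -> (batch, dim)
    return frames.mean(dim=1)


class AttentionPool(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # Weight each frame by its similarity to a learned query vector.
        weights = torch.softmax(frames @ self.query, dim=1)  # (batch, time)
        return (weights.unsqueeze(-1) * frames).sum(dim=1)   # (batch, dim)
```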
- Visual Speech Recognition for Multiple Languages in the Wild [64.52593130370757]
We show that designing better VSR models is equally important to using larger training sets.
We propose the addition of prediction-based auxiliary tasks to a VSR model.
We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
arXiv Detail & Related papers (2022-02-26T07:21:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.