Audio Retrieval with WavText5K and CLAP Training
- URL: http://arxiv.org/abs/2209.14275v1
- Date: Wed, 28 Sep 2022 17:39:26 GMT
- Title: Audio Retrieval with WavText5K and CLAP Training
- Authors: Soham Deshmukh, Benjamin Elizalde, Huaming Wang
- Abstract summary: We propose a new collection of about five thousand web audio-text pairs that we refer to as WavText5K.
When used to train our retrieval system, WavText5K improved performance more than other audio captioning datasets.
Our framework learns to connect language and audio content by using a text encoder, two audio encoders, and a contrastive learning objective.
- Score: 8.362098382773265
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-Text retrieval takes a natural language query to retrieve relevant
audio files in a database. Conversely, Text-Audio retrieval takes an audio file
as a query to retrieve relevant natural language descriptions. Most of the
literature trains retrieval systems with one audio captioning dataset, but
the benefit of training with multiple datasets is underexplored.
Moreover, retrieval systems have to learn the alignment between elaborate
sentences and audio content of variable length, ranging from a few
seconds to several minutes. In this work, we propose a new collection of web
audio-text pairs and a new framework for retrieval. First, we provide a new
collection of about five thousand web audio-text pairs that we refer to as
WavText5K. When used to train our retrieval system, WavText5K improved
performance more than other audio captioning datasets. Second, our framework
learns to connect language and audio content by using a text encoder, two audio
encoders, and a contrastive learning objective. Combining both audio encoders
helps to process variable-length audio. Together, the two contributions beat
state-of-the-art performance on AudioCaps and Clotho for Text-Audio retrieval
by a relative 2% and 16%, and for Audio-Text retrieval by 6% and 23%.
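To make the training objective concrete, here is a minimal sketch of a CLAP-style symmetric contrastive loss in PyTorch. All names are illustrative, and the commented fusion of the two audio encoders is an assumption for illustration, not the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    audio/text embeddings, the objective family used in CLAP training."""
    # L2-normalize so dot products become cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds matched pairs.
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both retrieval directions, then average.
    loss_audio_to_text = F.cross_entropy(logits, targets)
    loss_text_to_audio = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)

# Hypothetical fusion of the two audio encoders (an assumption): embed the
# waveform with both encoders, concatenate, and project into the shared space.
# audio_emb = proj(torch.cat([enc_a(wave), enc_b(wave)], dim=-1))
```

Minimizing this loss pulls matched audio-text pairs together and pushes mismatched pairs apart in the shared embedding space, which is what enables retrieval in both directions at inference time.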
Related papers
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- Bridging Language Gaps in Audio-Text Retrieval [28.829775980536574]
We propose a language enhancement (LE) using a multilingual text encoder (SONAR) to encode the text data with language-specific information.
We optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval.
Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho.
arXiv Detail & Related papers (2024-06-11T07:12:12Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment [16.304894187743013]
TEFAL is a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query.
Our approach employs two independent cross-modal attention blocks that enable the text to attend to the audio and video representations separately; a minimal sketch of this mechanism appears after this list.
arXiv Detail & Related papers (2023-07-24T17:43:13Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research [82.42802570171096]
We introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions.
Online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning.
We propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically.
arXiv Detail & Related papers (2023-03-30T14:07:47Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Introducing Auxiliary Text Query-modifier to Content-based Audio Retrieval [37.02112904035811]
The amount of audio data available on public websites is growing rapidly.
We propose a content-based audio retrieval method that can retrieve a target audio that is similar to but slightly different from the query audio.
arXiv Detail & Related papers (2022-07-20T08:19:54Z)
- Audio-text Retrieval in Context [24.38055340045366]
In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment.
We build our contextual audio-text retrieval system using pre-trained audio features and a descriptor-based aggregation method.
Our proposed system achieves a significant improvement on bidirectional audio-text retrieval across all metrics, including recall and median and mean rank.
arXiv Detail & Related papers (2022-03-25T13:41:17Z)
- Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer [40.85506152074302]
VIP-ANT induces Audio-Text alignment without using parallel audio-text data.
Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.
arXiv Detail & Related papers (2021-12-16T16:22:10Z)
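As noted in the TEFAL entry above, the following is a minimal, hypothetical sketch of text-conditioned cross-modal attention: two independent attention blocks in which text tokens act as queries over audio and video features separately. Class and parameter names are illustrative assumptions, not TEFAL's actual implementation.

```python
import torch.nn as nn

class TextConditionedAlignment(nn.Module):
    """Illustrative sketch: text tokens attend to audio and video
    features through two independent cross-attention blocks."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feats, audio_feats, video_feats):
        # Text is the query; each modality supplies its own keys/values,
        # so the two alignments are computed independently.
        audio_aligned, _ = self.text_to_audio(text_feats, audio_feats, audio_feats)
        video_aligned, _ = self.text_to_video(text_feats, video_feats, video_feats)
        # Mean-pool over tokens to get one text-conditioned embedding
        # per modality for retrieval scoring.
        return audio_aligned.mean(dim=1), video_aligned.mean(dim=1)
```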
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.