Related papers: Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

URL: http://arxiv.org/abs/2408.11641v1
Date: Wed, 21 Aug 2024 14:10:58 GMT
Title: Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval
Authors: Paul Primus, Florian Schmid, Gerhard Widmer,
Abstract summary: We propose a two-staged training procedure in which multiple retrieval models are first trained without estimated correspondences. In the second stage, the audio-caption correspondences predicted by these models then serve as prediction targets. We evaluate our method on the ClothoV2 and the AudioCaps benchmark and show that it improves retrieval performance, even in a restricting self-distillation setting.
Score: 3.5570874721859016
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio-caption pairs. This leads to a shared embedding space in which corresponding items from the two modalities end up close together. Since audio-caption datasets typically only contain matching pairs of recordings and descriptions, it has become common practice to create mismatching pairs by pairing the audio with a caption randomly drawn from the dataset. This is not ideal because the randomly sampled caption could, just by chance, partly or entirely describe the audio recording. However, correspondence information for all possible pairs is costly to annotate and thus typically unavailable; we, therefore, suggest substituting it with estimated correspondences. To this end, we propose a two-staged training procedure in which multiple retrieval models are first trained as usual, i.e., without estimated correspondences. In the second stage, the audio-caption correspondences predicted by these models then serve as prediction targets. We evaluate our method on the ClothoV2 and the AudioCaps benchmark and show that it improves retrieval performance, even in a restricting self-distillation setting where a single model generates and then learns from the estimated correspondences. We further show that our method outperforms the current state of the art by 1.6 pp. mAP@10 on the ClothoV2 benchmark.

Related papers

Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes [16.530816405275715]
We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or at best, together but sequentially without mixing.
arXiv Detail & Related papers (2025-03-24T16:56:04Z)
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations. Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model. Inspired by the complementarity of speech-PseudoLabel pair and SynthesizedAudio-text pair, we propose a complementary joint training(CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
Audio-Visual Synchronisation in the wild [149.84890978170174]
We identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync. We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length. We set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
arXiv Detail & Related papers (2021-12-08T17:50:26Z)
Using multiple reference audios and style embedding constraints for speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios. The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z)
Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations [49.55361944105796]
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker.
arXiv Detail & Related papers (2020-10-23T08:34:52Z)
Audio-based Near-Duplicate Video Retrieval with Audio Similarity Learning [19.730467023817123]
We propose the Audio Similarity Learning (AuSiL) approach that effectively captures temporal patterns of audio similarity between video pairs. We train our network following a triplet generation process and optimize the triplet loss function. The proposed approach achieves very competitive results compared to three state-of-the-art methods.
arXiv Detail & Related papers (2020-10-17T08:12:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.