Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation
- URL: http://arxiv.org/abs/2405.10084v1
- Date: Thu, 16 May 2024 13:28:10 GMT
- Title: Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation
- Authors: Manh Luong, Khai Nguyen, Nhat Ho, Reza Haf, Dinh Phung, Lizhen Qu
- Abstract summary: We introduce the mini-batch Learning-to-match (m-LTM) framework for audio-text retrieval problems.
We conduct experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50.
Our proposed method is capable of learning a rich and expressive joint embedding space, which achieves SOTA performance.
- Score: 46.657781785006506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Learning-to-match (LTM) framework proves to be an effective inverse optimal transport approach for learning the underlying ground metric between two sources of data, facilitating subsequent matching. However, the conventional LTM framework faces scalability challenges, necessitating the use of the entire dataset each time the parameters of the ground metric are updated. In adapting LTM to the deep learning context, we introduce the mini-batch Learning-to-match (m-LTM) framework for audio-text retrieval problems. This framework leverages mini-batch subsampling and a Mahalanobis-enhanced family of ground metrics. Moreover, to cope with misaligned training data in practice, we propose a variant using partial optimal transport to mitigate the harm of misaligned pairs during training. We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50. Results demonstrate that our proposed method is capable of learning a rich and expressive joint embedding space, which achieves SOTA performance. Beyond this, the proposed m-LTM framework is able to close the modality gap between audio and text embeddings, surpassing both triplet and contrastive loss in the zero-shot sound event detection task on the ESC-50 dataset. Notably, our strategy of employing partial optimal transport with m-LTM demonstrates greater noise tolerance than contrastive loss, especially under varying noise ratios in the training data on the AudioCaps dataset. Our code is available at https://github.com/v-manhlt3/m-LTM-Audio-Text-Retrieval
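To make the matching objective concrete, the following is a minimal PyTorch sketch of the m-LTM idea: compute a Mahalanobis ground cost between audio and text embeddings in a mini-batch, solve entropic optimal transport with Sinkhorn iterations, and push the resulting plan toward the ground-truth diagonal matching. The function names, the regularization value, and the cross-entropy-to-diagonal target are illustrative assumptions, not the authors' exact implementation (see their repository for that).

```python
# A minimal sketch of the mini-batch Learning-to-match (m-LTM) objective.
# The Mahalanobis factor L is learned jointly with the encoders; epsilon,
# the iteration count, and the diagonal target plan are illustrative choices.
import torch

def mahalanobis_cost(audio, text, L):
    # audio, text: (B, d) embeddings; L: (d, d) learnable factor, so the
    # metric matrix M = L @ L.T is positive semi-definite by construction.
    diff = audio.unsqueeze(1) - text.unsqueeze(0)   # (B, B, d) pairwise diffs
    proj = diff @ L                                 # project into metric space
    return (proj ** 2).sum(dim=-1)                  # squared Mahalanobis cost

def sinkhorn_plan(C, eps=0.05, n_iters=50):
    # Entropic OT with uniform marginals over the mini-batch.
    B = C.size(0)
    a = torch.full((B,), 1.0 / B, device=C.device)
    b = torch.full((B,), 1.0 / B, device=C.device)
    K = torch.exp(-C / eps)                         # Gibbs kernel
    v = torch.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)      # transport plan (B, B)

def m_ltm_loss(audio_emb, text_emb, L):
    # Inverse-OT objective: cross-entropy between the ground-truth diagonal
    # matching of the mini-batch pairs and the entropic transport plan.
    C = mahalanobis_cost(audio_emb, text_emb, L)
    pi = sinkhorn_plan(C)
    target = torch.eye(C.size(0), device=C.device) / C.size(0)
    return -(target * torch.log(pi + 1e-9)).sum()
```

For the noise-robust variant described in the abstract, the full Sinkhorn plan would be replaced by a partial optimal transport plan that may leave a fraction of the mini-batch mass unmatched, so suspected misaligned pairs carry less weight in the loss.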
Related papers
- Make Some Noise: Towards LLM audio reasoning and generation using sound tokens [19.48089933713418]
We introduce a novel approach that combines Variational Quantization with Flow Matching to convert audio into ultra-low-bitrate discrete tokens at 0.23 kbps.
Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events.
arXiv Detail & Related papers (2025-03-28T09:43:47Z)
- TAIL: Text-Audio Incremental Learning [40.43860056218282]
Introducing new datasets may affect the feature space of the original dataset, leading to catastrophic forgetting.
We introduce a novel task called Text-Audio Incremental Learning task for text-audio retrieval.
We propose a new method, PTAT (Prompt Tuning for Audio-Text incremental learning).
arXiv Detail & Related papers (2025-03-06T09:39:36Z)
- TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch [18.661974399115007]
Recent LLM-based TTS works typically employ complex data processing pipelines to obtain high-quality training data.
In this work, we leverage a noise-robust audio tokenizer (S3Tokenizer) to design a simplified yet effective TTS data processing pipeline.
This pipeline maintains data quality while substantially reducing data acquisition costs, achieving a data retention rate of over 50%.
arXiv Detail & Related papers (2024-12-11T09:38:50Z)
- Disentangled Noisy Correspondence Learning [56.06801962154915]
Cross-modal retrieval is crucial in understanding latent correspondences across modalities.
DisNCL is a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning.
arXiv Detail & Related papers (2024-08-10T09:49:55Z)
- Towards Robust Few-shot Class Incremental Learning in Audio Classification using Contrastive Representation [1.3586572110652484]
Few-shot class-incremental learning addresses challenges arising from limited incoming data.
We propose supervised contrastive learning to refine the representation space, enhancing discriminative power and leading to better generalization.
arXiv Detail & Related papers (2024-07-27T14:16:25Z)
- Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024 [61.189875635090225]
Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST).
arXiv Detail & Related papers (2024-06-24T16:38:17Z)
- Class Anchor Margin Loss for Content-Based Image Retrieval [97.81742911657497]
We propose a novel repeller-attractor loss that falls in the metric learning paradigm, yet directly optimizes the L2 metric without the need to generate pairs.
We evaluate the proposed objective in the context of few-shot and full-set training on the CBIR task, by using both convolutional and transformer architectures.
arXiv Detail & Related papers (2023-06-01T12:53:10Z)
- Noisy Correspondence Learning with Meta Similarity Correction [22.90696057856008]
Multimodal learning relies on correct correspondence among multimedia data.
Most widely used datasets are harvested from the Internet and inevitably contain mismatched pairs.
We propose a Meta Similarity Correction Network (MSCN) to provide reliable similarity scores.
arXiv Detail & Related papers (2023-04-13T05:20:45Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval [7.459223771397159]
Cross-modal data (e.g. audiovisual) have different distributions and representations that cannot be directly compared.
To bridge the gap between audiovisual modalities, we learn a common subspace for them by utilizing the intrinsic correlation in the natural synchronization of audio-visual data with the aid of annotated labels.
We propose a new AV-CMR model to optimize semantic features by directly predicting labels and then measuring the intrinsic correlation between audio-visual data using a complete cross-triplet loss.
arXiv Detail & Related papers (2022-11-07T10:37:14Z)
- Environmental sound analysis with mixup based multitask learning and cross-task fusion [0.12891210250935145]
Acoustic scene classification and acoustic event classification are two closely related tasks.
In this letter, a two-stage method is proposed for the above tasks.
The proposed method confirms the complementary characteristics of acoustic scene and acoustic event classification.
arXiv Detail & Related papers (2021-03-30T05:11:53Z)
- Cross-Utterance Language Models with Acoustic Error Sampling [1.376408511310322]
A cross-utterance LM (CULM) is proposed to augment the input to a standard long short-term memory (LSTM) LM.
An acoustic error sampling technique is proposed to reduce the mismatch between training and test time.
Experiments performed on both AMI and Switchboard datasets show that CULMs outperform the LSTM LM baseline in WER.
arXiv Detail & Related papers (2020-08-19T17:40:11Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
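As a rough illustration of the recipe in the last entry (train a TTS system on the ASR corpus, then extend the training data with synthesized speech), here is a hedged Python sketch. The synthesize callable, which stands in for the trained TTS model, and the JSON-lines manifest layout are assumptions made for illustration, not that paper's code.

```python
# A minimal sketch of TTS-based data augmentation for end-to-end ASR.
# `synthesize` is a placeholder for a TTS model trained on the ASR corpus,
# assumed to return (waveform: numpy array, sample_rate: int); the JSON-lines
# manifest layout {"audio": path, "text": transcript} is likewise assumed.
import json
import os
import soundfile as sf

def augment_with_tts(manifest_path, out_path, synthesize, synth_dir="synth"):
    os.makedirs(synth_dir, exist_ok=True)
    with open(manifest_path) as f:
        entries = [json.loads(line) for line in f]
    augmented = list(entries)                        # keep all real utterances
    for i, entry in enumerate(entries):
        wav, sr = synthesize(entry["text"])          # hypothetical TTS call
        synth_path = os.path.join(synth_dir, f"utt_{i:06d}.wav")
        sf.write(synth_path, wav, sr)                # save synthetic audio
        augmented.append({"audio": synth_path, "text": entry["text"]})
    with open(out_path, "w") as f:
        for e in augmented:
            f.write(json.dumps(e) + "\n")
```

An ASR model can then be trained on the combined manifest of real and synthetic utterances.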
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.