Related papers: T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining

T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining

URL: http://arxiv.org/abs/2404.17806v1
Date: Sat, 27 Apr 2024 07:05:48 GMT
Title: T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
Authors: Yi Yuan, Zhuo Chen, Xubo Liu, Haohe Liu, Xuenan Xu, Dongya Jia, Yuanzhe Chen, Mark D. Plumbley, Wenwu Wang,
Abstract summary: Contrastive language-audio pretraining(CLAP) has been developed to align the representations of audio and language. We introduce T-CLAP, a temporal-enhanced CLAP model, to capture temporal information within audio and text features. T-CLAP shows improved capability in capturing the temporal relationship of sound events and outperforms state-of-the-art models by a significant margin.
Score: 38.604112878493396
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Contrastive language-audio pretraining~(CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks. However, current CLAP struggles to capture temporal information within audio and text features, presenting substantial limitations for tasks such as audio retrieval and generation. To address this gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use Large Language Models~(LLMs) and mixed-up strategies to generate temporal-contrastive captions for audio clips from extensive audio-text datasets. Subsequently, a new temporal-focused contrastive loss is designed to fine-tune the CLAP model by incorporating these synthetic data. We conduct comprehensive experiments and analysis in multiple downstream tasks. T-CLAP shows improved capability in capturing the temporal relationship of sound events and outperforms state-of-the-art models by a significant margin.

Related papers

From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs.<n>These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks.<n>We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples [55.2480439325792]
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs.<n>These models often hallucinate non-existent sound events, reducing their reliability in real-world applications.<n>We propose LISTEN, a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds.
arXiv Detail & Related papers (2025-05-20T15:44:01Z)
TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining [3.5570874721859016]
We propose a frame-wise contrastive training strategy that learns to align text descriptions with temporal regions in an audio recording.<n>Our model has better temporal text-audio alignment abilities compared to models trained only on global captions when evaluated on the AudioSet Strong benchmark.
arXiv Detail & Related papers (2025-05-12T14:30:39Z)
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
evaluating machine-generated audio captions is a complex task that requires considering diverse factors. We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs [3.8300818830608345]
Multi-modal contrastive learning strategies for audio and text have rapidly gained interest. The ability of these models to understand natural language and temporal relations is still a largely unexplored and open field for research. We propose to equip the multi-modal ALMs with temporal understanding without loosing their inherent prior capabilities of audio-language tasks with a temporal instillation method TeminAL.
arXiv Detail & Related papers (2024-08-17T18:53:17Z)
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models [53.48409081555687]
In this work, we explore such large pre-trained models to obtain features, i.e. CLIP for visual features, and CLAP for audio features. We propose a simple yet effective model that only relies on feed-forward neural networks. Our framework achieves state-of-the-art performance on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL.
arXiv Detail & Related papers (2024-04-09T13:39:37Z)
tinyCLAP: Distilling Constrastive Language-Audio Pretrained Models [2.9619090219410515]
This paper investigates how we can reduce the complexity of contrastive language-audio pre-trained models. We derive an unimodal distillation loss from first principles and explore how the dimensionality of the shared, multimodal latent space can be reduced. TinyCLAP uses only 6% of the original Microsoft CLAP parameters with a minimal reduction (less than 5%) in zero-shot classification performance.
arXiv Detail & Related papers (2023-11-24T14:45:53Z)
Weakly-supervised Automated Audio Captioning via text only training [1.504795651143257]
We propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model. We evaluate our proposed method on Clotho and AudioCaps datasets demonstrating its ability to achieve a relative performance of up to $83%$ compared to fully supervised approaches.
arXiv Detail & Related papers (2023-09-21T16:40:46Z)
Leveraging Language Model Capabilities for Sound Event Detection [10.792576135806623]
We propose an end-to-end framework for understanding audio features while simultaneously generating sound event and their temporal location. Specifically, we employ pretrained acoustic models to capture discriminative features across different categories and language models for autoregressive text generation.
arXiv Detail & Related papers (2023-08-22T15:59:06Z)
ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z)
Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL) ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers. Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes. With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech. We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
Active Speakers in Context [88.22935329360618]
Current methods for active speak er detection focus on modeling short-term audiovisual information from a single speaker. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.