T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
- URL: http://arxiv.org/abs/2404.17806v1
- Date: Sat, 27 Apr 2024 07:05:48 GMT
- Authors: Yi Yuan, Zhuo Chen, Xubo Liu, Haohe Liu, Xuenan Xu, Dongya Jia, Yuanzhe Chen, Mark D. Plumbley, Wenwu Wang
- Abstract summary: Contrastive language-audio pretraining (CLAP) has been developed to align the representations of audio and language.
We introduce T-CLAP, a temporal-enhanced CLAP model, to capture temporal information within audio and text features.
T-CLAP shows improved capability in capturing the temporal relationship of sound events and outperforms state-of-the-art models by a significant margin.
- Score: 38.604112878493396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive language-audio pretraining (CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks. However, current CLAP struggles to capture temporal information within audio and text features, presenting substantial limitations for tasks such as audio retrieval and generation. To address this gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use Large Language Models (LLMs) and mixed-up strategies to generate temporal-contrastive captions for audio clips from extensive audio-text datasets. Subsequently, a new temporal-focused contrastive loss is designed to fine-tune the CLAP model by incorporating these synthetic data. We conduct comprehensive experiments and analysis in multiple downstream tasks. T-CLAP shows improved capability in capturing the temporal relationship of sound events and outperforms state-of-the-art models by a significant margin.
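The abstract does not spell out the temporal-focused contrastive loss, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a standard symmetric CLAP-style contrastive loss in which each audio clip additionally sees one temporally reversed caption as a hard negative. The helper names (`swap_event_order`, `contrastive_loss_with_temporal_negatives`), the " followed by " connective, and the temperature of 0.07 are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def swap_event_order(caption, connective=" followed by "):
    """Hypothetical helper: reverse the order of two events in a caption.

    Captions without the connective are returned unchanged.
    """
    first, sep, second = caption.partition(connective)
    if not sep:
        return caption
    return second + connective + first


def contrastive_loss_with_temporal_negatives(audio, text, text_neg,
                                             temperature=0.07):
    """Symmetric contrastive loss with one extra hard negative per clip.

    audio, text, text_neg: arrays of shape (batch, dim). Row i of `text`
    is the matching caption for row i of `audio`; row i of `text_neg` is
    an embedding of a temporally altered caption (assumed to be given).
    """
    a = l2_normalize(audio)
    t = l2_normalize(text)
    n = l2_normalize(text_neg)
    idx = np.arange(a.shape[0])

    sim = a @ t.T / temperature                                # (B, B)
    neg = np.sum(a * n, axis=-1, keepdims=True) / temperature  # (B, 1)

    # Audio-to-text: in-batch captions plus the clip's own hard negative.
    a2t_logits = np.concatenate([sim, neg], axis=1)            # (B, B + 1)
    loss_a2t = -log_softmax(a2t_logits)[idx, idx].mean()

    # Text-to-audio: plain in-batch contrastive term.
    loss_t2a = -log_softmax(sim.T)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)
```

In this sketch, a negative caption whose embedding sits close to the correct caption raises the loss, which is the pressure that would push a text encoder to distinguish event order.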
Related papers
- Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models [53.48409081555687]
In this work, we explore such large pre-trained models to obtain features, i.e. CLIP for visual features, and CLAP for audio features.
We propose a simple yet effective model that only relies on feed-forward neural networks.
Our framework achieves state-of-the-art performance on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL.
arXiv Detail & Related papers (2024-04-09T13:39:37Z) - tinyCLAP: Distilling Constrastive Language-Audio Pretrained Models [2.9619090219410515]
This paper investigates how we can reduce the complexity of contrastive language-audio pre-trained models.
We derive a unimodal distillation loss from first principles and explore how the dimensionality of the shared, multimodal latent space can be reduced.
TinyCLAP uses only 6% of the original Microsoft CLAP parameters with a minimal reduction (less than 5%) in zero-shot classification performance.
arXiv Detail & Related papers (2023-11-24T14:45:53Z) - Weakly-supervised Automated Audio Captioning via text only training [1.504795651143257]
We propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model.
We evaluate our proposed method on the Clotho and AudioCaps datasets, demonstrating its ability to achieve a relative performance of up to 83% compared to fully supervised approaches.
arXiv Detail & Related papers (2023-09-21T16:40:46Z) - Furnishing Sound Event Detection with Language Model Abilities [11.435984426303419]
We propose an elegant method that aligns audio features and text features to accomplish sound event classification and temporal location.
The framework consists of an acoustic encoder, a contrastive module that aligns the corresponding representations of the text and audio, and a decoupled language decoder.
arXiv Detail & Related papers (2023-08-22T15:59:06Z) - SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces [10.895310812568084]
We train a CLIP-based model with the aim to learn shared representations of phonetic and acoustic spaces.
Results show that the proposed model is sensitive to phonetic changes.
We provide empirical evidence showing that the resulting embeddings are useful for a variety of downstream applications.
arXiv Detail & Related papers (2023-07-23T22:18:47Z) - ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z) - End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z) - Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z) - Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)