TAIL: Text-Audio Incremental Learning
- URL: http://arxiv.org/abs/2503.04258v1
- Date: Thu, 06 Mar 2025 09:39:36 GMT
- Title: TAIL: Text-Audio Incremental Learning
- Authors: Yingfei Sun, Xu Gu, Wei Ji, Hanbin Zhao, Hao Fei, Yifang Yin, Roger Zimmermann
- Abstract summary: Introducing new datasets may affect the feature space of the original dataset, leading to catastrophic forgetting. We introduce a novel task, Text-Audio Incremental Learning (TAIL), for text-audio retrieval, and propose a new method, PTAT (Prompt Tuning for Audio-Text incremental learning).
- Score: 40.43860056218282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many studies combine text and audio to capture multi-modal information, but they overlook the model's ability to generalize to new datasets. Introducing new datasets may affect the feature space of the original dataset, leading to catastrophic forgetting. Meanwhile, the large parameter counts of such models make full training expensive. To address these limitations, we introduce a novel task, Text-Audio Incremental Learning (TAIL), for text-audio retrieval, and propose a new method, PTAT (Prompt Tuning for Audio-Text incremental learning). This method uses prompt tuning to optimize a small set of model parameters while an audio-text similarity and feature distillation module mitigates catastrophic forgetting. We benchmark our method and previous incremental learning methods on the AudioCaps, Clotho, BBC Sound Effects, and AudioSet datasets; our method outperforms previous methods significantly, demonstrating notably stronger resistance to forgetting on older datasets. Compared to the full-parameter sequential fine-tuning (Finetune) baseline, our model requires only 2.42% of its parameters while achieving 4.46% higher performance.
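The abstract names two concrete ingredients: prompt tuning over frozen encoders, and a feature-distillation term against the previous model's outputs. As a rough illustration only (hypothetical class and function names, not the released PTAT code; it assumes encoders that take embedded token sequences and return pooled features), a PyTorch sketch might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptTunedRetriever(nn.Module):
    """Frozen audio/text encoders plus a small set of learnable prompts.

    Only the prompts are trained per task, which is how the trainable
    parameter count stays at a few percent of full fine-tuning."""

    def __init__(self, audio_encoder, text_encoder, prompt_len=8, dim=512):
        super().__init__()
        self.audio_encoder = audio_encoder.eval()
        self.text_encoder = text_encoder.eval()
        for p in self.audio_encoder.parameters():
            p.requires_grad_(False)
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        # Learnable prompt tokens, prepended to each modality's input sequence.
        self.audio_prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        self.text_prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

    def forward(self, audio_tokens, text_tokens):
        b = audio_tokens.size(0)
        a_in = torch.cat([self.audio_prompt.expand(b, -1, -1), audio_tokens], dim=1)
        t_in = torch.cat([self.text_prompt.expand(b, -1, -1), text_tokens], dim=1)
        a = F.normalize(self.audio_encoder(a_in), dim=-1)  # (b, dim)
        t = F.normalize(self.text_encoder(t_in), dim=-1)   # (b, dim)
        return a, t

def distillation_loss(a_new, t_new, a_old, t_old, tau=0.05):
    """Keep the current audio-text similarity structure close to the
    previous model's: the usual way a feature-distillation term counters
    forgetting on older datasets."""
    sim_new = a_new @ t_new.t() / tau
    sim_old = (a_old @ t_old.t() / tau).softmax(dim=-1)
    return F.kl_div(sim_new.log_softmax(dim=-1), sim_old, reduction="batchmean")
```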
Related papers
- Language-based Audio Retrieval with Co-Attention Networks [22.155383794829977]
We introduce a novel framework for the language-based audio retrieval task. We propose a cascaded co-attention architecture, in which co-attention modules are stacked or iterated to refine the semantic alignment between text and audio. Experiments on two public datasets show that the proposed method outperforms the state-of-the-art method.
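As an illustration of what "stacked or iterated" co-attention can mean (a generic sketch, not the paper's actual architecture), two cross-attention directions can be interleaved and cascaded:

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One round of bidirectional cross-attention between text and audio."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.t2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, text, audio):
        # Text queries attend over audio frames, and vice versa.
        t, _ = self.t2a(text, audio, audio)
        a, _ = self.a2t(audio, text, text)
        return self.norm_t(text + t), self.norm_a(audio + a)

# "Cascaded": stack several blocks so each round refines the alignment.
blocks = nn.ModuleList(CoAttentionBlock() for _ in range(3))
text = torch.randn(4, 20, 512)    # (batch, caption tokens, dim)
audio = torch.randn(4, 100, 512)  # (batch, audio frames, dim)
for blk in blocks:
    text, audio = blk(text, audio)
```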
arXiv Detail & Related papers (2024-12-30T12:49:55Z)
- Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization.
To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
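The summary invokes preference optimization without pinning down the objective; one common instantiation is a DPO-style loss, sketched below with hypothetical inputs (summed log-probabilities of a preferred and a rejected generation under the tuned model and a frozen reference model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style preference loss: push the tuned model to favour the
    'preferred' generation (here, audio closer to the target dataset)
    over the rejected one, relative to a frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-probabilities.
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
```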
arXiv Detail & Related papers (2024-10-02T22:05:36Z)
- AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations [1.2101820447447276]
Multi-modal learning in the audio-language domain has seen significant advancements in recent years.
However, audio-language learning faces challenges due to limited and lower-quality data compared to image-language tasks.
Our method systematically generates audio-caption pairs by augmenting audio clips with natural language labels and corresponding audio signal processing operations.
This scalable method produces AudioSetMix, a high-quality training dataset for text-and-audio related models.
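To make the label-matched augmentation idea concrete, here is a toy sketch (not the AudioSetMix pipeline; the operations and caption edits are invented for illustration) in which a DSP operation and the caption are modified together:

```python
import numpy as np

def augment_pair(audio: np.ndarray, caption: str):
    """Toy label-matched augmentation: apply a signal-processing
    operation and rewrite the caption so the text still describes
    the resulting audio."""
    op = np.random.choice(["quiet", "loud", "fade_in"])
    if op == "quiet":
        return audio * 0.3, caption + " at a low volume"
    if op == "loud":
        return np.clip(audio * 2.0, -1.0, 1.0), caption + " at a high volume"
    fade = np.linspace(0.0, 1.0, audio.shape[0])
    return audio * fade, caption + " fading in"

audio = np.random.uniform(-0.1, 0.1, 16000)  # one second at 16 kHz
aug_audio, aug_caption = augment_pair(audio, "a dog barks")
```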
arXiv Detail & Related papers (2024-05-17T21:08:58Z)
- Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation [46.657781785006506]
We introduce the mini-batch Learning-to-match (m-LTM) framework for audio-text retrieval problems.
We conduct experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50.
Our proposed method learns a rich and expressive joint embedding space and achieves SOTA performance.
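The "transportation" framing can be made concrete with a standard Sinkhorn iteration over a mini-batch cost matrix. The sketch below is a generic entropic-OT matcher under uniform marginals, not the authors' m-LTM code:

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, eps=0.1, iters=50):
    """Entropic OT plan for a mini-batch cost matrix, uniform marginals.

    In an m-LTM-style setup the cost could be 1 - cosine similarity
    between audio and text embeddings; the soft plan replaces the hard
    one-to-one matching of a plain contrastive loss."""
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)
    nu = torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)
    u = torch.ones(n)
    for _ in range(iters):
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]  # plan; each row sums to 1/n

a = F.normalize(torch.randn(8, 512), dim=-1)  # audio embeddings
t = F.normalize(torch.randn(8, 512), dim=-1)  # text embeddings
plan = sinkhorn(1.0 - a @ t.t())
loss = (plan * (1.0 - a @ t.t())).sum()  # OT matching cost to minimise
```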
arXiv Detail & Related papers (2024-05-16T13:28:10Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Enhancing Black-Box Few-Shot Text Classification with Prompt-Based Data Augmentation [42.05617728412819]
We show how to optimize few-shot text classification without accessing the gradients of the large-scale language models.
Our approach, dubbed BT-Classifier, significantly outperforms state-of-the-art black-box few-shot learners.
arXiv Detail & Related papers (2023-05-23T07:54:34Z)
- Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model [23.058939018350603]
Large language models (LLMs) enable many useful capabilities, such as instruction- and chain-of-thought-based fine-tuning.
We adopt such an instruction-tuned LLM Flan-T5 as the text encoder for text-to-audio (TTA) generation.
Our approach, TANGO, outperforms the state-of-the-art AudioLDM on most metrics and remains comparable on the rest on the AudioCaps test set.
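The text-encoding step described here can be reproduced with the Hugging Face transformers library; the latent diffusion model itself is omitted, so this only shows how a frozen Flan-T5 encoder would produce the conditioning embedding:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Frozen instruction-tuned Flan-T5 encoder producing the prompt
# embedding that a latent diffusion model would be conditioned on.
tok = AutoTokenizer.from_pretrained("google/flan-t5-large")
enc = T5EncoderModel.from_pretrained("google/flan-t5-large").eval()

with torch.no_grad():
    batch = tok(["a dog barking in the rain"], return_tensors="pt")
    text_emb = enc(**batch).last_hidden_state  # (1, seq_len, hidden)
```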
arXiv Detail & Related papers (2023-04-24T07:45:28Z)
- Improving Natural-Language-based Audio Retrieval with Transfer Learning and Audio & Text Augmentations [7.817685358710508]
We propose a system to project recordings and textual descriptions into a shared audio-caption space.
Our results show that the augmentation strategies used reduce overfitting and improve retrieval performance.
We further show that pre-training the system on the AudioCaps dataset leads to additional improvements.
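Projection into a shared audio-caption space is typically trained with a symmetric contrastive objective; the sketch below is a generic InfoNCE loss of that kind (an assumption, since the summary does not name the loss):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(audio_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE over a batch: matching audio-caption pairs sit
    on the diagonal of the similarity matrix; all other entries are
    negatives."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.t() / tau
    target = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))

loss = clip_style_loss(torch.randn(16, 512), torch.randn(16, 512))
```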
arXiv Detail & Related papers (2022-08-24T11:54:42Z)
- Learning Phone Recognition from Unpaired Audio and Phone Sequences Based on Generative Adversarial Network [58.82343017711883]
This paper investigates how to learn directly from unpaired phone sequences and speech utterances.
GAN training is adopted in the first stage to find the mapping relationship between unpaired speech and phone sequences.
In the second stage, an HMM is introduced to train on the generator's output, which boosts performance.
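A minimal sketch of the first (adversarial) stage, with invented shapes and no HMM refinement, might pair a frame-level phone-posterior generator with a discriminator over phone sequences:

```python
import torch
import torch.nn as nn

N_PHONES, DIM = 40, 80

# Generator maps speech frames to phone posteriors; the discriminator
# tells them apart from real (unpaired) phone sequences.
generator = nn.Sequential(nn.Linear(DIM, 256), nn.ReLU(),
                          nn.Linear(256, N_PHONES), nn.Softmax(dim=-1))
discriminator = nn.Sequential(nn.Linear(N_PHONES, 256), nn.ReLU(),
                              nn.Linear(256, 1))

bce = nn.BCEWithLogitsLoss()
speech = torch.randn(4, 120, DIM)                         # unpaired speech
real = nn.functional.one_hot(torch.randint(N_PHONES, (4, 120)),
                             N_PHONES).float()            # unpaired phones

fake = generator(speech)
d_loss = bce(discriminator(real), torch.ones(4, 120, 1)) + \
         bce(discriminator(fake.detach()), torch.zeros(4, 120, 1))
g_loss = bce(discriminator(fake), torch.ones(4, 120, 1))
```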
arXiv Detail & Related papers (2022-07-29T09:29:28Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining.
We extend our method to video-side modalities and show that we can effectively reduce the number of modalities used at test time.
Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
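The summary does not spell out the distillation target; a plausible minimal reading, sketched below with hypothetical tensors, has the student's retrieval similarity matrix regressed onto similarities produced by several teacher text encoders:

```python
import torch
import torch.nn.functional as F

def teachtext_style_loss(student_sim, teacher_sims):
    """Generalized distillation in the spirit of TeachText: regress the
    student's text-video similarity matrix onto the (averaged)
    similarities from several pretrained text-encoder teachers. Applied
    only at training time, so inference cost is unchanged."""
    target = torch.stack(teacher_sims).mean(dim=0)
    return F.mse_loss(student_sim, target)

student = torch.randn(8, 8)                        # student similarities
teachers = [torch.randn(8, 8) for _ in range(3)]   # one per text encoder
loss = teachtext_style_loss(student, teachers)
```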
arXiv Detail & Related papers (2021-04-16T17:55:28Z)