uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio Mixtures
- URL: http://arxiv.org/abs/2403.09579v1
- Date: Thu, 14 Mar 2024 17:13:37 GMT
- Title: uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio Mixtures
- Authors: Afrina Tabassum, Dung Tran, Trung Dang, Ismini Lourentzou, Kazuhito Koishida,
- Abstract summary: Masked Autoencoders (MAEs) learn rich low-level representations from unlabeled data.
ID emphasizes high-level semantics, offering a potential solution to alleviate annotation requirements in MAEs.
We introduce uaMix-MAE, an efficient ID tuning strategy that leverages unsupervised audio mixtures.
- Score: 16.59243476473915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Autoencoders (MAEs) learn rich low-level representations from unlabeled data but require substantial labeled data to effectively adapt to downstream tasks. Conversely, Instance Discrimination (ID) emphasizes high-level semantics, offering a potential solution to alleviate annotation requirements in MAEs. Although combining these two approaches can address downstream tasks with limited labeled data, naively integrating ID into MAEs leads to extended training times and high computational costs. To address this challenge, we introduce uaMix-MAE, an efficient ID tuning strategy that leverages unsupervised audio mixtures. Utilizing contrastive tuning, uaMix-MAE aligns the representations of pretrained MAEs, thereby facilitating effective adaptation to task-specific semantics. To optimize the model with small amounts of unlabeled data, we propose an audio mixing technique that manipulates audio samples in both input and virtual label spaces. Experiments in low/few-shot settings demonstrate that \modelname achieves 4-6% accuracy improvements over various benchmarks when tuned with limited unlabeled data, such as AudioSet-20K. Code is available at https://github.com/PLAN-Lab/uamix-MAE
Related papers
- Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models [60.857389526958485]
MATA is a training-free method that dynamically pushes LALMs to pay textbfMore textbfAttention textbfTo textbfAudio tokens within the self-attention mechanism.<n>Experiments on the MMAU and MMAR benchmarks confirm MATA's effectiveness, with consistent performance gains.
arXiv Detail & Related papers (2025-09-23T09:02:15Z) - MUDAS: Mote-scale Unsupervised Domain Adaptation in Multi-label Sound Classification [4.60781637717838]
Unsupervised Domain Adaptation (UDA) is essential for adapting machine learning models to new, unlabeled environments.<n>Mote-scale Unsupervised Domain Adaptation for Sounds (MUDAS) is a framework developed for multi-label sound classification in resource-constrained IoT settings.
arXiv Detail & Related papers (2025-06-12T22:02:07Z) - From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs.<n>These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks.<n>We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z) - Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples [55.2480439325792]
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs.<n>These models often hallucinate non-existent sound events, reducing their reliability in real-world applications.<n>We propose LISTEN, a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds.
arXiv Detail & Related papers (2025-05-20T15:44:01Z) - Make Some Noise: Towards LLM audio reasoning and generation using sound tokens [19.48089933713418]
We introduce a novel approach that combines Variational Quantization with Flow Matching to convert audio into ultra-low discrete tokens of 0.23kpbs.
Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events.
arXiv Detail & Related papers (2025-03-28T09:43:47Z) - Enhancing Sample Utilization in Noise-Robust Deep Metric Learning With Subgroup-Based Positive-Pair Selection [84.78475642696137]
The existence of noisy labels in real-world data negatively impacts the performance of deep learning models.
We propose a noise-robust DML framework with SubGroup-based Positive-pair Selection (SGPS)
SGPS constructs reliable positive pairs for noisy samples to enhance the sample utilization.
arXiv Detail & Related papers (2025-01-19T14:41:55Z) - Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization.
To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z) - Universal Sound Separation with Self-Supervised Audio Masked Autoencoder [35.560261097213846]
We propose integrating a self-supervised pre-trained model, namely the audio masked autoencoder (A-MAE), into a universal sound separation system.
The proposed methods successfully enhance the separation performance of a state-of-the-art ResUNet-based USS model.
arXiv Detail & Related papers (2024-07-16T14:11:44Z) - Anomalous Sound Detection using Audio Representation with Machine ID
based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z) - Learning from Training Dynamics: Identifying Mislabeled Data Beyond
Manually Designed Features [43.41573458276422]
We introduce a novel learning-based solution, leveraging a noise detector, instanced by an LSTM network.
The proposed method trains the noise detector in a supervised manner using the dataset with synthesized label noises.
Results show that the proposed method precisely detects mislabeled samples on various datasets without further adaptation.
arXiv Detail & Related papers (2022-12-19T09:39:30Z) - Learning from Noisy Labels with Coarse-to-Fine Sample Credibility
Modeling [22.62790706276081]
Training deep neural network (DNN) with noisy labels is practically challenging.
Previous efforts tend to handle part or full data in a unified denoising flow.
We propose a coarse-to-fine robust learning method called CREMA to handle noisy data in a divide-and-conquer manner.
arXiv Detail & Related papers (2022-08-23T02:06:38Z) - SimT: Handling Open-set Noise for Domain Adaptive Semantic Segmentation [58.61946589036262]
This paper studies a practical domain adaptive (DA) semantic segmentation problem where only pseudo-labeled target data is accessible through a black-box model.
Due to the domain gap and label shift between two domains, pseudo-labeled target data contains mixed closed-set and open-set label noises.
We propose a simplex noise transition matrix (SimT) to model the mixed noise distributions in DA semantic segmentation and formulate the problem as estimation of SimT.
arXiv Detail & Related papers (2022-03-29T02:48:08Z) - Harnessing Hard Mixed Samples with Decoupled Regularizer [69.98746081734441]
Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data.
In this paper, we propose an efficient mixup objective function with a decoupled regularizer named Decoupled Mixup (DM)
DM can adaptively utilize hard mixed samples to mine discriminative features without losing the original smoothness of mixup.
arXiv Detail & Related papers (2022-03-21T07:12:18Z) - Environmental sound analysis with mixup based multitask learning and
cross-task fusion [0.12891210250935145]
acoustic scene classification and acoustic event classification are two closely related tasks.
In this letter, a two-stage method is proposed for the above tasks.
The proposed method has confirmed the complementary characteristics of acoustic scene and acoustic event classifications.
arXiv Detail & Related papers (2021-03-30T05:11:53Z) - MixSpeech: Data Augmentation for Low-resource Automatic Speech
Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR)
We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation.
arXiv Detail & Related papers (2021-02-25T03:40:43Z) - Suppressing Mislabeled Data via Grouping and Self-Attention [60.14212694011875]
Deep networks achieve excellent results on large-scale clean data but degrade significantly when learning from noisy labels.
This paper proposes a conceptually simple yet efficient training block, termed as Attentive Feature Mixup (AFM)
It allows paying more attention to clean samples and less to mislabeled ones via sample interactions in small groups.
arXiv Detail & Related papers (2020-10-29T13:54:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.