Audio Contrastive-based Fine-tuning: Decoupling Representation Learning and Classification
- URL: http://arxiv.org/abs/2309.11895v4
- Date: Mon, 22 Sep 2025 01:22:54 GMT
- Title: Audio Contrastive-based Fine-tuning: Decoupling Representation Learning and Classification
- Authors: Yang Wang, Qibin Liang, Chenghao Xiao, Yizhi Li, Noura Al Moubayed, Chenghua Lin,
- Abstract summary: We propose a disentangled two-stage framework that separates representation refinement from downstream evaluation.<n>First, we employ a "contrastive-tuning" stage to explicitly improve the geometric structure of the model's embedding space.<n>We then introduce a dual-probe evaluation protocol to assess the quality of these refined representations from a geometric perspective.
- Score: 26.82307246813389
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Standard fine-tuning of pre-trained audio models couples representation learning with classifier training, which can obscure the true quality of the learned representations. In this work, we advocate for a disentangled two-stage framework that separates representation refinement from downstream evaluation. First, we employ a "contrastive-tuning" stage to explicitly improve the geometric structure of the model's embedding space. Subsequently, we introduce a dual-probe evaluation protocol to assess the quality of these refined representations from a geometric perspective. This protocol uses a linear probe to measure global linear separability and a k-Nearest Neighbours probe to investigate the local structure of class clusters. Our experiments on a diverse set of audio classification tasks show that our framework provides a better foundation for classification, leading to improved accuracy. Our newly proposed dual-probing framework acts as a powerful analytical lens, demonstrating why contrastive learning is more effective by revealing a superior embedding space. It significantly outperforms vanilla fine-tuning, particularly on single-label datasets with a large number of classes, and also surpasses strong baselines on multi-label tasks using a Jaccard-weighted loss. Our findings demonstrate that decoupling representation refinement from classifier training is a broadly effective strategy for unlocking the full potential of pre-trained audio models. Our code will be publicly available.
Related papers
- AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation [55.607230723223346]
This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges.<n>We explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking.<n>We introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on
arXiv Detail & Related papers (2025-07-17T00:39:18Z) - Discrete Audio Tokens: More Than a Survey! [107.69720675124255]
This paper presents a systematic review and benchmark of discrete audio tokenizers.<n>It covers speech, music, and general audio domains.<n>We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains.
arXiv Detail & Related papers (2025-06-12T01:35:43Z) - $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called the Mask-And-Recover (MAR)
MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules.
To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z) - Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information.
These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z) - Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised
Audio-Visual Video Parsing [107.031903351176]
Weakly-separated audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances.
WS-AVVP aims to identify the corresponding event categories with only video-level category labels for training.
arXiv Detail & Related papers (2023-07-05T05:55:10Z) - Multi-source Domain Adaptation for Text-independent Forensic Speaker
Recognition [36.83842373791537]
Adapting speaker recognition systems to new environments is a widely-used technique to improve a well-performing model.
Previous studies focus on single domain adaptation, which neglects a more practical scenario where training data are collected from multiple acoustic domains.
Three novel adaptation methods are proposed to further promote adaptation performance across multiple acoustic domains.
arXiv Detail & Related papers (2022-11-17T22:11:25Z) - Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z) - Fine-Grained Visual Classification using Self Assessment Classifier [12.596520707449027]
Extracting discriminative features plays a crucial role in the fine-grained visual classification task.
In this paper, we introduce a Self Assessment, which simultaneously leverages the representation of the image and top-k prediction classes.
We show that our method achieves new state-of-the-art results on CUB200-2011, Stanford Dog, and FGVC Aircraft datasets.
arXiv Detail & Related papers (2022-05-21T07:41:27Z) - Noise-Tolerant Learning for Audio-Visual Action Recognition [31.641972732424463]
Video datasets are usually coarse-annotated or collected from the Internet.
We propose a noise-tolerant learning framework to find anti-interference model parameters against both noisy labels and noisy correspondence.
Our method significantly improves the robustness of the action recognition model and surpasses the baselines by a clear margin.
arXiv Detail & Related papers (2022-05-16T12:14:03Z) - Dual Path Structural Contrastive Embeddings for Learning Novel Objects [6.979491536753043]
Recent research shows that gaining information on a good feature space can be an effective solution to achieve favorable performance on few-shot tasks.
We propose a simple but effective paradigm that decouples the tasks of learning feature representations and classifiers.
Our method can still achieve promising results for both standard and generalized few-shot problems in either an inductive or transductive inference setting.
arXiv Detail & Related papers (2021-12-23T04:43:31Z) - Prototypical Classifier for Robust Class-Imbalanced Learning [64.96088324684683]
We propose textitPrototypical, which does not require fitting additional parameters given the embedding network.
Prototypical produces balanced and comparable predictions for all classes even though the training set is class-imbalanced.
We test our method on CIFAR-10LT, CIFAR-100LT and Webvision datasets, observing that Prototypical obtains substaintial improvements compared with state of the arts.
arXiv Detail & Related papers (2021-10-22T01:55:01Z) - Noise-Resilient Ensemble Learning using Evidence Accumulation Clustering [1.7188280334580195]
Ensemble learning methods combine multiple algorithms performing the same task to build a group with superior quality.
These systems are well adapted to the distributed setup, where each peer or machine of the network hosts one algorithm and communicate its results to its peers.
However, the network can be corrupted, altering the prediction accuracy of a peer, which has a deleterious effect on the ensemble quality.
We propose a noise-resilient ensemble classification method, which helps to improve accuracy and correct random errors.
arXiv Detail & Related papers (2021-10-18T11:52:45Z) - Partner-Assisted Learning for Few-Shot Image Classification [54.66864961784989]
Few-shot Learning has been studied to mimic human visual capabilities and learn effective models without the need of exhaustive human annotation.
In this paper, we focus on the design of training strategy to obtain an elemental representation such that the prototype of each novel class can be estimated from a few labeled samples.
We propose a two-stage training scheme, which first trains a partner encoder to model pair-wise similarities and extract features serving as soft-anchors, and then trains a main encoder by aligning its outputs with soft-anchors while attempting to maximize classification performance.
arXiv Detail & Related papers (2021-09-15T22:46:19Z) - Neighborhood Contrastive Learning for Novel Class Discovery [79.14767688903028]
We build a new framework, named Neighborhood Contrastive Learning, to learn discriminative representations that are important to clustering performance.
We experimentally demonstrate that these two ingredients significantly contribute to clustering performance and lead our model to outperform state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-06-20T17:34:55Z) - No Fear of Heterogeneity: Classifier Calibration for Federated Learning
with Non-IID Data [78.69828864672978]
A central challenge in training classification models in the real-world federated system is learning with non-IID data.
We propose a novel and simple algorithm called Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated ssian mixture model.
Experimental results demonstrate that CCVR state-of-the-art performance on popular federated learning benchmarks including CIFAR-10, CIFAR-100, and CINIC-10.
arXiv Detail & Related papers (2021-06-09T12:02:29Z) - Class-Balanced Distillation for Long-Tailed Visual Recognition [100.10293372607222]
Real-world imagery is often characterized by a significant imbalance of the number of images per class, leading to long-tailed distributions.
In this work, we introduce a new framework, by making the key observation that a feature representation learned with instance sampling is far from optimal in a long-tailed setting.
Our main contribution is a new training method, that leverages knowledge distillation to enhance feature representations.
arXiv Detail & Related papers (2021-04-12T08:21:03Z) - Robust Audio-Visual Instance Discrimination [79.74625434659443]
We present a self-supervised learning method to learn audio and video representations.
We address the problems of audio-visual instance discrimination and improve transfer learning performance.
arXiv Detail & Related papers (2021-03-29T19:52:29Z) - Improving Deep Learning Sound Events Classifiers using Gram Matrix
Feature-wise Correlations [1.2891210250935146]
In our method, we analyse all the activations of a generic CNN in order to produce feature representations using Gram Matrices.
The proposed approach can be applied to any CNN and our experimental evaluation of four different architectures on two datasets demonstrated that our method consistently improves the baseline models.
arXiv Detail & Related papers (2021-02-23T16:08:02Z) - Learning and Evaluating Representations for Deep One-class
Classification [59.095144932794646]
We present a two-stage framework for deep one-class classification.
We first learn self-supervised representations from one-class data, and then build one-class classifiers on learned representations.
In experiments, we demonstrate state-of-the-art performance on visual domain one-class classification benchmarks.
arXiv Detail & Related papers (2020-11-04T23:33:41Z) - Spectrum-Guided Adversarial Disparity Learning [52.293230153385124]
We propose a novel end-to-end knowledge directed adversarial learning framework.
It portrays the class-conditioned intraclass disparity using two competitive encoding distributions and learns the purified latent codes by denoising learned disparity.
The experiments on four HAR benchmark datasets demonstrate the robustness and generalization of our proposed methods over a set of state-of-the-art.
arXiv Detail & Related papers (2020-07-14T05:46:27Z) - A Sequential Self Teaching Approach for Improving Generalization in
Sound Event Recognition [11.559570255513217]
We propose a sequential self-teaching approach to learning sounds.
It is harder to learn sounds in adverse situations such as from weakly labeled and/or noisy labeled data.
Our proposal is a sequential stage-wise learning process that improves generalization capabilities of a given modeling system.
arXiv Detail & Related papers (2020-06-30T22:53:43Z) - COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio
Representations [32.456824945999465]
We propose a method for learning audio representations, aligning the learned latent representations of audio and associated tags.
We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks.
arXiv Detail & Related papers (2020-06-15T13:17:18Z) - Audio Impairment Recognition Using a Correlation-Based Feature
Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.