PACE: Pretrained Audio Continual Learning
- URL: http://arxiv.org/abs/2602.03355v1
- Date: Tue, 03 Feb 2026 10:28:35 GMT
- Title: PACE: Pretrained Audio Continual Learning
- Authors: Chang Li, Kanglei Zhou, Liyuan Wang
- Abstract summary: We present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs). In addition, we introduce spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability. Experiments on six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines.
- Score: 27.605574463021693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs), together with a comprehensive analysis of its unique challenges. Unlike in vision, where parameter-efficient fine-tuning (PEFT) has proven effective for CL, directly transferring such strategies to audio leads to poor performance. This stems from a fundamental property of audio backbones: they focus on low-level spectral details rather than structured semantics, causing severe upstream-downstream misalignment. Through extensive empirical study, we identify analytic classifiers with first-session adaptation (FSA) as a promising direction, but also reveal two major limitations: representation saturation in coarse-grained scenarios and representation drift in fine-grained scenarios. To address these challenges, we propose PACE, a novel method that enhances FSA via a regularized analytic classifier and enables multi-session adaptation through adaptive subspace-orthogonal PEFT for improved semantic alignment. In addition, we introduce spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability. Experiments on six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines, marking an important step toward robust and scalable audio continual learning with PTMs.
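The abstract's key ingredient, an analytic classifier with first-session adaptation (FSA), is typically a closed-form ridge-regression head that absorbs each session's features without revisiting old data. The sketch below illustrates that general mechanism via a recursive least-squares update; the class name, regularization form, and API are illustrative assumptions, not the actual PACE implementation.

```python
import numpy as np

class RegularizedAnalyticClassifier:
    """Illustrative ridge-regularized least-squares head, updated session
    by session in closed form (block recursive least squares). This is a
    generic sketch of the FSA-style analytic classifier the abstract
    discusses, not the method proposed in the paper."""

    def __init__(self, feat_dim, num_classes, reg=1.0):
        # R approximates (X^T X + reg * I)^{-1}; W holds the class weights.
        self.R = np.eye(feat_dim) / reg
        self.W = np.zeros((feat_dim, num_classes))

    def fit_session(self, X, Y):
        # Absorb one session's backbone features X (n x d) and one-hot
        # labels Y (n x k) without storing any previous session's data.
        K = self.R @ X.T @ np.linalg.inv(np.eye(len(X)) + X @ self.R @ X.T)
        self.R = self.R - K @ X @ self.R      # rank-n Woodbury downdate
        self.W = self.W + K @ (Y - X @ self.W)  # correct W with new residuals

    def predict(self, X):
        return np.argmax(X @ self.W, axis=1)
```

Because each update is exact, the weights after several sessions equal the ridge solution over all data seen so far, which is why such heads avoid classifier-level forgetting even though the frozen backbone's representations may still saturate or drift, the two failure modes the paper targets.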
Related papers
- A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection [2.432576583937997]
Spoof-SUPERB is a benchmark for audio deepfake detection. We evaluate 20 SSL models spanning generative, discriminative, and spectrogram-based architectures.
arXiv Detail & Related papers (2026-03-02T05:45:55Z) - Self-Supervised Learning for Speaker Recognition: A study and review [0.0]
Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL for Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work aims to highlight recent trends and advancements, identifying current challenges in the field.
arXiv Detail & Related papers (2026-02-11T13:16:07Z) - Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z) - CORD: Bridging the Audio-Text Reasoning Gap via Weighted On-policy Cross-modal Distillation [32.72685791637924]
We propose CORD, a unified alignment framework that performs online cross-modal self-distillation. Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model. Empirical results across multiple benchmarks demonstrate that CORD consistently enhances audio-conditioned reasoning.
arXiv Detail & Related papers (2026-01-23T08:31:24Z) - Harmonizing the Arabic Audio Space with Data Scheduling [15.84874997729878]
This paper presents the first systematic study of multi-task instruction tuning for an Arabic-centric audio LLM. We fine-tune Qwen2.5-Omni (7B) and propose Task-Progressive Curriculum (TPC) along with Aligner-Based Diverse Sampling (ADS). Our results reveal a critical trade-off between efficiency and robustness: while ADS accelerates initial convergence, its inherent gradient volatility can destabilize generative decoding under prolonged training.
arXiv Detail & Related papers (2026-01-18T17:08:31Z) - SEE: Signal Embedding Energy for Quantifying Noise Interference in Large Audio Language Models [49.313324100819955]
Signal Embedding Energy (SEE) is a method for quantifying the impact of noise intensity on LALM inputs. SEE exhibits a strong correlation with LALM performance, achieving a correlation of 0.98. This paper introduces a novel metric for noise quantification in LALMs, providing guidance for robustness improvements in real-world deployments.
arXiv Detail & Related papers (2026-01-12T08:57:55Z) - High-Fidelity Speech Enhancement via Discrete Audio Tokens [35.61634772862795]
DAC-SE1 is a language model-based SE framework leveraging discrete high-resolution audio representations. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and in a MUSHRA human evaluation.
arXiv Detail & Related papers (2025-10-02T16:38:05Z) - MATPAC++: Enhanced Masked Latent Prediction for Self-Supervised Audio Representation Learning [9.580895202050947]
Masked latent prediction has emerged as a leading paradigm in self-supervised learning (SSL). This work proposes a novel enhancement: integrating Multiple Choice Learning (MCL) to explicitly model prediction ambiguity and improve representation quality.
arXiv Detail & Related papers (2025-08-18T08:10:07Z) - AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation [113.75682363364004]
AURORA is a framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
arXiv Detail & Related papers (2025-08-04T07:47:38Z) - Implicit Counterfactual Learning for Audio-Visual Segmentation [50.69377287012591]
We propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches. We introduce the multi-granularity implicit text (MIT), involving video-, segment-, and frame-level representations, as the bridge to establish the modality-shared space.
arXiv Detail & Related papers (2025-07-28T11:46:35Z) - Advancing Test-Time Adaptation in Wild Acoustic Test Settings [26.05732574338255]
Speech signals follow short-term consistency, requiring specialized adaptation strategies.
We propose a novel wild acoustic TTA method tailored for ASR fine-tuned acoustic foundation models.
Our approach outperforms existing baselines under various wild acoustic test settings.
arXiv Detail & Related papers (2023-10-14T06:22:08Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach that fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.