Advancing Test-Time Adaptation for Acoustic Foundation Models in
Open-World Shifts
- URL: http://arxiv.org/abs/2310.09505v1
- Date: Sat, 14 Oct 2023 06:22:08 GMT
- Title: Advancing Test-Time Adaptation for Acoustic Foundation Models in
Open-World Shifts
- Authors: Hongfu Liu, Hengguan Huang, Ye Wang
- Abstract summary: Test-Time Adaptation (TTA) is a critical paradigm for tackling distribution shifts during inference.
We introduce a learning-based adaptation enriched by confidence enhancement.
Our experiments on synthetic and real-world datasets affirm our method's superiority over existing baselines.
- Score: 29.28582280403953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-Time Adaptation (TTA) is a critical paradigm for tackling distribution
shifts during inference, especially in visual recognition tasks. However, while
acoustic models face similar challenges due to distribution shifts in test-time
speech, TTA techniques specifically designed for acoustic modeling in the
context of open-world data shifts remain scarce. This gap is further
exacerbated when considering the unique characteristics of acoustic foundation
models: 1) they are primarily built on transformer architectures with layer
normalization and 2) they deal with test-time speech data of varying lengths in
a non-stationary manner. These aspects make the direct application of
vision-focused TTA methods, which are mostly reliant on batch normalization and
assume independent samples, infeasible. In this paper, we delve into TTA for
pre-trained acoustic models facing open-world data shifts. We find that noisy,
high-entropy speech frames, often non-silent, carry key semantic content.
Traditional TTA methods might inadvertently filter out this information using
potentially flawed heuristics. In response, we introduce a heuristic-free,
learning-based adaptation enriched by confidence enhancement. Noting the
short-term consistency of speech signals, we also apply consistency
regularization during test-time optimization. Our experiments on synthetic and
real-world datasets affirm our method's superiority over existing baselines.
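To make the recipe concrete, here is a minimal PyTorch sketch of one adaptation step combining confidence-weighted entropy minimization (in place of hard entropy-threshold filtering) with short-term consistency regularization. It illustrates the general idea only, not the authors' implementation; the model interface, the soft confidence weighting, and the coefficient `lam` are assumptions.

```python
import torch
import torch.nn.functional as F

def tta_step(model, optimizer, frames, lam=0.3):
    """One test-time adaptation step on a single utterance.

    frames: (T, D) acoustic features; model returns (T, C) frame logits.
    """
    logits = model(frames)
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)

    # Confidence-weighted entropy: rather than discarding high-entropy
    # frames with a hard heuristic threshold, weight each frame softly,
    # so noisy but semantically rich frames still contribute.
    entropy = -(probs * log_probs).sum(dim=-1)            # (T,)
    confidence = probs.max(dim=-1).values.detach()        # soft weight in (0, 1]
    entropy_loss = (confidence * entropy).mean()

    # Short-term consistency: neighboring frames should predict similar
    # distributions, so penalize KL(prev_frame || next_frame).
    consistency_loss = F.kl_div(log_probs[1:], probs[:-1].detach(),
                                reduction="batchmean")

    loss = entropy_loss + lam * consistency_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, only a small parameter subset, such as the layer-normalization scales and biases the abstract highlights, would typically be handed to the optimizer.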
Related papers
- Test-Time Training for Speech Enhancement [2.9598903898834497]
This paper introduces a novel application of Test-Time Training (TTT) for Speech Enhancement.
It addresses the challenges posed by unpredictable noise conditions and domain shifts.
We show consistent improvements across speech quality metrics, outperforming the baseline model.
arXiv Detail & Related papers (2025-08-03T17:02:55Z)
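In generic form, TTT adapts a copy of the model on each incoming sample with a self-supervised objective before producing the output. The sketch below shows only that outer loop; the masked-reconstruction loss and the enhancer interface are illustrative assumptions, not this paper's actual auxiliary task.

```python
import copy
import torch
import torch.nn.functional as F

def ttt_enhance(enhancer, noisy, steps=3, lr=1e-4, mask_ratio=0.2):
    """Generic test-time training: adapt a copy of the enhancer on a
    self-supervised masked-reconstruction task for this one input,
    then run enhancement with the adapted weights."""
    model = copy.deepcopy(enhancer)   # never mutate the deployed model
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        mask = (torch.rand_like(noisy) > mask_ratio).float()
        pred = model(noisy * mask)    # reconstruct from the masked input
        loss = F.l1_loss(pred * (1 - mask), noisy * (1 - mask))
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(noisy)
```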
- Adaptive Control Attention Network for Underwater Acoustic Localization and Domain Adaptation [8.017203108408973]
Localizing acoustic sound sources in the ocean is a challenging task due to the complex and dynamic nature of the environment.
We propose a multi-branch network architecture designed to accurately predict the distance between a moving acoustic source and a receiver.
Our proposed method outperforms state-of-the-art (SOTA) approaches in similar settings.
arXiv Detail & Related papers (2025-06-20T18:13:30Z)
- E-BATS: Efficient Backpropagation-Free Test-Time Adaptation for Speech Foundation Models [11.696474872520808]
Speech Foundation Models encounter significant performance degradation when deployed in real-world scenarios involving acoustic domain shifts.
Test-time adaptation (TTA) has emerged as a viable strategy to address such domain shifts at inference time without requiring access to source data or labels.
E-BATS is the first Efficient BAckpropagation-free TTA framework designed explicitly for speech foundation models.
arXiv Detail & Related papers (2025-06-08T10:33:37Z)
- Noise Augmented Fine Tuning for Mitigating Hallucinations in Large Language Models [1.0579965347526206]
Large language models (LLMs) often produce inaccurate or misleading content, known as hallucinations.
Noise-Augmented Fine-Tuning (NoiseFiT) is a novel framework that leverages adaptive noise injection to enhance model robustness.
NoiseFiT selectively perturbs layers identified as either high-SNR (more robust) or low-SNR (potentially under-regularized) using dynamically scaled Gaussian noise.
arXiv Detail & Related papers (2025-04-04T09:27:19Z)
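As a rough sketch of the core operation, SNR-adaptive Gaussian noise injection into a layer's hidden states might look like the following; the SNR estimate and scaling rule here are simplified stand-ins, not NoiseFiT's actual formulation.

```python
import torch

def inject_adaptive_noise(hidden, base_sigma=0.01, eps=1e-8):
    """Perturb one transformer layer's hidden states with Gaussian noise
    whose scale adapts to a crude SNR estimate (mean magnitude over
    standard deviation). Both high- and low-SNR layers may be selected
    for perturbation; the scaling rule is illustrative only.

    hidden: (batch, seq_len, dim) activations of one layer.
    """
    snr = hidden.abs().mean() / (hidden.std() + eps)
    sigma = base_sigma * snr.clamp(max=10.0)   # dynamically scaled noise level
    return hidden + sigma * torch.randn_like(hidden)
```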
- Boosting the Transferability of Audio Adversarial Examples with Acoustic Representation Optimization [4.720552406377147]
We propose a technique that aligns adversarial perturbations with low-level acoustic characteristics derived from speech representation models.
Our method is plug-and-play and can be integrated with any existing attack methods.
arXiv Detail & Related papers (2025-03-25T12:14:10Z)
- CLAP-S: Support Set Based Adaptation for Downstream Fiber-optic Acoustic Recognition [28.006925515022882]
Contrastive Language-Audio Pretraining (CLAP) models have demonstrated unprecedented performance in acoustic signal recognition tasks.
We propose a support-based adaptation method, CLAP-S, which linearly interpolates a CLAP Adapter with the Support Set.
Experimental results show that our method delivers competitive performance on both laboratory-recorded fiber-optic ESC-50 datasets and a real-world fiber-optic gunshot-firework dataset.
arXiv Detail & Related papers (2025-01-16T23:22:17Z)
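The interpolation itself is easy to sketch. Below, adapter logits are blended with Tip-Adapter-style cache logits computed from the support set; the cache construction and the `alpha` and `beta` values are assumptions about the general recipe rather than CLAP-S's exact design.

```python
import torch
import torch.nn.functional as F

def clap_s_logits(audio_emb, adapter, support_embs, support_labels,
                  num_classes, alpha=0.5, beta=5.0):
    """Blend an adapter's predictions with a support-set cache.

    audio_emb:      (B, D) CLAP audio embeddings (L2-normalized).
    adapter:        maps (B, D) -> (B, num_classes) logits.
    support_embs:   (N, D) embeddings of labeled support clips.
    support_labels: (N,) integer class labels.
    """
    adapter_logits = adapter(audio_emb)
    # Cache logits: cosine similarity to each support clip, sharpened,
    # then aggregated into one score per class.
    sims = torch.exp(beta * (audio_emb @ support_embs.T - 1.0))   # (B, N)
    onehot = F.one_hot(support_labels, num_classes).float()       # (N, C)
    support_logits = sims @ onehot                                # (B, C)
    # Linear interpolation between the adapter and the support set.
    return alpha * adapter_logits + (1.0 - alpha) * support_logits
```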
- Towards Robust Transcription: Exploring Noise Injection Strategies for Training Data Augmentation [55.752737615873464]
This study investigates the impact of white noise at various Signal-to-Noise Ratio (SNR) levels on state-of-the-art automatic piano transcription (APT) models.
We hope this research provides valuable insights as preliminary work toward developing transcription models that maintain consistent performance across a range of acoustic conditions.
arXiv Detail & Related papers (2024-10-18T02:31:36Z)
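Mixing in white noise at a prescribed SNR is a small, standard computation; this helper is a generic formulation, not necessarily the study's exact augmentation pipeline.

```python
import torch

def add_white_noise(signal, snr_db):
    """Mix in white Gaussian noise so the result has the requested
    signal-to-noise ratio in dB. signal: 1-D waveform tensor."""
    signal_power = signal.pow(2).mean()
    noise = torch.randn_like(signal)
    noise_power = noise.pow(2).mean()
    # Scale noise so that 10*log10(signal_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise
```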
- Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation [25.410770364140856]
Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain.
This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs).
We introduce the notion of dynamic stochastic perturbation, which can inject controlled perturbations into the noise embeddings during inference.
arXiv Detail & Related papers (2024-09-03T02:29:01Z)
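Read plainly, the perturbation step adds a controlled random offset to the extracted noise embedding before it conditions generation; the per-sample scale schedule below is purely illustrative.

```python
import torch

def perturb_noise_embedding(noise_emb, sigma_range=(0.05, 0.2)):
    """Dynamic stochastic perturbation (illustrative): draw a fresh
    perturbation scale per sample, then add Gaussian noise of that
    scale to the noise embedding, so each inference pass simulates a
    slightly different noise condition.
    noise_emb: (B, D) embeddings from a noise extractor."""
    lo, hi = sigma_range
    sigma = lo + (hi - lo) * torch.rand(noise_emb.size(0), 1)  # per-sample scale
    return noise_emb + sigma * torch.randn_like(noise_emb)
```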
- Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance [42.90024643696503]
We present an end-to-end learning solution to jointly optimise the models for audio enhancement.
We consider four representative applications to evaluate our training paradigm.
arXiv Detail & Related papers (2024-08-12T16:23:58Z)
- Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network [23.034147003704483]
This study focuses on recovering from packet loss to reduce the word error rate (WER) of ASR models.
We propose using a front-end adaptation network connected to a frozen ASR model.
Experiments demonstrate that the adaptation network, trained with Whisper's criteria, notably reduces word error rates across domains and languages.
arXiv Detail & Related papers (2024-06-27T06:40:01Z)
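The overall wiring, a small trainable front-end feeding a frozen recognizer, can be sketched as follows. The convolutional residual architecture and the `frozen_asr.loss` interface are placeholders, not the paper's design or Whisper's API.

```python
import torch
import torch.nn as nn

class FrontEndAdapter(nn.Module):
    """Small trainable network that repairs lossy audio features before
    they reach a frozen ASR model (architecture is illustrative)."""
    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, dim, kernel_size=5, padding=2),
        )

    def forward(self, feats):            # feats: (B, dim, T)
        return feats + self.net(feats)   # residual correction

def training_step(adapter, frozen_asr, lossy_feats, targets, optimizer):
    """Only the adapter is updated; the ASR model's parameters stay
    frozen and it merely supplies the training criterion (an assumed
    sequence-loss interface)."""
    repaired = adapter(lossy_feats)
    loss = frozen_asr.loss(repaired, targets)   # assumed interface
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```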
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- AR-TTA: A Simple Method for Real-World Continual Test-Time Adaptation [1.4530711901349282]
We propose to validate test-time adaptation methods using datasets for autonomous driving, namely CLAD-C and SHIFT.
We observe that current test-time adaptation methods struggle to effectively handle varying degrees of domain shift.
We enhance the well-established self-training framework by incorporating a small memory buffer to increase model stability.
arXiv Detail & Related papers (2023-09-18T19:34:23Z)
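The stabilizing component can be rendered generically as a small replay buffer whose samples are mixed into each adaptation batch; the capacity, sampling scheme, and mixing ratio below are assumptions, not AR-TTA's exact choices.

```python
import random
import torch

class MemoryBuffer:
    """Tiny reservoir of past samples replayed during test-time
    adaptation to keep the model from drifting (sizes illustrative)."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.samples = []
        self.seen = 0

    def add(self, x):
        """Reservoir sampling: every seen sample has equal probability
        of residing in the buffer."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(x)
        else:
            i = random.randrange(self.seen)
            if i < self.capacity:
                self.samples[i] = x

    def mix_batch(self, batch, k=8):
        """Concatenate up to k replayed samples onto the current batch."""
        if not self.samples:
            return batch
        replay = random.sample(self.samples, min(k, len(self.samples)))
        return torch.cat([batch, torch.stack(replay)], dim=0)
```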
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems [31.813788489512394]
This paper proposes a novel factorised speaker-environment adaptive training and test time adaptation approach for Conformer ASR models.
Experiments on the 300-hour WHAM noise-corrupted Switchboard data suggest that factorised adaptation consistently outperforms the baseline.
Further analysis shows the proposed method offers potential for rapid adaptation to unseen speaker-environment conditions.
arXiv Detail & Related papers (2023-06-26T11:32:05Z)
- Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition [65.84978547406753]
Test-time Adaptation aims to adapt the model trained on source domains to yield better predictions for test samples.
Single-Utterance Test-time Adaptation (SUTA) is, to the best of our knowledge, the first TTA study in the speech area.
arXiv Detail & Related papers (2022-03-27T06:38:39Z)
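In outline, source-free single-utterance TTA reduces to a few entropy-minimization steps on one utterance with an episodic reset afterwards; the step count, learning rate, and greedy decode below are illustrative, not SUTA's reported configuration.

```python
import copy
import torch
import torch.nn.functional as F

def adapt_and_decode(asr_model, utterance, steps=5, lr=2e-4):
    """Episodic single-utterance TTA: copy the source model, take a few
    entropy-minimization steps on this utterance alone, decode, then
    discard the adapted copy. No source data or labels are needed."""
    model = copy.deepcopy(asr_model)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        logits = model(utterance)                      # (T, vocab) frame logits
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()
    with torch.no_grad():
        return model(utterance).argmax(-1)             # greedy decode
```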
- AdaStereo: An Efficient Domain-Adaptive Stereo Matching Approach [50.855679274530615]
We present a novel domain-adaptive approach called AdaStereo to align multi-level representations for deep stereo matching networks.
Our models achieve state-of-the-art cross-domain performance on multiple benchmarks, including KITTI, Middlebury, ETH3D and DrivingStereo.
Our method is robust to various domain adaptation settings, and can be easily integrated into quick adaptation application scenarios and real-world deployments.
arXiv Detail & Related papers (2021-12-09T15:10:47Z)
- Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
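Partial fine-tuning of the high-level layers, with the rest of the pre-trained CNN frozen, follows a standard pattern; the parameter-name prefixes below are hypothetical and would need to match the actual architecture.

```python
import torch

def freeze_low_level(model, trainable_prefixes=("layer4", "embedding_head")):
    """Freeze everything except the high-level blocks, so clean
    in-domain speech only adapts the top of the network. The prefix
    names are hypothetical placeholders."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    return [p for p in model.parameters() if p.requires_grad]

# Usage sketch: build the optimizer over the unfrozen subset only.
# params = freeze_low_level(cnn)
# opt = torch.optim.SGD(params, lr=1e-3, momentum=0.9)
```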