Advancing Test-Time Adaptation in Wild Acoustic Test Settings
- URL: http://arxiv.org/abs/2310.09505v2
- Date: Sat, 05 Oct 2024 08:00:47 GMT
- Title: Advancing Test-Time Adaptation in Wild Acoustic Test Settings
- Authors: Hongfu Liu, Hengguan Huang, Ye Wang,
- Abstract summary: Speech signals follow short-term consistency, requiring specialized adaptation strategies.
We propose a novel wild acoustic TTA method tailored for ASR fine-tuned acoustic foundation models.
Our approach outperforms existing baselines under various wild acoustic test settings.
- Score: 26.05732574338255
- License:
- Abstract: Acoustic foundation models, fine-tuned for Automatic Speech Recognition (ASR), suffer from performance degradation in wild acoustic test settings when deployed in real-world scenarios. Stabilizing online Test-Time Adaptation (TTA) under these conditions remains an open and unexplored question. Existing wild vision TTA methods often fail to handle speech data effectively due to the unique characteristics of high-entropy speech frames, which are unreliably filtered out even when containing crucial semantic content. Furthermore, unlike static vision data, speech signals follow short-term consistency, requiring specialized adaptation strategies. In this work, we propose a novel wild acoustic TTA method tailored for ASR fine-tuned acoustic foundation models. Our method, Confidence-Enhanced Adaptation, performs frame-level adaptation using a confidence-aware weight scheme to avoid filtering out essential information in high-entropy frames. Additionally, we apply consistency regularization during test-time optimization to leverage the inherent short-term consistency of speech signals. Our experiments on both synthetic and real-world datasets demonstrate that our approach outperforms existing baselines under various wild acoustic test settings, including Gaussian noise, environmental sounds, accent variations, and sung speech.
Related papers
- CLAP-S: Support Set Based Adaptation for Downstream Fiber-optic Acoustic Recognition [28.006925515022882]
Contrastive Language-Audio Pretraining (CLAP) models have demonstrated unprecedented performance in acoustic signal recognition tasks.
We propose a support-based adaptation method, CLAP-S, which linearly interpolates a CLAP Adapter with the Support Set.
Experimental results show that our method delivers competitive performance on both laboratory-recorded fiber-optic ESC-50 datasets and a real-world fiber-optic gunshot-firework dataset.
arXiv Detail & Related papers (2025-01-16T23:22:17Z) - Towards Robust Transcription: Exploring Noise Injection Strategies for Training Data Augmentation [55.752737615873464]
This study investigates the impact of white noise at various Signal-to-Noise Ratio (SNR) levels on state-of-the-art APT models.
We hope this research provides valuable insights as preliminary work toward developing transcription models that maintain consistent performance across a range of acoustic conditions.
arXiv Detail & Related papers (2024-10-18T02:31:36Z) - Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation [25.410770364140856]
Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain.
This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs)
We introduce the notion of dynamic perturbation, which can inject controlled perturbations into the noise embeddings during inference.
arXiv Detail & Related papers (2024-09-03T02:29:01Z) - Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network [23.034147003704483]
This study is focused on recovering from packet loss to improve the word error rate (WER) of ASR models.
We propose using a front-end adaptation network connected to a frozen ASR model.
Experiments demonstrate that the adaptation network, trained on Whisper's criteria, notably reduces word error rates across domains and languages.
arXiv Detail & Related papers (2024-06-27T06:40:01Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - AR-TTA: A Simple Method for Real-World Continual Test-Time Adaptation [1.4530711901349282]
We propose to validate test-time adaptation methods using datasets for autonomous driving, namely CLAD-C and SHIFT.
We observe that current test-time adaptation methods struggle to effectively handle varying degrees of domain shift.
We enhance the well-established self-training framework by incorporating a small memory buffer to increase model stability.
arXiv Detail & Related papers (2023-09-18T19:34:23Z) - Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - Factorised Speaker-environment Adaptive Training of Conformer Speech
Recognition Systems [31.813788489512394]
This paper proposes a novel factorised speaker-environment adaptive training and test time adaptation approach for Conformer ASR models.
Experiments on the 300-hr WHAM noise corrupted Switchboard data suggest that factorised adaptation consistently outperforms the baseline.
Further analysis shows the proposed method offers potential for rapid adaption to unseen speaker-environment conditions.
arXiv Detail & Related papers (2023-06-26T11:32:05Z) - Listen, Adapt, Better WER: Source-free Single-utterance Test-time
Adaptation for Automatic Speech Recognition [65.84978547406753]
Test-time Adaptation aims to adapt the model trained on source domains to yield better predictions for test samples.
Single-Utterance Test-time Adaptation (SUTA) is the first TTA study in speech area to our best knowledge.
arXiv Detail & Related papers (2022-03-27T06:38:39Z) - AdaStereo: An Efficient Domain-Adaptive Stereo Matching Approach [50.855679274530615]
We present a novel domain-adaptive approach called AdaStereo to align multi-level representations for deep stereo matching networks.
Our models achieve state-of-the-art cross-domain performance on multiple benchmarks, including KITTI, Middlebury, ETH3D and DrivingStereo.
Our method is robust to various domain adaptation settings, and can be easily integrated into quick adaptation application scenarios and real-world deployments.
arXiv Detail & Related papers (2021-12-09T15:10:47Z) - Cross-domain Adaptation with Discrepancy Minimization for
Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.