Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation
- URL: http://arxiv.org/abs/2505.20606v1
- Date: Tue, 27 May 2025 00:55:32 GMT
- Title: Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation
- Authors: Dancheng Liu, Amir Nassereldine, Chenhui Xu, Jinjun Xiong
- Abstract summary: Whisper's robust performance in automatic speech recognition (ASR) is often attributed to its massive 680k-hour training set. We examine how linguistic and acoustic diversity in training data affect the robustness of the ASR model. We find that targeted acoustic augmentation methods can significantly improve the generalization ability of ASR models.
- Score: 18.678742816040856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whisper's robust performance in automatic speech recognition (ASR) is often attributed to its massive 680k-hour training set, an impractical scale for most researchers. In this work, we examine how linguistic and acoustic diversity in training data affect the robustness of the ASR model and reveal that transcription generalization is primarily driven by acoustic variation rather than linguistic richness. We find that targeted acoustic augmentation methods can significantly improve the generalization ability of ASR models, reducing word error rates by up to 19.24 percent on unseen datasets when training on the 960-hour Librispeech dataset. These findings highlight strategic, acoustically focused data augmentation as a promising alternative to massive datasets for building robust ASR models, offering a potential path toward future foundation ASR models when massive human speech data is lacking.
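The abstract names targeted acoustic augmentation without listing the individual transforms here. As a minimal, hedged sketch of two augmentations commonly used for this purpose — additive noise at a chosen SNR and speed perturbation — the following illustrates the idea (function names and parameters are illustrative, not taken from the paper):

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech, scaled so the mixture has the requested SNR (dB)."""
    # Tile or trim the noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale factor that yields the target signal-to-noise ratio.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def speed_perturb(speech: np.ndarray, factor: float) -> np.ndarray:
    """Resample by linear interpolation: factor > 1 speeds up (shorter output)."""
    n_out = int(round(len(speech) / factor))
    old_idx = np.arange(len(speech))
    new_idx = np.linspace(0, len(speech) - 1, n_out)
    return np.interp(new_idx, old_idx, speech)
```

A training pipeline would typically sample the SNR and speed factor at random per utterance; reverberation or codec simulation could be layered in the same way.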
Related papers
- An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR [12.197936305117407]
Augmenting the training data of automatic speech recognition systems with synthetic data generated by text-to-speech (TTS) or voice conversion (VC) has gained popularity in recent years. We leverage recently proposed flow-based TTS/VC models that allow greater speech diversity, and assess the respective impact of augmenting various speech attributes on the word error rate (WER) achieved by several ASR models.
arXiv Detail & Related papers (2025-03-11T23:09:06Z)
- Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap [46.607944227556]
We propose a cost-effective and practical approach to enhancing automatic speech recognition (ASR) performance using text-to-speech (TTS) models.
Experiments on an unprecedentedly rich variety of low-resource datasets demonstrate consistent and substantial performance improvements.
We study factors such as text diversity, speaker diversity, and the volume of synthesized data, with text diversity being studied for the first time in this work.
arXiv Detail & Related papers (2024-10-22T06:25:16Z)
- Dynamic Data Pruning for Automatic Speech Recognition [58.95758272440217]
We introduce Dynamic Data Pruning for ASR (DDP-ASR), which offers fine-grained pruning granularities specifically tailored for speech-related datasets.
Our experiments show that DDP-ASR can save up to 1.6x training time with negligible performance loss.
arXiv Detail & Related papers (2024-06-26T14:17:36Z)
- Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping [1.7593130415737603]
This paper presents an end-to-end Automatic Speech Recognition (ASR) model trained on an extensive dataset of 570k hours of speech audio data.
We generate pseudo-labels for the unlabeled public data using a strong Conformer RNN-T baseline model.
The addition of these pseudo-labeled data yields relative Word Error Rate (WER) improvements of 11.5% and 24.3% for our asynchronous and real-time models, respectively.
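The relative WER figures quoted above are measured against the baseline error rate. A small self-contained sketch of both quantities (helper names are illustrative, not code from the paper):

```python
def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate via Levenshtein edit distance over word lists."""
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i
    for j in range(len(hypothesis) + 1):
        d[0][j] = j
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(reference)

def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction: e.g. 0.200 -> 0.177 is an 11.5% improvement."""
    return (baseline_wer - new_wer) / baseline_wer
```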
arXiv Detail & Related papers (2024-04-10T20:40:24Z)
- Speech Robust Bench: A Robustness Benchmark For Speech Recognition [20.758654420612793]
Speech Robust Bench (SRB) is a benchmark for evaluating the robustness of ASR models to diverse corruptions. SRB is composed of 114 input perturbations that simulate a heterogeneous range of corruptions ASR models may encounter when deployed in the wild.
arXiv Detail & Related papers (2024-03-08T08:10:29Z)
- Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models [48.44820587495038]
Self-supervised representation learning (SSRL) has demonstrated performance superior to supervised models for tasks including phoneme recognition.
Training SSRL models poses a challenge for low-resource languages where sufficient pre-training data may not be available.
We propose to use audio augmentation techniques, namely pitch variation, noise addition, accented target-language speech, and other-language speech, to pre-train SSRL models in a low-resource condition, and evaluate on phoneme recognition.
arXiv Detail & Related papers (2023-09-22T10:09:09Z)
- Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models [2.4654745083407175]
We propose a new multi-round adaptation process that uses uncertainty to automate annotation.
This novel method streamlines data annotation and strategically selects data samples contributing most to model uncertainty.
Our results show that our approach leads to an average 27% relative WER improvement while requiring, on average, 45% less data than established baselines.
arXiv Detail & Related papers (2023-06-03T13:11:37Z)
- Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
- An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition [51.232523987916636]
Differential privacy (DP) is one data-protection avenue that safeguards user information used for training deep models by imposing noisy distortion on private data.
In this work, we extend PATE learning to work with dynamic patterns, namely speech, and present a first experimental study on ASR to avoid acoustic data leakage.
arXiv Detail & Related papers (2022-10-11T16:55:54Z)
- Improving noise robust automatic speech recognition with single-channel time-domain enhancement network [100.1041336974175]
We show that a single-channel time-domain denoising approach can significantly improve ASR performance, and that single-channel noise reduction can still yield gains.
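The paper's time-domain enhancement network is not reproduced in this summary; as a hedged point of reference, the classical single-channel baseline such approaches improve on — magnitude spectral subtraction — can be sketched as follows (parameters and the noise-only-leading-frames assumption are illustrative):

```python
import numpy as np

def spectral_subtraction(noisy: np.ndarray, noise_frames: int = 6,
                         frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Classical magnitude spectral subtraction; the noise spectrum is
    estimated from the first `noise_frames` frames, assumed speech-free."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    # Estimate the noise magnitude spectrum from the leading frames.
    noise_mag = np.mean(
        [np.abs(np.fft.rfft(window * noisy[i * hop: i * hop + frame_len]))
         for i in range(noise_frames)], axis=0)
    for i in range(n_frames):
        seg = window * noisy[i * hop: i * hop + frame_len]
        spec = np.fft.rfft(seg)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # subtract, floor at 0
        clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame_len)
        # Windowed overlap-add reconstruction.
        out[i * hop: i * hop + frame_len] += clean * window
        norm[i * hop: i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)
```

A learned time-domain enhancement network replaces this fixed subtraction rule with a trainable mapping that operates directly on the waveform.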
arXiv Detail & Related papers (2020-03-09T09:36:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.