Noisy Training Improves E2E ASR for the Edge
- URL: http://arxiv.org/abs/2107.04677v1
- Date: Fri, 9 Jul 2021 20:56:20 GMT
- Title: Noisy Training Improves E2E ASR for the Edge
- Authors: Dilin Wang, Yuan Shangguan, Haichuan Yang, Pierce Chuang, Jiatong
Zhou, Meng Li, Ganesh Venkatesh, Ozlem Kalinli, Vikas Chandra
- Abstract summary: Automatic speech recognition (ASR) has become increasingly ubiquitous on modern edge devices.
E2E ASR models are prone to overfitting and have difficulty generalizing to unseen test data.
We present a simple yet effective noisy training strategy to further improve E2E ASR model training.
- Score: 22.91184103295888
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automatic speech recognition (ASR) has become increasingly ubiquitous on
modern edge devices. Past work developed streaming End-to-End (E2E) all-neural
speech recognizers that can run compactly on edge devices. However, E2E ASR
models are prone to overfitting and have difficulty generalizing to unseen
test data. Various techniques have been proposed to regularize the training
of ASR models, including layer normalization, dropout, spectrum data
augmentation, and speed distortion of the inputs. In this work, we present a
simple yet effective noisy training strategy to further improve E2E ASR
model training. By introducing random noise into the parameter space during
training, our method can produce smoother models at convergence that generalize
better. We apply noisy training to improve both dense and sparse
state-of-the-art Emformer models and observe consistent WER reduction.
Specifically, when training Emformers with 90% sparsity, we achieve 12% and 14%
WER improvements on the LibriSpeech Test-other and Test-clean sets,
respectively.
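The abstract describes the method only at a high level: random noise is injected into the parameter space during training so that the model converges to a smoother, better-generalizing solution. Below is a minimal sketch of one plausible reading, assuming a PyTorch-style training loop in which Gaussian noise is added to the weights before each forward/backward pass and removed before the optimizer update; the noise scale `sigma` and the `noisy_training_step` helper are illustrative assumptions, not the authors' implementation.

```python
# Sketch of parameter-space noisy training (assumed Gaussian perturbations).
import torch

def noisy_training_step(model, batch, loss_fn, optimizer, sigma=0.01):
    """One training step with weight noise injected around the gradient computation."""
    # Perturb every parameter with Gaussian noise (hypothetical scale `sigma`).
    noises = []
    with torch.no_grad():
        for p in model.parameters():
            noise = torch.randn_like(p) * sigma
            p.add_(noise)
            noises.append(noise)

    # Compute the loss and gradients at the perturbed point in parameter space.
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()

    # Remove the perturbation so only the gradient step moves the clean weights.
    with torch.no_grad():
        for p, noise in zip(model.parameters(), noises):
            p.sub_(noise)

    optimizer.step()
    return loss.item()
```

Evaluating gradients at randomly perturbed weights penalizes sharp minima, since sharp solutions degrade quickly under perturbation; this is consistent with the abstract's claim that the method produces smoother models at convergence.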
Related papers
- MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization [49.00754561435518]
MSRS achieves competitive results in VSR and AVSR with 21.1% and 0.9% WER on the LRS3 benchmark, while reducing training time by at least 2x.
We explore other sparse approaches and show that only MSRS enables training from scratch by implicitly masking the weights affected by vanishing gradients.
arXiv Detail & Related papers (2024-06-25T15:00:43Z)
- Not All Steps are Equal: Efficient Generation with Progressive Diffusion Models [62.155612146799314]
We propose a novel two-stage training strategy termed Step-Adaptive Training.
In the initial stage, a base denoising model is trained to encompass all timesteps.
In the second stage, we partition the timesteps into distinct groups and fine-tune the model within each group to achieve specialized denoising capabilities.
arXiv Detail & Related papers (2023-12-20T03:32:58Z)
- D4AM: A General Denoising Framework for Downstream Acoustic Models [45.04967351760919]
Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems.
Existing SE training objectives are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems.
We propose a general denoising framework, D4AM, for various downstream acoustic models.
arXiv Detail & Related papers (2023-11-28T08:27:27Z)
- Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline that aims to reduce the large training overhead by squeezing the complex training-time block into a single convolution.
Compared with state-of-the-art re-parameterization models, OREPA reduces the training-time memory cost by about 70% and accelerates training by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
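OREPA's actual training-time blocks are more elaborate than a single conv+BN pair, but the standard conv+BN folding below shows the core re-parameterization trick such methods build on; `fold_conv_bn` is an illustrative helper, not OREPA's API.

```python
# Illustration of re-parameterization: a Conv2d followed by BatchNorm2d
# collapses into a single Conv2d with rescaled weights and a new bias,
# so the merged layer computes the same function at inference time.
import torch
import torch.nn as nn

def fold_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                        # per-channel gamma / std
    with torch.no_grad():
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_(bn.bias + (bias - bn.running_mean) * scale)
    return fused
```

OREPA generalizes this idea to fold larger training-time blocks into one convolution, which is where its memory and speed savings come from.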
- A Likelihood Ratio based Domain Adaptation Method for E2E Models [10.510472957585646]
End-to-end (E2E) automatic speech recognition models like the Recurrent Neural Network Transducer (RNN-T) are becoming a popular choice for streaming ASR applications such as voice assistants.
While E2E models are very effective at learning representations of their training data, their accuracy on unseen domains remains a challenging problem.
In this work, we explore a likelihood-ratio-based contextual biasing approach that leverages text data sources to adapt the RNN-T model to new domains and entities.
arXiv Detail & Related papers (2022-01-10T21:22:39Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach using only 16% of the labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- MEST: Accurate and Fast Memory-Economic Sparse Training Framework on the Edge [72.16021611888165]
This paper proposes a novel Memory-Economic Sparse Training (MEST) framework targeting accurate and fast execution on edge devices.
The proposed MEST framework consists of enhancements by Elastic Mutation (EM) and Soft Memory Bound (&S).
Our results suggest that unforgettable examples can be identified in-situ even during the dynamic exploration of sparsity masks.
arXiv Detail & Related papers (2021-10-26T21:15:17Z)
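MEST's Elastic Mutation periodically updates which weights are active during sparse training. As a rough illustration of the general prune-and-regrow pattern (not MEST's actual EM / soft-memory-bound algorithm, whose details differ), a minimal mask-mutation step might look like the sketch below, assuming a float 0/1 mask and an illustrative `mutate_frac` hyperparameter.

```python
# Generic dynamic-sparse-training sketch: prune the smallest-magnitude active
# weights, regrow at random zeros. An assumption-laden stand-in, not MEST.
import torch

def mutate_mask(weight, mask, mutate_frac=0.05):
    """Swap a fraction of active weights: prune the weakest, regrow at random.

    `mask` is a float tensor of 0s and 1s with the same shape as `weight`.
    """
    k = int(mutate_frac * mask.sum().item())
    if k == 0:
        return mask
    flat_mask = mask.clone().reshape(-1)
    inactive = (flat_mask == 0).nonzero().squeeze(1)
    # Prune: the k smallest-magnitude active weights (inactive positions are
    # pushed out of range so topk never selects them).
    scores = weight.abs().reshape(-1) + (1 - flat_mask) * 1e9
    prune_idx = torch.topk(scores, k, largest=False).indices
    flat_mask[prune_idx] = 0
    # Regrow: k random positions that were inactive before this mutation.
    grow_idx = inactive[torch.randperm(len(inactive))[:k]]
    flat_mask[grow_idx] = 1
    return flat_mask.view_as(mask)
```

Applying `weight * mask` in the forward pass keeps the model at the target sparsity while the mask keeps exploring, which is the "dynamic exploration of sparsity masks" the summary refers to.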
- SynthASR: Unlocking Synthetic Data for Speech Recognition [15.292920497489925]
We propose to utilize synthetic speech for ASR training (SynthASR) in applications where real data is sparse or hard to obtain for ASR model training.
In our experiments conducted on in-house datasets for a new application of recognizing medication names, training ASR RNN-T models with synthetic audio improved recognition performance on the new application by more than 65% relative.
arXiv Detail & Related papers (2021-06-14T23:26:44Z)