Improving Generalization of Deep Neural Network Acoustic Models with
Length Perturbation and N-best Based Label Smoothing
- URL: http://arxiv.org/abs/2203.15176v1
- Date: Tue, 29 Mar 2022 01:40:22 GMT
- Title: Improving Generalization of Deep Neural Network Acoustic Models with
Length Perturbation and N-best Based Label Smoothing
- Authors: Xiaodong Cui, George Saon, Tohru Nagano, Masayuki Suzuki, Takashi
Fukuda, Brian Kingsbury, Gakuto Kurata
- Abstract summary: We introduce two techniques to improve generalization of deep neural network (DNN) acoustic models for automatic speech recognition (ASR).
Length perturbation is a data augmentation algorithm that randomly drops and inserts frames of an utterance to alter the length of the speech feature sequence.
N-best based label smoothing randomly injects noise into ground-truth labels during training to avoid overfitting, where the noisy labels are generated from n-best hypotheses.
- Score: 49.82147684491619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce two techniques, length perturbation and n-best based label
smoothing, to improve generalization of deep neural network (DNN) acoustic
models for automatic speech recognition (ASR). Length perturbation is a data
augmentation algorithm that randomly drops and inserts frames of an utterance
to alter the length of the speech feature sequence. N-best based label
smoothing randomly injects noise into ground-truth labels during training to
avoid overfitting, where the noisy labels are generated from n-best
hypotheses. We evaluate these two techniques extensively on the 300-hour
Switchboard (SWB300) dataset and an in-house 500-hour Japanese (JPN500) dataset
using recurrent neural network transducer (RNNT) acoustic models for ASR. We
show that each technique individually improves the generalization of RNNT
models and that the two are complementary. In particular, they yield good
improvements over a strong SWB300 baseline and give state-of-the-art performance on
SWB300 using RNNT models.
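The two techniques described in the abstract can be sketched in a few lines. The following is a minimal, hypothetical Python illustration of the ideas only; the function names, probability parameters, and frame/label representations are assumptions, not the authors' implementation:

```python
import random

def length_perturb(frames, drop_prob=0.1, insert_prob=0.1):
    """Length perturbation: randomly drop frames and insert (duplicate)
    frames to alter the length of a speech feature sequence.
    drop_prob/insert_prob are hypothetical per-frame probabilities."""
    out = []
    for f in frames:
        if random.random() < drop_prob:
            continue          # drop this frame, shortening the sequence
        out.append(f)
        if random.random() < insert_prob:
            out.append(f)     # insert a copy, lengthening the sequence
    return out

def nbest_label_smooth(ground_truth, nbest_hyps, noise_prob=0.2):
    """N-best based label smoothing: with probability noise_prob, replace
    the ground-truth label sequence with a randomly chosen n-best
    hypothesis, injecting label noise during training."""
    if random.random() < noise_prob:
        return random.choice(nbest_hyps)
    return ground_truth
```

In training, length perturbation would presumably be applied on the fly to each utterance's feature frames, while the n-best hypotheses used for label smoothing would come from decoding the training data with an existing model.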
Related papers
- Memory-augmented conformer for improved end-to-end long-form ASR [9.876354589883002]
We propose a memory-augmented neural network between the encoder and decoder of a conformer.
This external memory can improve generalization to longer utterances.
We show that the proposed system outperforms the baseline conformer without memory for long utterances.
arXiv Detail & Related papers (2023-09-22T17:44:58Z)
- Dilated convolutional neural network for detecting extreme-mass-ratio inspirals [8.809900732195281]
We introduce DECODE, an end-to-end model focusing on EMRI signal detection by sequence modeling in the frequency domain.
We evaluate our model on 1-year data with accumulated SNR ranging from 50 to 120 and achieve a true positive rate of 96.3% at a false positive rate of 1%.
arXiv Detail & Related papers (2023-08-31T03:16:38Z)
- NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
arXiv Detail & Related papers (2022-09-29T04:06:00Z)
- Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs) represented by long short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
- Speech-enhanced and Noise-aware Networks for Robust Speech Recognition [25.279902171523233]
A noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition.
The two proposed systems achieve word error rate (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task.
Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively.
arXiv Detail & Related papers (2022-03-25T15:04:51Z)
- Automated Atrial Fibrillation Classification Based on Denoising Stacked Autoencoder and Optimized Deep Network [1.7403133838762446]
The incidences of atrial fibrillation (AFib) are increasing at a daunting rate worldwide.
For the early detection of the risk of AFib, we have developed an automatic detection system based on deep neural networks.
An end-to-end model is proposed to denoise the electrocardiogram signals using denoising autoencoders (DAE).
arXiv Detail & Related papers (2022-01-26T21:45:48Z)
- Reducing Exposure Bias in Training Recurrent Neural Network Transducers [37.53697357406185]
We investigate approaches to reducing exposure bias in training to improve the generalization of RNNT models for automatic speech recognition.
We show that we can further improve the accuracy of a high-performance RNNT ASR model and obtain state-of-the-art results on the 300-hour Switchboard dataset.
arXiv Detail & Related papers (2021-08-24T15:43:42Z)
- Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning [60.20150317299749]
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning.
To make full use of the training data, we propose a full data learning method for speech enhancement.
arXiv Detail & Related papers (2020-11-11T06:32:37Z)
- Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyperparameters of state-of-the-art factored time delay neural networks (TDNNs)
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
arXiv Detail & Related papers (2020-05-07T06:24:47Z)
- RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions [73.45995446500312]
We analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models.
We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference.
arXiv Detail & Related papers (2020-05-07T06:24:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.