Multi-task Language Modeling for Improving Speech Recognition of Rare
Words
- URL: http://arxiv.org/abs/2011.11715v4
- Date: Sat, 11 Sep 2021 21:58:38 GMT
- Title: Multi-task Language Modeling for Improving Speech Recognition of Rare
Words
- Authors: Chao-Han Huck Yang, Linda Liu, Ankur Gandhe, Yile Gu, Anirudh Raju,
Denis Filimonov, Ivan Bulyko
- Abstract summary: We propose a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance.
Our best ASR system with multi-task LM shows 4.6% WERR deduction compared with RNN Transducer only ASR baseline for rare words recognition.
- Score: 14.745696312889763
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: End-to-end automatic speech recognition (ASR) systems are increasingly
popular due to their relative architectural simplicity and competitive
performance. However, even though the average accuracy of these systems may be
high, the performance on rare content words often lags behind hybrid ASR
systems. To address this problem, second-pass rescoring is often applied
leveraging upon language modeling. In this paper, we propose a second-pass
system with multi-task learning, utilizing semantic targets (such as intent and
slot prediction) to improve speech recognition performance. We show that our
rescoring model trained with these additional tasks outperforms the baseline
rescoring model, trained with only the language modeling task, by 1.4% on a
general test and by 2.6% on a rare word test set in terms of word-error-rate
relative (WERR). Our best ASR system with multi-task LM shows 4.6% WERR
deduction compared with RNN Transducer only ASR baseline for rare words
recognition.
Related papers
- Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking [68.77659513993507]
We present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy.
Our results show spoken language identification accuracy improvements of 8.7% and 6.1%, respectively, and word error rates which are 3.3% and 2.0% lower on these benchmarks.
arXiv Detail & Related papers (2024-09-27T03:31:32Z) - Improved Contextual Recognition In Automatic Speech Recognition Systems
By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - From English to More Languages: Parameter-Efficient Model Reprogramming
for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - Heterogeneous Reservoir Computing Models for Persian Speech Recognition [0.0]
Reservoir computing models (RC) models have been proven inexpensive to train, have vastly fewer parameters, and are compatible with emergent hardware technologies.
We propose heterogeneous single and multi-layer ESNs to create non-linear transformations of the inputs that capture temporal context at different scales.
arXiv Detail & Related papers (2022-05-25T09:15:15Z) - Cross-Modal ASR Post-Processing System for Error Correction and
Utterance Rejection [25.940199825317073]
We propose a cross-modal post-processing system for speech recognizers.
It fuses acoustic features and textual features from different modalities.
It joints a confidence estimator and an error corrector in multi-task learning fashion.
arXiv Detail & Related papers (2022-01-10T12:29:55Z) - Multi-turn RNN-T for streaming recognition of multi-party speech [2.899379040028688]
This work takes real-time applicability as the first priority in model design and addresses a few challenges in previous work on multi-speaker recurrent neural network transducer (MS-RNN-T)
We introduce on-the-fly overlapping speech simulation during training, yielding 14% relative word error rate (WER) improvement on LibriSpeechMix test set.
We propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture.
arXiv Detail & Related papers (2021-12-19T17:22:58Z) - Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR)
In contrast to conventional unsupervised learning approaches, we adopt the emphmulti-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z) - Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource
End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER)
arXiv Detail & Related papers (2021-03-12T10:10:13Z) - Multi-talker ASR for an unknown number of sources: Joint training of
source counting, separation and ASR [91.87500543591945]
We develop an end-to-end multi-talker automatic speech recognition system for an unknown number of active speakers.
Our experiments show very promising performance in counting accuracy, source separation and speech recognition.
Our system generalizes well to a larger number of speakers than it ever saw during training.
arXiv Detail & Related papers (2020-06-04T11:25:50Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU)
We show that the error rates of off the shelf ASR and following LU systems can be reduced significantly by 14% relative with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.