Minimum Word Error Rate Training with Language Model Fusion for
End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2106.02302v1
- Date: Fri, 4 Jun 2021 07:24:49 GMT
- Title: Minimum Word Error Rate Training with Language Model Fusion for
End-to-End Speech Recognition
- Authors: Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric
Sun, Jinyu Li, Yifan Gong
- Abstract summary: Internal language model estimation (ILME)-based LM fusion has shown significant word error rate (WER) reduction from Shallow Fusion.
We propose a novel MWER training with ILME (MWER-ILME) where the ILME-based fusion is conducted to generate N-best hypotheses and their posteriors.
MWER-ILME achieves on average 8.8% and 5.8% relative WER reductions from MWER and MWER-SF training, respectively, on 6 different test sets.
- Score: 82.60133751942854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Integrating external language models (LMs) into end-to-end (E2E) models
remains a challenging task for domain-adaptive speech recognition. Recently,
internal language model estimation (ILME)-based LM fusion has shown significant
word error rate (WER) reduction from Shallow Fusion by subtracting a weighted
internal LM score from an interpolation of E2E model and external LM scores
during beam search. However, on different test sets, the optimal LM
interpolation weights vary over a wide range and have to be tuned extensively
on well-matched validation sets. In this work, we perform LM fusion in the
minimum WER (MWER) training of an E2E model to obviate the need for LM weight
tuning during inference. Besides MWER training with Shallow Fusion (MWER-SF),
we propose a novel MWER training with ILME (MWER-ILME) where the ILME-based
fusion is conducted to generate N-best hypotheses and their posteriors.
An additional gradient is induced when the internal LM is engaged in the MWER-ILME
loss computation. During inference, the LM weights pre-determined in MWER training
enable robust LM integration on test sets from different domains. In experiments
with transformer transducers trained on 30K hours of data, MWER-ILME achieves on
average 8.8% and 5.8% relative WER reductions from MWER and MWER-SF training,
respectively, on 6 different test sets.
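The fusion rule and training objective described above lend themselves to a compact illustration. Below is a minimal, self-contained Python sketch of (1) the ILME-based fusion score used during beam search, i.e. the E2E log-score interpolated with a weighted external-LM log-score minus a weighted internal-LM log-score, and (2) an MWER-style loss, i.e. the expected number of word errors over an N-best list weighted by posteriors renormalized from the fused scores. The function names, interpolation weights, and toy hypotheses are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of ILME-based LM fusion scoring
# and an MWER-style expected-word-error loss over an N-best list.
import math

def ilme_fused_score(log_p_e2e, log_p_ext_lm, log_p_int_lm,
                     lambda_ext=0.6, lambda_int=0.3):
    """Per-hypothesis score used during beam search: interpolate the E2E and
    external-LM log-probabilities, then subtract a weighted internal-LM score
    estimated from the E2E model (ILME). Weights are illustrative."""
    return log_p_e2e + lambda_ext * log_p_ext_lm - lambda_int * log_p_int_lm

def mwer_loss(fused_scores, word_errors):
    """Expected number of word errors over the N-best hypotheses, with the
    hypothesis posteriors renormalized from the fused scores and the mean
    error count subtracted as a variance-reducing baseline."""
    max_s = max(fused_scores)                       # numerical stability
    exps = [math.exp(s - max_s) for s in fused_scores]
    z = sum(exps)
    posteriors = [e / z for e in exps]
    mean_err = sum(word_errors) / len(word_errors)
    return sum(p * (e - mean_err) for p, e in zip(posteriors, word_errors))

# Toy N-best list: (E2E, external-LM, internal-LM) log-scores and word errors.
hyps = [(-12.0, -20.0, -18.0, 1),
        (-13.5, -18.0, -19.0, 0),
        (-14.0, -25.0, -17.5, 3)]
scores = [ilme_fused_score(e2e, ext, ilm) for e2e, ext, ilm, _ in hyps]
errors = [err for *_, err in hyps]
print("fused scores:", scores)
print("MWER loss   :", mwer_loss(scores, errors))
```

In the actual systems the three log-scores would come from a transformer transducer, an external neural LM, and an ILME estimate of the internal LM; scalar log-scores stand in for those components here.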
Related papers
- Pre-training Language Model as a Multi-perspective Course Learner [103.17674402415582]
This study proposes a multi-perspective course learning (MCL) method for sample-efficient pre-training.
Three self-supervision courses are designed to alleviate inherent flaws of the "tug-of-war" dynamics.
Our method significantly improves ELECTRA's average performance by 2.8% and 3.2% absolute points on the GLUE and SQuAD 2.0 benchmarks, respectively.
arXiv Detail & Related papers (2023-05-06T09:02:10Z)
- Improving Rare Word Recognition with LM-aware MWER Training [50.241159623691885]
We introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework.
For the shallow fusion setup, we use LMs during both hypothesis generation and loss computation, and the LM-aware MWER-trained model achieves a 10% relative improvement.
For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner.
arXiv Detail & Related papers (2022-04-15T17:19:41Z)
- Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition [80.32546870220979]
We propose an internal LM adaptation (ILMA) of the E2E model using text-only data.
ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost.
In experiments with transformer transducer models trained on 30K hours of data, ILMA achieves up to a 34.9% relative word error rate reduction.
arXiv Detail & Related papers (2021-10-06T23:03:29Z)
- Internal Language Model Training for Domain-Adaptive End-to-End Speech Recognition [83.739317674302]
The internal language model estimation (ILME) method can be used to improve the integration of external language models with automatic speech recognition systems.
We propose an internal LM training (ILMT) method to minimize an additional internal LM loss.
ILMT encourages the E2E model to form a standalone LM inside its existing components, without sacrificing ASR accuracy.
arXiv Detail & Related papers (2021-02-02T08:15:02Z)
- Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition [56.27081731553829]
External language model (LM) integration is a challenging task for end-to-end (E2E) automatic speech recognition.
We propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models.
ILME can alleviate the domain mismatch between training and testing, or improve the multi-domain E2E ASR.
arXiv Detail & Related papers (2020-11-03T20:11:04Z)
- On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer [40.63693071222628]
We study minimum word error rate (MWER) training of the Hybrid Autoregressive Transducer (HAT).
In experiments with around 30,000 hours of training data, we show that MWER training can improve the accuracy of HAT models.
arXiv Detail & Related papers (2020-10-23T21:16:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.