Internal Language Model Training for Domain-Adaptive End-to-End Speech
Recognition
- URL: http://arxiv.org/abs/2102.01380v1
- Date: Tue, 2 Feb 2021 08:15:02 GMT
- Title: Internal Language Model Training for Domain-Adaptive End-to-End Speech
Recognition
- Authors: Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy,
Eric Sun, Liang Lu, Xie Chen, Jinyu Li, Yifan Gong
- Abstract summary: The internal language model estimation (ILME) method can be used to improve the integration of external language models with automatic speech recognition systems.
We propose an internal LM training (ILMT) method to minimize an additional internal LM loss.
ILMT encourages the E2E model to form a standalone LM inside its existing components, without sacrificing ASR accuracy.
- Score: 83.739317674302
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The efficacy of external language model (LM) integration with existing
end-to-end (E2E) automatic speech recognition (ASR) systems can be improved
significantly using the internal language model estimation (ILME) method. In
this method, the internal LM score is subtracted from the score obtained by
interpolating the E2E score with the external LM score, during inference. To
improve the ILME-based inference, we propose an internal LM training (ILMT)
method to minimize an additional internal LM loss by updating only the E2E
model components that affect the internal LM estimation. ILMT encourages the
E2E model to form a standalone LM inside its existing components, without
sacrificing ASR accuracy. After ILMT, the more modular E2E model with matched
training and inference criteria enables a more thorough elimination of the
source-domain internal LM, and therefore leads to a more effective integration
of the target-domain external LM. In experiments with recurrent neural network
transducer and attention-based encoder-decoder models trained on 30K hours of
speech, ILMT with ILME-based inference achieves up to 31.5% and 11.4% relative
word error rate reductions from standard E2E training with Shallow Fusion on
out-of-domain LibriSpeech and in-domain Microsoft production test sets,
respectively.
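To make the two ideas concrete, here is a minimal PyTorch-style sketch of what the abstract describes: an ILMT-style objective that adds a weighted internal-LM loss to the E2E loss, and ILME-based scoring that subtracts the internal-LM score from the usual shallow-fusion interpolation. All names and weights (ilmt_loss, ilme_score, alpha, lambda_ext, lambda_ilm) are illustrative assumptions, not the authors' code.

```python
import torch

def ilmt_loss(e2e_nll: torch.Tensor,
              ilm_nll: torch.Tensor,
              alpha: float = 0.4) -> torch.Tensor:
    # ILMT-style objective (sketch): standard E2E negative log-likelihood
    # plus a weighted internal-LM NLL. Per the paper, the ILM term updates
    # only the components that define the internal LM (e.g. the RNN-T
    # prediction network); `alpha` is an assumed weight.
    return e2e_nll + alpha * ilm_nll

def ilme_score(log_p_e2e: float, log_p_ext_lm: float, log_p_ilm: float,
               lambda_ext: float = 0.6, lambda_ilm: float = 0.3) -> float:
    # ILME-based inference (sketch): interpolate the E2E score with the
    # external-LM score, then subtract the estimated internal-LM score.
    # Plain Shallow Fusion is the special case lambda_ilm = 0.
    return log_p_e2e + lambda_ext * log_p_ext_lm - lambda_ilm * log_p_ilm
```

In beam search, ilme_score would be applied per partial hypothesis, with both lambdas tuned on a development set as is standard for fusion weights.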
Related papers
- Acoustic Model Fusion for End-to-end Speech Recognition [7.431401982826315]
End-to-end (E2E) speech recognition systems implicitly model all conventional ASR components, such as the acoustic model (AM) and the language model (LM).
We propose the integration of an external AM into the E2E system to better address the domain mismatch.
The approach achieves a significant word error rate reduction, with a drop of up to 14.3% across varied test sets.
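The summary does not spell out the fusion rule; below is a generic log-linear sketch of adding an external acoustic-model score alongside the E2E score, mirroring how shallow fusion adds an LM score. The function name and weight beta are assumptions, not necessarily the paper's formulation.

```python
def am_fusion_score(log_p_e2e: float, log_p_ext_am: float,
                    beta: float = 0.2) -> float:
    # Generic log-linear AM fusion (sketch): add a weighted external
    # acoustic-model score to the E2E hypothesis score; `beta` would be
    # tuned on a development set, like an LM fusion weight.
    return log_p_e2e + beta * log_p_ext_am
```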
arXiv Detail & Related papers (2023-10-10T23:00:17Z)
- Decoupled Structure for Improved Adaptability of End-to-End Models [16.195423291103975]
This paper proposes decoupled structures for attention-based encoder-decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models.
The acoustic and linguistic parts of the E2E model decoder (or prediction network) are decoupled, making the linguistic component replaceable.
Experiments for E2E ASR models trained on the Libri-100h corpus showed that the proposed decoupled structure gave 15.1% and 17.2% relative word error rate reductions.
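As a rough illustration of the decoupling, the sketch below keeps the decoder's linguistic part behind a small interface so a source-domain module can be swapped for a target-domain one without touching the acoustic components. All class and method names are illustrative assumptions.

```python
import torch

class DecoupledDecoder(torch.nn.Module):
    # Sketch: the joint/acoustic part stays fixed, while the linguistic
    # module (e.g. an RNN-T prediction network) is replaceable.
    def __init__(self, linguistic: torch.nn.Module, joint: torch.nn.Module):
        super().__init__()
        self.linguistic = linguistic
        self.joint = joint

    def swap_linguistic(self, new_module: torch.nn.Module) -> None:
        # Domain adaptation by module replacement rather than retraining;
        # the new module must share the old one's output dimensionality.
        self.linguistic = new_module
```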
arXiv Detail & Related papers (2023-08-25T12:31:12Z)
- On Language Model Integration for RNN Transducer based Speech Recognition [49.84285563767935]
We study various ILM correction-based LM integration methods formulated in a common RNN-T framework.
We provide a decoding interpretation of two major reasons for the performance improvement with ILM correction.
We also propose an exact-ILM training framework by extending the proof given in the hybrid autoregressive transducer.
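For context, a widely used way to estimate an RNN-T's internal LM in ILME-style corrections is to run the joint network with the acoustic (encoder) input zeroed out, so only the prediction-network contribution remains. A minimal sketch, with assumed module names and matching dimensions:

```python
import torch

def internal_lm_logits(joint: torch.nn.Module,
                       pred_out: torch.Tensor) -> torch.Tensor:
    # Internal-LM estimation (sketch): feed zeros in place of the encoder
    # output so the joint network scores tokens from linguistic context
    # alone. Assumes joint(enc, pred) -> logits and, for simplicity, that
    # the encoder and prediction outputs share a dimensionality.
    zero_enc = torch.zeros_like(pred_out)
    return joint(zero_enc, pred_out)
```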
arXiv Detail & Related papers (2021-10-13T16:30:46Z)
- Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition [80.32546870220979]
We propose an internal LM adaptation (ILMA) of the E2E model using text-only data.
ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost.
In experiments with transformer transducer models trained on 30K hours of speech, ILMA achieves up to 34.9% relative word error rate reduction.
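A sketch of the text-only loop this implies: freeze the acoustic components, then minimize the internal-LM cross-entropy on target-domain text, with a KL term keeping the adapted internal LM close to the unadapted one so ASR accuracy is not degraded. The weight kld_weight and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def ilma_step(ilm_logits: torch.Tensor,      # adapted internal LM, [N, V]
              src_ilm_logits: torch.Tensor,  # frozen source internal LM, [N, V]
              targets: torch.Tensor,         # next-token ids, [N]
              kld_weight: float = 0.5) -> torch.Tensor:
    # ILMA-style loss (sketch): cross-entropy of the internal LM on
    # target-domain text plus a KL regularizer toward the source ILM.
    # Only ILM-related parameters would receive gradients; the encoder
    # is untouched, so run-time computational cost is unchanged.
    ce = F.cross_entropy(ilm_logits, targets)
    kld = F.kl_div(F.log_softmax(ilm_logits, dim=-1),
                   F.softmax(src_ilm_logits, dim=-1),
                   reduction="batchmean")
    return ce + kld_weight * kld
```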
arXiv Detail & Related papers (2021-10-06T23:03:29Z)
- Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition [82.60133751942854]
Internal language model estimation (ILME)-based LM fusion has shown significant word error rate (WER) reduction from Shallow Fusion.
We propose a novel MWER training with ILME (MWER-ILME) where the ILME-based fusion is conducted to generate N-best hypotheses and their posteriors.
MWER-ILME achieves on average 8.8% and 5.8% relative WER reductions from MWER and MWER-SF training, respectively, on 6 different test sets.
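The mechanics of an MWER objective over ILME-fused N-best lists can be sketched as follows: hypothesis posteriors come from a softmax over the fused scores, and the loss is the expected, mean-centered word-error count. The names and the centering trick are standard, stated here as assumptions about this paper's exact setup.

```python
import torch

def mwer_loss(fused_scores: torch.Tensor,  # [N] ILME-fused log scores
              word_errors: torch.Tensor    # [N] word errors per hypothesis
              ) -> torch.Tensor:
    # MWER-style loss over an N-best list (sketch): renormalize the fused
    # scores into hypothesis posteriors, then take the expected word-error
    # count, centered by the mean as a common variance-reduction step.
    posteriors = torch.softmax(fused_scores, dim=0)
    centered = word_errors.float() - word_errors.float().mean()
    return torch.sum(posteriors * centered)
```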
arXiv Detail & Related papers (2021-06-04T07:24:49Z)
- Librispeech Transducer Model with Internal Language Model Prior Correction [58.579080710256704]
We study variants to include an external language model (LM) with shallow fusion and subtract an estimated internal LM.
The subtraction of the internal LM gives us over 14% relative improvement over normal shallow fusion.
Our transducer has a separate probability distribution for the non-blank labels.
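A sketch of what a separate non-blank distribution can look like, and why it helps fusion: blank gets its own probability (here via a sigmoid), while a softmax over the remaining logits gives P(label | non-blank), the distribution that external-LM fusion and internal-LM subtraction act on. This parameterization is an illustrative assumption, not necessarily the paper's exact model.

```python
import torch

def split_blank_label(logits: torch.Tensor):
    # Separate blank / non-blank distributions (sketch): logits[..., 0]
    # drives P(blank) through a sigmoid; the rest are softmax-normalized
    # into P(label | non-blank). LM fusion and internal-LM subtraction
    # then apply to the label distribution only, leaving blank untouched.
    p_blank = torch.sigmoid(logits[..., 0])
    log_p_labels = torch.log_softmax(logits[..., 1:], dim=-1)
    return p_blank, log_p_labels
```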
arXiv Detail & Related papers (2021-04-07T09:18:56Z)
- Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition [56.27081731553829]
External language model (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition.
We propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models.
ILME can alleviate the domain mismatch between training and testing, or improve the multi-domain E2E ASR.
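In symbols, the ILME-corrected decoding criterion used in this line of work (matching the abstract above) is the following, with X the acoustic input, Y a hypothesis, and lambda_T, lambda_I interpolation weights tuned on development data:

```latex
\hat{Y} = \arg\max_{Y}\,\Big[\log P_{\mathrm{E2E}}(Y \mid X)
        + \lambda_{T}\,\log P_{\mathrm{ELM}}(Y)
        - \lambda_{I}\,\log P_{\mathrm{ILM}}(Y)\Big]
```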
arXiv Detail & Related papers (2020-11-03T20:11:04Z)