Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2110.05354v1
- Date: Wed, 6 Oct 2021 23:03:29 GMT
- Title: Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition
- Authors: Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li, Xie Chen, Yu Wu, Yifan Gong
- Abstract summary: We propose an internal LM adaptation (ILMA) of the E2E model using text-only data.
ILMA enables fast text-only adaptation of the E2E model without increasing the run-time computational cost.
With transformer transducer models trained on 30K hours of speech, ILMA achieves up to 34.9% relative word error rate reduction.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-only adaptation of an end-to-end (E2E) model remains a challenging task
for automatic speech recognition (ASR). Language model (LM) fusion-based
approaches require an additional external LM during inference, significantly
increasing the computation cost. To overcome this, we propose an internal LM
adaptation (ILMA) of the E2E model using text-only data. Trained with
audio-transcript pairs, an E2E model implicitly learns an internal LM that
characterizes the token-sequence probability; this internal LM is approximated
by the E2E model output after the encoder contribution is zeroed out. During ILMA, we
fine-tune the internal LM, i.e., the E2E components excluding the encoder, to
minimize a cross-entropy loss. To make ILMA effective, it is essential to train
the E2E model with an internal LM loss besides the standard E2E loss.
Furthermore, we propose to regularize ILMA by minimizing the Kullback-Leibler
divergence between the output distributions of the adapted and unadapted
internal LMs. ILMA is most effective when only the last linear layer of the
joint network is updated. ILMA enables fast text-only adaptation of the E2E
model without increasing the run-time computational cost. In experiments with
transformer transducer models trained on 30K hours of speech, ILMA achieves up
to 34.9% relative word error rate reduction from the unadapted baseline.
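To make the recipe concrete, here is a minimal, hypothetical PyTorch sketch of ILMA on a toy transducer. TinyTransducer, ilma_step, the layer sizes, and the kl_weight value are all illustrative assumptions, not the paper's implementation.

```python
# A minimal, hypothetical sketch of ILMA on a toy transducer. All names,
# dimensions, and weights here are illustrative assumptions.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTransducer(nn.Module):
    """Toy transducer: encoder, prediction network, and joint network."""
    def __init__(self, vocab_size=100, hidden=64, feat_dim=80):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)  # unused in the ILM path
        self.embed = nn.Embedding(vocab_size, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)
        self.joint_hidden = nn.Linear(2 * hidden, hidden)
        self.joint_out = nn.Linear(hidden, vocab_size)  # last linear layer of the joint network

    def internal_lm_logits(self, tokens):
        """Internal LM: run the joint network with the encoder output zeroed,
        so the output distribution depends only on the label history."""
        pred, _ = self.predictor(self.embed(tokens))
        enc = torch.zeros_like(pred)  # zero out the encoder contribution
        h = torch.tanh(self.joint_hidden(torch.cat([enc, pred], dim=-1)))
        return self.joint_out(h)

def ilma_step(model, frozen_model, tokens, targets, kl_weight=0.5):
    """One ILMA update on text-only data: next-token cross-entropy on the
    internal LM, plus KL regularization toward the unadapted internal LM."""
    logits = model.internal_lm_logits(tokens)
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    with torch.no_grad():
        ref = frozen_model.internal_lm_logits(tokens)
    kl = F.kl_div(F.log_softmax(logits, dim=-1),
                  F.log_softmax(ref, dim=-1),
                  reduction="batchmean", log_target=True)
    return ce + kl_weight * kl

# Adapt only the last linear layer of the joint network, keeping a frozen
# copy of the unadapted model as the KL reference.
model = TinyTransducer()
frozen = copy.deepcopy(model).eval()
for p in model.parameters():
    p.requires_grad_(False)
for p in model.joint_out.parameters():
    p.requires_grad_(True)
opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)

text = torch.randint(0, 100, (8, 21))  # stand-in for adaptation text tokens
loss = ilma_step(model, frozen, text[:, :-1], text[:, 1:])
loss.backward()
opt.step()
```

Note the prerequisite stated in the abstract: ILMA is only effective if the source model was trained with an internal LM loss alongside the standard E2E loss, which this sketch takes as given.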
Related papers
- Acoustic Model Fusion for End-to-end Speech Recognition [7.431401982826315]
Speech recognition systems implicitly model all conventional ASR components, such as the acoustic model (AM) and the language model (LM).
We propose the integration of an external AM into the E2E system to better address the domain mismatch.
We achieve a significant reduction in word error rate, with a drop of up to 14.3% across varied test sets.
arXiv Detail & Related papers (2023-10-10T23:00:17Z)
- Decoupled Structure for Improved Adaptability of End-to-End Models [16.195423291103975]
This paper proposes decoupled structures for attention-based encoder-decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models.
The acoustic and linguistic parts of the E2E model decoder (or prediction network) are decoupled, making the linguistic component replaceable.
Experiments for E2E ASR models trained on the Libri-100h corpus showed that the proposed decoupled structure gave 15.1% and 17.2% relative word error rate reductions.
arXiv Detail & Related papers (2023-08-25T12:31:12Z)
- JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition [63.38229762589485]
We propose a joint end-to-end (E2E) model and internal language model (ILM) training method to inject large-scale unpaired text into the ILM.
With 100B unpaired sentences, JEIT/CJJT improves rare-word recognition accuracy by up to 16.4% over a model trained without unpaired text.
arXiv Detail & Related papers (2023-02-16T21:07:38Z)
- Modular Hybrid Autoregressive Transducer [51.29870462504761]
Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition.
We propose a modular hybrid autoregressive transducer that has structurally separated label and blank decoders.
On Google's large-scale production data, a multi-domain MHAT adapted with 100B sentences achieves relative WER reductions of up to 12.4% without LM fusion.
arXiv Detail & Related papers (2022-10-31T03:56:37Z)
- Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition [82.60133751942854]
Internal language model estimation (ILME)-based LM fusion has shown significant word error rate (WER) reduction from Shallow Fusion.
We propose a novel MWER training with ILME (MWER-ILME) where the ILME-based fusion is conducted to generate N-best hypotheses and their posteriors.
MWER-ILME achieves on average 8.8% and 5.8% relative WER reductions from MWER and MWER-SF training, respectively, on 6 different test sets.
arXiv Detail & Related papers (2021-06-04T07:24:49Z)
- Internal Language Model Training for Domain-Adaptive End-to-End Speech Recognition [83.739317674302]
The internal language model estimation (ILME) method can be used to improve integration between external language models and automatic speech recognition systems.
We propose an internal LM training (ILMT) method to minimize an additional internal LM loss.
ILMT encourages the E2E model to form a standalone LM inside its existing components, without sacrificing ASR accuracy.
arXiv Detail & Related papers (2021-02-02T08:15:02Z)
- Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition [56.27081731553829]
Internal language model (LM) integration is a challenging task for end-to-end (E2E) automatic speech recognition.
We propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models.
ILME can alleviate the domain mismatch between training and testing or improve multi-domain E2E ASR; a fusion-scoring sketch follows this list.
arXiv Detail & Related papers (2020-11-03T20:11:04Z)
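For context on the ILME-style fusion that several of the papers above build on, below is a hypothetical per-token scoring function; ilme_fusion_score and the interpolation weights lambda_ilm and lambda_ext are illustrative assumptions rather than values from any of these papers.

```python
# Hypothetical ILME-style fusion score for one decoding step: subtract the
# internal LM score and add the external LM score. Weights are assumptions.
def ilme_fusion_score(log_p_e2e: float, log_p_ilm: float, log_p_ext: float,
                      lambda_ilm: float = 0.2, lambda_ext: float = 0.4) -> float:
    return log_p_e2e - lambda_ilm * log_p_ilm + lambda_ext * log_p_ext
```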