JEIT: Joint End-to-End Model and Internal Language Model Training for
Speech Recognition
- URL: http://arxiv.org/abs/2302.08583v1
- Date: Thu, 16 Feb 2023 21:07:38 GMT
- Authors: Zhong Meng, Weiran Wang, Rohit Prabhavalkar, Tara N. Sainath, Tongzhou
Chen, Ehsan Variani, Yu Zhang, Bo Li, Andrew Rosenberg, Bhuvana Ramabhadran
- Abstract summary: We propose a joint end-to-end (E2E) model and internal language model (ILM) training method that injects large-scale unpaired text into the ILM.
With 100B unpaired sentences, JEIT/CJJT improves rare-word recognition accuracy by up to 16.4% over a model trained without unpaired text.
- Score: 63.38229762589485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose JEIT, a joint end-to-end (E2E) model and internal
language model (ILM) training method that injects large-scale unpaired text
into the ILM during E2E training, improving rare-word speech recognition.
With JEIT, the E2E model
computes an E2E loss on audio-transcript pairs while its ILM estimates a
cross-entropy loss on unpaired text. The E2E model is trained to minimize a
weighted sum of the E2E and ILM losses. During JEIT, the ILM absorbs knowledge
from unpaired text while the E2E training serves as regularization. Unlike ILM
adaptation methods, JEIT does not require a separate adaptation step and avoids
the need for Kullback-Leibler divergence regularization of ILM. We also show
that modular hybrid autoregressive transducer (MHAT) performs better than HAT
in the JEIT framework, and is much more robust than HAT during ILM adaptation.
To push the limit of unpaired text injection, we further propose a combined
JEIT and JOIST training (CJJT) that benefits from modality matching, encoder
text injection and ILM training. Both JEIT and CJJT can foster a more effective
LM fusion. With 100B unpaired sentences, JEIT/CJJT improves rare-word
recognition accuracy by up to 16.4% over a model trained without unpaired text.
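The training recipe in the abstract reduces to a weighted sum of two losses: an E2E loss on paired audio-transcript batches and a cross-entropy loss computed by the ILM on unpaired text. Below is a minimal PyTorch sketch of one such joint step; the model interface (`e2e_loss`, `ilm_logits`) and the weight `ilm_weight` are illustrative assumptions, not the paper's implementation.

```python
import torch

# Hypothetical interface standing in for a transducer-style E2E model:
#   model.e2e_loss(audio, transcript) -> E2E (e.g., RNN-T/HAT) loss on a paired batch
#   model.ilm_logits(prev_tokens)     -> next-token logits from the internal LM
#                                        (label-side components only, no acoustics)

def jeit_step(model, paired_batch, text_batch, optimizer, ilm_weight=0.1):
    """One JEIT update: weighted sum of E2E and ILM losses (sketch)."""
    audio, transcript = paired_batch       # paired audio-transcript data
    tokens, targets = text_batch           # unpaired text, shifted by one token

    # E2E loss on the paired batch.
    loss_e2e = model.e2e_loss(audio, transcript)

    # Cross-entropy of the internal LM on unpaired text.
    logits = model.ilm_logits(tokens)      # (batch, time, vocab)
    loss_ilm = torch.nn.functional.cross_entropy(
        logits.transpose(1, 2), targets)   # CE expects (batch, vocab, time)

    # JEIT objective: the ILM absorbs text knowledge while the E2E loss
    # acts as a regularizer; no separate adaptation step, no KL term.
    loss = loss_e2e + ilm_weight * loss_ilm
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```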
Related papers
- Leveraging Large Text Corpora for End-to-End Speech Summarization [58.673480990374635]
End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech.
We present two novel methods that leverage a large amount of external text summarization data for E2E SSum training.
arXiv Detail & Related papers (2023-03-02T05:19:49Z)
- Improving Rare Word Recognition with LM-aware MWER Training [50.241159623691885]
We introduce LMs into the training of hybrid autoregressive transducer (HAT) models within a discriminative, minimum word error rate (MWER) framework.
For the shallow fusion setup, we use LMs during both hypothesis generation and loss computation, and the LM-aware MWER-trained model achieves a 10% relative improvement (a sketch of the objective follows below).
For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner.
arXiv Detail & Related papers (2022-04-15T17:19:41Z)
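As a rough illustration of the objective above: each N-best hypothesis gets a shallow-fusion score (E2E plus weighted LM log-probability), the scores are renormalized into a posterior over the list, and the loss is the expected word-error count. A sketch under assumed inputs, not the paper's exact formulation:

```python
import torch

def lm_aware_mwer_loss(e2e_logprobs, lm_logprobs, word_errors, lm_weight=0.3):
    """Expected word-error loss over an N-best list (sketch).

    e2e_logprobs, lm_logprobs: (num_hyps,) log-probabilities per hypothesis
    word_errors:               (num_hyps,) word-error counts vs. the reference
    """
    # Shallow fusion: interpolate E2E and LM scores in the log domain.
    fused = e2e_logprobs + lm_weight * lm_logprobs

    # Renormalize over the N-best list to get a hypothesis posterior.
    posterior = torch.softmax(fused, dim=0)

    # Subtracting the mean error is a common variance-reduction baseline.
    relative_errors = word_errors - word_errors.mean()
    return torch.sum(posterior * relative_errors)
```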
- Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition [80.32546870220979]
We propose an internal LM adaptation (ILMA) of the E2E model using text-only data.
ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost.
In experiments with transformer transducer models trained on 30K hours of speech, ILMA achieves up to 34.9% relative word error rate reduction (a sketch follows below).
arXiv Detail & Related papers (2021-10-06T23:03:29Z)
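ILMA as summarized above fine-tunes only the internal LM on new-domain text; adaptation methods of this kind typically add a KL term toward the unadapted ILM (the regularization JEIT avoids) so the adapted model stays consistent with the original E2E model. A hedged sketch, assuming the same hypothetical `ilm_logits` interface as above and a frozen copy of the model:

```python
import torch
import torch.nn.functional as F

def ilma_step(model, frozen_model, text_batch, optimizer, kl_weight=0.5):
    """Text-only ILM adaptation step (sketch): cross-entropy on new-domain
    text plus a KL term toward the unadapted ILM. Only the label-side
    (internal LM) parameters are assumed to be trainable."""
    tokens, targets = text_batch
    logits = model.ilm_logits(tokens)                 # adapted ILM
    with torch.no_grad():
        ref_logits = frozen_model.ilm_logits(tokens)  # original ILM

    ce = F.cross_entropy(logits.transpose(1, 2), targets)
    kl = F.kl_div(F.log_softmax(logits, dim=-1),
                  F.softmax(ref_logits, dim=-1),
                  reduction="batchmean")
    loss = ce + kl_weight * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```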
- Internal Language Model Training for Domain-Adaptive End-to-End Speech Recognition [83.739317674302]
The internal language model estimation (ILME) method can be used to improve the integration of external language models with automatic speech recognition systems.
We propose an internal LM training (ILMT) method to minimize an additional internal LM loss.
ILMT encourages the E2E model to form a standalone LM inside its existing components, without sacrificing ASR accuracy (sketched below).
arXiv Detail & Related papers (2021-02-02T08:15:02Z)
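ILMT differs from JEIT mainly in where the LM loss is computed: the internal LM's cross-entropy is taken on the paired training transcripts rather than on unpaired text. A brief sketch under the same assumed model interface as the JEIT sketch above:

```python
import torch

def ilmt_loss(model, audio, transcript, ilm_weight=0.1):
    """ILMT objective (sketch): the usual E2E loss plus the internal LM's
    cross-entropy on the same training transcripts, so the label-side
    components form a standalone LM. `transcript` is assumed to be a
    (batch, length) token tensor."""
    tokens, targets = transcript[:, :-1], transcript[:, 1:]
    loss_e2e = model.e2e_loss(audio, transcript)
    logits = model.ilm_logits(tokens)
    loss_ilm = torch.nn.functional.cross_entropy(
        logits.transpose(1, 2), targets)
    return loss_e2e + ilm_weight * loss_ilm
```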
- Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding [14.752834813510702]
We treat an E2E system as a multi-modal model, with audio and text functioning as its two modalities.
We propose using different multi-modal losses to guide the acoustic embeddings to be closer to the text embeddings.
We train the cross-modal latent space (CMLS) model on two publicly available E2E datasets across different cross-modal losses, and show that our proposed triplet loss function achieves the best performance (a sketch follows below).
arXiv Detail & Related papers (2020-11-18T02:32:42Z)
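The triplet loss mentioned above treats the acoustic embedding as the anchor, its own text embedding as the positive, and a mismatched text embedding as the negative. A minimal sketch using PyTorch's built-in triplet margin loss; the in-batch negative selection and the margin value are assumptions:

```python
import torch

def cross_modal_triplet_loss(audio_emb, text_emb, margin=1.0):
    """Triplet loss over a batch of paired embeddings (sketch).

    audio_emb, text_emb: (batch, dim); row i of each side is a matched pair.
    """
    # Negative: the text embedding of a different utterance in the batch,
    # obtained here by rotating the batch by one position.
    negative = text_emb.roll(shifts=1, dims=0)
    return torch.nn.functional.triplet_margin_loss(
        audio_emb, text_emb, negative, margin=margin)
```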
- Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition [56.27081731553829]
Internal language model (LM) integration is a challenging task for end-to-end (E2E) automatic speech recognition.
We propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models.
ILME can alleviate the domain mismatch between training and testing, or improve multi-domain E2E ASR (a sketch of the fusion rule follows below).
arXiv Detail & Related papers (2020-11-03T20:11:04Z)
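ILME-style fusion subtracts an estimate of the E2E model's internal LM score while adding the external LM score, so the external LM substitutes for, rather than stacks on, the source-domain prior the E2E model learned implicitly. A sketch of the scoring rule applied during beam search; the weight values are illustrative:

```python
def ilme_fusion_score(e2e_logprob, ext_lm_logprob, ilm_logprob,
                      lm_weight=0.5, ilm_weight=0.3):
    """Per-token (or per-hypothesis) score for ILME-based fusion (sketch).

    Subtracting the internal LM estimate lets the external LM stand in
    for the implicit source-domain prior of the E2E model.
    """
    return e2e_logprob + lm_weight * ext_lm_logprob - ilm_weight * ilm_logprob
```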
- Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus [35.45918249451485]
End-to-end (E2E) automatic speech recognition systems lack the distinct language model (LM) component that characterizes traditional speech systems.
Shallow fusion has been proposed as a method for incorporating a pre-trained LM into an E2E model at inference time.
We apply shallow fusion to incorporate a very large text corpus into a state-of-the-art E2E ASR model (a rescoring sketch follows below).
arXiv Detail & Related papers (2020-08-24T14:53:10Z)
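Shallow fusion, as used above, is an inference-time log-linear combination of E2E and external-LM scores. A minimal N-best rescoring sketch; the callable interface and the weight are assumptions:

```python
def shallow_fusion_rescore(hypotheses, lm_logprob, lm_weight=0.3):
    """Rescore an N-best list with an external LM (sketch).

    hypotheses: list of (text, e2e_logprob) pairs from the E2E decoder
    lm_logprob: callable mapping text -> external-LM log-probability
    Returns the hypothesis with the highest fused score.
    """
    rescored = [(text, score + lm_weight * lm_logprob(text))
                for text, score in hypotheses]
    return max(rescored, key=lambda pair: pair[1])[0]
```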