Improving Tail Performance of a Deliberation E2E ASR Model Using a Large
Text Corpus
- URL: http://arxiv.org/abs/2008.10491v2
- Date: Tue, 25 Aug 2020 12:50:55 GMT
- Authors: Cal Peyser, Sepand Mavandadi, Tara N. Sainath, James Apfel, Ruoming
Pang, Shankar Kumar
- Abstract summary: End-to-end (E2E) automatic speech recognition systems lack the distinct language model (LM) component that characterizes traditional speech systems.
Shallow fusion has been proposed as a method for incorporating a pre-trained LM into an E2E model at inference time.
We apply shallow fusion to incorporate a very large text corpus into a state-of-the-art E2E ASR model.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems lack the distinct
language model (LM) component that characterizes traditional speech systems.
While this simplifies the model architecture, it complicates the task of
incorporating text-only data into training, which is important to the
recognition of tail words that do not occur often in audio-text pairs. While
shallow fusion has been proposed as a method for incorporating a pre-trained LM
into an E2E model at inference time, it has not yet been explored for very
large text corpora, and it has been shown to be very sensitive to
hyperparameter settings in the beam search. In this work, we apply shallow
fusion to incorporate a very large text corpus into a state-of-the-art E2E ASR
model. We explore the impact of model size and show that intelligent pruning of
the training set can be more effective than increasing the parameter count.
Additionally, we show that incorporating the LM in minimum word error rate
(MWER) fine-tuning makes shallow fusion far less dependent on optimal
hyperparameter settings, reducing the difficulty of that tuning problem.
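Shallow fusion combines the E2E model's per-token score with a weighted external-LM score during beam search. The sketch below is a minimal, hypothetical illustration of that scoring rule, score(y) = log P_e2e(y|x) + lam * log P_lm(y); the token distributions and the fusion weight `lam` are made-up examples, not values from the paper. The sensitivity to `lam` is exactly the hyperparameter dependence discussed above.

```python
import math

def shallow_fusion_score(e2e_logprobs, lm_logprobs, lam):
    """Combine per-token log-probabilities from the E2E model and an
    external LM: score(tok) = log P_e2e(tok|x) + lam * log P_lm(tok)."""
    return {tok: e2e_logprobs[tok] + lam * lm_logprobs[tok]
            for tok in e2e_logprobs}

# Toy next-token distributions (in log space), for illustration only.
e2e = {"cat": math.log(0.6), "kat": math.log(0.4)}  # E2E model mildly prefers "cat"
lm = {"cat": math.log(0.9), "kat": math.log(0.1)}   # text LM strongly prefers "cat"

fused = shallow_fusion_score(e2e, lm, lam=0.3)
best = max(fused, key=fused.get)  # the LM reinforces the correct spelling
```

In a real beam-search decoder this combination is applied at every expansion step, which is why a poorly chosen `lam` can degrade hypotheses throughout the search.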
Related papers
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
- Leveraging Large Text Corpora for End-to-End Speech Summarization [58.673480990374635]
End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech.
We present two novel methods that leverage a large amount of external text summarization data for E2E SSum training.
arXiv Detail & Related papers (2023-03-02T05:19:49Z)
- JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition [63.38229762589485]
We propose a joint end-to-end (E2E) model and internal language model (ILM) training method to inject large-scale unpaired text into ILM.
With 100B unpaired sentences, JEIT/CJJT improves rare-word recognition accuracy by up to 16.4% over a model trained without unpaired text.
arXiv Detail & Related papers (2023-02-16T21:07:38Z)
- Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems [2.4909170697740963]
We propose a contextual density ratio approach for both training a context-aware E2E model and adapting the language model to named entities.
Our proposed technique achieves a relative improvement of up to 46.5% on named entities over an E2E baseline without degrading the overall recognition accuracy on the whole test set.
arXiv Detail & Related papers (2022-06-29T13:12:46Z)
- Improving Rare Word Recognition with LM-aware MWER Training [50.241159623691885]
We introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework.
For the shallow fusion setup, we use LMs during both hypothesis generation and loss computation, and the LM-aware MWER-trained model achieves a 10% relative improvement.
For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner.
arXiv Detail & Related papers (2022-04-15T17:19:41Z)
- A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech/pseudo-label pairs and synthesized-audio/text pairs, we propose a complementary joint training (CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech [7.476901945542385]
We present an end-to-end text-to-speech (E2E-TTS) model which has a simplified training pipeline and outperforms a cascade of separately learned models.
Our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPnet2-TTS.
arXiv Detail & Related papers (2022-03-31T07:25:11Z)
- Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition [80.32546870220979]
We propose an internal LM adaptation (ILMA) of the E2E model using text-only data.
ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost.
Experimented with 30K-hour trained transformer transducer models, ILMA achieves up to 34.9% relative word error rate reduction.
arXiv Detail & Related papers (2021-10-06T23:03:29Z)
- A Full Text-Dependent End to End Mispronunciation Detection and Diagnosis with Easy Data Augmentation Techniques [28.59181595057581]
We present a novel text-dependent model that differs from SED-MDD.
We propose three simple data augmentation methods that effectively improve the model's ability to capture mispronounced phonemes.
arXiv Detail & Related papers (2021-04-17T03:11:41Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.