Have best of both worlds: two-pass hybrid and E2E cascading framework for speech recognition
- URL: http://arxiv.org/abs/2110.04891v1
- Date: Sun, 10 Oct 2021 20:11:38 GMT
- Title: Have best of both worlds: two-pass hybrid and E2E cascading framework for speech recognition
- Authors: Guoli Ye, Vadim Mazalov, Jinyu Li and Yifan Gong
- Abstract summary: Hybrid and end-to-end (E2E) systems have different error patterns in the speech recognition results.
This paper proposes a two-pass hybrid and E2E cascading (HEC) framework to combine the hybrid and E2E models.
We show that the proposed system achieves 8-10% relative word error rate reduction with respect to each individual system.
- Score: 71.30167252138048
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hybrid and end-to-end (E2E) systems have their individual advantages, with
different error patterns in the speech recognition results. By jointly modeling
audio and text, the E2E model performs better in matched scenarios and scales
well with a large amount of paired audio-text training data. The modularized
hybrid model is easier to customize and better at making use of a massive
amount of unpaired text data. This paper proposes a two-pass hybrid and E2E
cascading (HEC) framework that combines the hybrid and E2E models to take
advantage of both, with the hybrid model in the first pass and the E2E model in
the second pass. We show that the proposed system achieves 8-10% relative word
error rate reduction with respect to each individual system. More importantly,
compared with the pure E2E system, the proposed system has the potential to
retain the advantages of the hybrid system, e.g., customization and
segmentation capabilities. We also show that the second-pass E2E model in HEC
is robust to changes in the first-pass hybrid model.
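To make the cascading flow concrete, below is a minimal sketch of a two-pass pipeline in the spirit of the abstract: a first-pass hybrid system segments the audio and produces hypotheses, and a second-pass E2E model re-decodes each segment. All interfaces here (`Segment`, `hybrid_first_pass`, `e2e_second_pass`) are hypothetical placeholders for illustration, not the paper's implementation.
```python
# Minimal sketch of a two-pass hybrid + E2E cascade (HEC-style).
# The recognizer interfaces below are hypothetical stand-ins for a real
# hybrid decoder and a real E2E rescoring/re-decoding model.

from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """A speech segment produced by the first-pass hybrid system."""
    start: float           # segment start time in seconds
    end: float              # segment end time in seconds
    first_pass_text: str    # hybrid hypothesis for this segment


def hybrid_first_pass(audio: bytes) -> List[Segment]:
    """Placeholder for a hybrid (AM + LM) decoder that also segments the audio.

    In the cascade, the hybrid pass keeps its usual strengths
    (customization, segmentation); here we just return a dummy hypothesis.
    """
    return [Segment(0.0, 2.3, "have best of both worlds")]


def e2e_second_pass(audio: bytes, segment: Segment) -> str:
    """Placeholder for an E2E model that re-decodes or rescores the segment,
    optionally conditioned on the first-pass hypothesis."""
    return segment.first_pass_text  # dummy: echo the first-pass result


def recognize(audio: bytes) -> str:
    """Run the two-pass cascade: hybrid first, E2E second."""
    segments = hybrid_first_pass(audio)
    return " ".join(e2e_second_pass(audio, seg) for seg in segments)


if __name__ == "__main__":
    print(recognize(b"\x00" * 16000))  # fake 1 s of 16 kHz audio
```
The split mirrors the abstract's motivation: the first pass keeps the hybrid system's segmentation and customization role, while the second pass lets the E2E model refine the final transcript.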
Related papers
- Leveraging Large Text Corpora for End-to-End Speech Summarization [58.673480990374635]
End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech.
We present two novel methods that leverage a large amount of external text summarization data for E2E SSum training.
arXiv Detail & Related papers (2023-03-02T05:19:49Z)
- JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition [63.38229762589485]
We propose a joint end-to-end (E2E) model and internal language model (ILM) training method to inject large-scale unpaired text into ILM.
With 100B unpaired sentences, JEIT/CJJT improves rare-word recognition accuracy by up to 16.4% over a model trained without unpaired text.
arXiv Detail & Related papers (2023-02-16T21:07:38Z)
- Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems [2.4909170697740963]
We propose a contextual density ratio approach for both training a context-aware E2E model and adapting the language model to named entities.
Our proposed technique achieves a relative improvement of up to 46.5% on the names over an E2E baseline without degrading the overall recognition accuracy of the whole test set.
arXiv Detail & Related papers (2022-06-29T13:12:46Z)
- Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems [61.90743116707422]
This paper investigates multi-pass rescoring and cross-adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems.
The best combined system obtained using multi-pass rescoring produced statistically significant word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9% relative) over the stand-alone Conformer system on the NIST Hub5'00, Rt03 and Rt02 evaluation data.
arXiv Detail & Related papers (2022-06-23T10:17:13Z)
- Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI [67.13999010060057]
We propose a novel approach to integrate LF-MMI criterion into E2E ASR frameworks in both training and decoding stages.
Experiments suggest that the introduction of the LF-MMI criterion consistently leads to significant performance improvements.
arXiv Detail & Related papers (2021-12-05T07:30:17Z)
- Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition [80.32546870220979]
We propose an internal LM adaptation (ILMA) of the E2E model using text-only data.
ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost.
In experiments with transformer transducer models trained on 30K hours of audio, ILMA achieves up to 34.9% relative word error rate reduction.
arXiv Detail & Related papers (2021-10-06T23:03:29Z)
- Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding [14.752834813510702]
We treat an E2E system as a multi-modal model, with audio and text functioning as its two modalities.
We propose using different multi-modal losses to guide the acoustic embeddings to be closer to the text embeddings.
We train the CMLS model on two publicly available E2E datasets, across different cross-modal losses and show that our proposed triplet loss function achieves the best performance.
arXiv Detail & Related papers (2020-11-18T02:32:42Z)
- An Effective End-to-End Modeling Approach for Mispronunciation Detection [12.113290059233977]
We present a novel use of the CTC-Attention approach for the mispronunciation detection (MD) task.
We also perform input augmentation with text prompt information to make the resulting E2E model more tailored for the MD task.
A series of Mandarin MD experiments demonstrate that our approach brings about systematic and substantial performance improvements.
arXiv Detail & Related papers (2020-05-18T03:37:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.