Acoustic Model Fusion for End-to-end Speech Recognition
- URL: http://arxiv.org/abs/2310.07062v1
- Date: Tue, 10 Oct 2023 23:00:17 GMT
- Title: Acoustic Model Fusion for End-to-end Speech Recognition
- Authors: Zhihong Lei, Mingbin Xu, Shiyi Han, Leo Liu, Zhen Huang, Tim Ng,
Yuanyuan Zhang, Ernest Pusateri, Mirko Hannemann, Yaqiao Deng, Man-Hung Siu
- Abstract summary: End-to-end (E2E) speech recognition systems implicitly model all conventional ASR components, such as the acoustic model (AM) and the language model (LM).
We propose the integration of an external AM into the E2E system to better address the domain mismatch.
We have achieved a significant reduction in the word error rate, with an impressive drop of up to 14.3% across varied test sets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in deep learning and automatic speech recognition (ASR) have
enabled the end-to-end (E2E) ASR system and boosted the accuracy to a new
level. The E2E systems implicitly model all conventional ASR components, such
as the acoustic model (AM) and the language model (LM), in a single network
trained on audio-text pairs. Despite this simpler system architecture, fusing a
separate LM, trained exclusively on text corpora, into the E2E system has
proven to be beneficial. However, the application of LM fusion presents certain
drawbacks, such as its inability to address the domain mismatch issue inherent
to the internal AM. Drawing inspiration from the concept of LM fusion, we
propose the integration of an external AM into the E2E system to better address
the domain mismatch. By implementing this novel approach, we have achieved a
significant reduction in the word error rate, with an impressive drop of up to
14.3% across varied test sets. We also discovered that this AM fusion approach
is particularly beneficial in enhancing named entity recognition.
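In decoding terms, the AM fusion the abstract describes follows the same log-linear recipe as LM shallow fusion: the E2E hypothesis score is interpolated with scores from external models. The sketch below is illustrative only; the hypothesis strings, log-probabilities, and weights `lam_am`/`lam_lm` are all hypothetical values, not numbers from the paper.

```python
# Illustrative fusion of E2E, external-AM, and external-LM hypothesis scores.
# All log-probabilities and interpolation weights below are hypothetical.
candidates = {
    "play songs by adele": {"e2e": -3.2, "am": -2.0, "lm": -4.1},
    "play songs by a dell": {"e2e": -3.0, "am": -5.5, "lm": -6.8},
}

def fused_score(scores, lam_am=0.3, lam_lm=0.2):
    # Log-linear interpolation: E2E log-prob plus weighted external scores.
    return scores["e2e"] + lam_am * scores["am"] + lam_lm * scores["lm"]

# The E2E score alone prefers the misrecognized name...
e2e_best = max(candidates, key=lambda h: candidates[h]["e2e"])
# ...while fusing an in-domain external AM recovers the named entity.
fused_best = max(candidates, key=lambda h: fused_score(candidates[h]))
```

In this toy example the external AM, assumed to be trained on in-domain audio, rescues the named entity, mirroring the paper's observation that AM fusion is particularly helpful for named entity recognition.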
Related papers
- It's Never Too Late: Fusing Acoustic Information into Large Language
Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of automatic speech recognition (ASR) output; however, GER operates on the decoded text alone and discards acoustic information.
In this work, we aim to overcome this limitation by infusing acoustic information before generating the predicted transcription, through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new state-of-the-art cpCER of 29.13% on the evaluated dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Decoupled Structure for Improved Adaptability of End-to-End Models [16.195423291103975]
This paper proposes decoupled structures for attention-based encoder-decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models.
The acoustic and linguistic parts of the E2E model decoder (or prediction network) are decoupled, making the linguistic component replaceable.
Experiments for E2E ASR models trained on the Libri-100h corpus showed that the proposed decoupled structures gave 15.1% and 17.2% relative word error rate reductions for the two decoupled models, respectively.
arXiv Detail & Related papers (2023-08-25T12:31:12Z)
- Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems [2.4909170697740963]
We propose a contextual density ratio approach for both training a contextual aware E2E model and adapting the language model to named entities.
Our proposed technique achieves a relative improvement of up to 46.5% on the names over an E2E baseline without degrading the overall recognition accuracy of the whole test set.
arXiv Detail & Related papers (2022-06-29T13:12:46Z)
- Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI [67.13999010060057]
We propose a novel approach to integrate LF-MMI criterion into E2E ASR frameworks in both training and decoding stages.
Experiments suggest that the introduction of the LF-MMI criterion consistently leads to significant performance improvements.
arXiv Detail & Related papers (2021-12-05T07:30:17Z)
- Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition [80.32546870220979]
We propose an internal LM adaptation (ILMA) of the E2E model using text-only data.
ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost.
Experimented with 30K-hour trained transformer transducer models, ILMA achieves up to 34.9% relative word error rate reduction.
arXiv Detail & Related papers (2021-10-06T23:03:29Z)
- Internal Language Model Training for Domain-Adaptive End-to-End Speech Recognition [83.739317674302]
The internal language model estimation (ILME) method can be used to improve the integration between external language models and automatic speech recognition systems.
We propose an internal LM training (ILMT) method to minimize an additional internal LM loss.
ILMT encourages the E2E model to form a standalone LM inside its existing components, without sacrificing ASR accuracy.
arXiv Detail & Related papers (2021-02-02T08:15:02Z)
- Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition [56.27081731553829]
Internal language model (LM) integration is a challenging task for end-to-end (E2E) automatic speech recognition.
We propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models.
ILME can alleviate the domain mismatch between training and testing, or improve the multi-domain E2E ASR.
arXiv Detail & Related papers (2020-11-03T20:11:04Z)
- An Effective End-to-End Modeling Approach for Mispronunciation Detection [12.113290059233977]
We present a novel use of the CTC-Attention approach for the mispronunciation detection (MD) task.
We also perform input augmentation with text prompt information to make the resulting E2E model more tailored for the MD task.
A series of Mandarin MD experiments demonstrate that our approach brings about systematic and substantial performance improvements.
arXiv Detail & Related papers (2020-05-18T03:37:21Z)
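Several of the entries above (contextual density ratio, ILME, ILMA) share one decoding-time recipe: add a weighted external LM score and subtract an estimate of the E2E model's internal LM, so that source-domain linguistic bias is not counted twice. A minimal sketch, with hypothetical log-probabilities and weights:

```python
def ilme_score(e2e_logp, ext_lm_logp, ilm_logp, lam=0.4, mu=0.3):
    # Density-ratio-style integration: boost with the external (target-domain)
    # LM and discount the internal LM estimated from the E2E model itself.
    # The weights lam and mu are hypothetical tuning parameters.
    return e2e_logp + lam * ext_lm_logp - mu * ilm_logp

# Hypothetical scores for one hypothesis: subtracting the internal-LM term
# removes the source-domain bias baked into the E2E score.
score = ilme_score(e2e_logp=-2.0, ext_lm_logp=-3.0, ilm_logp=-1.0)
```

In practice such weights are tuned on a held-out development set for the target domain; this is a sketch of the shared idea, not any single paper's exact formulation.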
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.