An Empirical Study of Language Model Integration for Transducer based Speech Recognition
- URL: http://arxiv.org/abs/2203.16776v1
- Date: Thu, 31 Mar 2022 03:33:50 GMT
- Title: An Empirical Study of Language Model Integration for Transducer based Speech Recognition
- Authors: Huahuan Zheng, Keyu An, Zhijian Ou, Chen Huang, Ke Ding, Guanglu Wan
- Abstract summary: Methods such as density ratio (DR) and ILM estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method.
We propose a low-order density ratio method (LODR) by training a low-order weak ILM for DR.
- Score: 23.759084092602517
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Utilizing text-only data with an external language model (LM) in end-to-end
RNN-Transducer (RNN-T) for speech recognition is challenging. Recently, methods such as density ratio (DR) and internal language model (ILM) estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that the RNN-T posterior should first subtract the implicitly learned ILM prior in order to integrate the external LM. While
recent studies suggest that RNN-T only learns some low-order language model
information, the DR method uses a well-trained ILM. We hypothesize that this
setting is inappropriate and may deteriorate the performance of the DR method,
and propose a low-order density ratio method (LODR) by training a low-order
weak ILM for DR. Extensive empirical experiments are conducted on both
in-domain and cross-domain scenarios on English LibriSpeech & Tedlium-2 and
Chinese WenetSpeech & AISHELL-1 datasets. It is shown that LODR consistently
outperforms SF in all tasks, while performing generally close to ILME and
better than DR in most tests.
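To make the decoding rules being compared concrete, below is a minimal sketch of how a beam-search hypothesis score is formed under shallow fusion, density ratio, ILME, and the low-order density ratio (LODR) idea described above. The log-linear combinations follow the standard formulations referenced in the abstract; the function name, argument names, and weight values are illustrative assumptions, not the authors' implementation.

    def integrate_lm_score(log_p_rnnt, log_p_ext_lm,
                           log_p_src_lm=None, log_p_est_ilm=None, log_p_low_order=None,
                           lm_weight=0.6, ilm_weight=0.4, method="sf"):
        """Score a partial hypothesis during beam search (illustrative sketch).

        All arguments are log-probabilities of the same partial hypothesis:
          log_p_rnnt      : transducer posterior log P_RNNT(y|x)
          log_p_ext_lm    : external (target-domain) LM, log P_extLM(y)
          log_p_src_lm    : source-domain LM used as the ILM proxy in DR
          log_p_est_ilm   : ILM estimated from the transducer itself (ILME)
          log_p_low_order : low-order (e.g. bi-gram) LM on training transcripts (LODR)

        Scoring rules (lm_weight and ilm_weight are assumed example values):
          SF   : log_p_rnnt + lm_weight * log_p_ext_lm
          DR   : SF score  - ilm_weight * log_p_src_lm
          ILME : SF score  - ilm_weight * log_p_est_ilm
          LODR : SF score  - ilm_weight * log_p_low_order
        """
        score = log_p_rnnt + lm_weight * log_p_ext_lm
        if method == "sf":
            return score
        if method == "dr":
            return score - ilm_weight * log_p_src_lm
        if method == "ilme":
            return score - ilm_weight * log_p_est_ilm
        if method == "lodr":
            return score - ilm_weight * log_p_low_order
        raise ValueError(f"unknown integration method: {method}")

    # Toy usage: the same hypothesis ranked under different integration rules.
    print(integrate_lm_score(-12.3, -8.1, method="sf"))
    print(integrate_lm_score(-12.3, -8.1, log_p_src_lm=-9.0, method="dr"))
    print(integrate_lm_score(-12.3, -8.1, log_p_low_order=-7.2, method="lodr"))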
Related papers
- Effective internal language model training and fusion for factorized transducer model [26.371223360905557]
The internal language model (ILM) of the neural transducer has been widely studied.
We propose a novel ILM training and decoding strategy for factorized transducer models.
arXiv Detail & Related papers (2024-04-02T08:01:05Z)
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome the limitation that such correction relies on the textual hypotheses alone by infusing acoustic information before generating the predicted transcription, through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z) - HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt and their generative capability, LLMs can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- On the Relation between Internal Language Model and Sequence Discriminative Training for Neural Transducers [52.88268942796418]
Internal language model (ILM) subtraction has been widely applied to improve the performance of the RNN-Transducer.
We show that sequence discriminative training has a strong correlation with ILM subtraction from both theoretical and empirical points of view.
arXiv Detail & Related papers (2023-09-25T13:35:28Z)
- On Language Model Integration for RNN Transducer based Speech Recognition [49.84285563767935]
We study various ILM correction-based LM integration methods formulated in a common RNN-T framework.
We provide a decoding interpretation on two major reasons for performance improvement with ILM correction.
We also propose an exact-ILM training framework by extending the proof given in the hybrid autoregressive transducer.
arXiv Detail & Related papers (2021-10-13T16:30:46Z)
- Cross-sentence Neural Language Models for Conversational Speech Recognition [17.317583079824423]
We propose an effective cross-sentence neural LM approach that reranks the ASR N-best hypotheses of an upcoming sentence.
We also explore to extract task-specific global topical information of the cross-sentence history.
arXiv Detail & Related papers (2021-06-13T05:30:16Z)
- Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation [98.11249019844281]
Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models.
We propose reverse KD to rejuvenate more alignments for low-frequency target words.
Results demonstrate that the proposed approach can significantly and universally improve translation quality.
arXiv Detail & Related papers (2021-06-02T02:41:40Z)
- Language Model Prior for Low-Resource Neural Machine Translation [85.55729693003829]
We propose a novel approach to incorporate an LM as a prior in a neural translation model (TM).
We add a regularization term, which pushes the output distributions of the TM to be probable under the LM prior.
Results on two low-resource machine translation datasets show clear improvements even with limited monolingual data.
arXiv Detail & Related papers (2020-04-30T16:29:56Z)
- A Density Ratio Approach to Language Model Fusion in End-To-End Automatic Speech Recognition [9.184319271887531]
This article describes a density ratio approach to integrating external Language Models (LMs) into end-to-end models for Automatic Speech Recognition (ASR).
An RNN-T ASR model trained on paired audio & transcript data from YouTube is evaluated for its ability to generalize to Voice Search data.
arXiv Detail & Related papers (2020-02-26T02:53:42Z)