Modular Hybrid Autoregressive Transducer
- URL: http://arxiv.org/abs/2210.17049v1
- Date: Mon, 31 Oct 2022 03:56:37 GMT
- Title: Modular Hybrid Autoregressive Transducer
- Authors: Zhong Meng, Tongzhou Chen, Rohit Prabhavalkar, Yu Zhang, Gary Wang,
Kartik Audhkhasi, Jesse Emond, Trevor Strohman, Bhuvana Ramabhadran, W. Ronny
Huang, Ehsan Variani, Yinghui Huang, Pedro J. Moreno
- Abstract summary: Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition.
We propose a modular hybrid autoregressive transducer that has structurally separated label and blank decoders.
On Google's large-scale production data, a multi-domain MHAT adapted with 100B sentences achieves relative WER reductions of up to 12.4% without LM fusion.
- Score: 51.29870462504761
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-only adaptation of a transducer model remains challenging for end-to-end
speech recognition since the transducer has no clearly separated acoustic model
(AM), language model (LM) or blank model. In this work, we propose a modular
hybrid autoregressive transducer (MHAT) that has structurally separated label
and blank decoders to predict label and blank distributions, respectively,
along with a shared acoustic encoder. The encoder and label decoder outputs are
directly projected to AM and internal LM scores and then added to compute label
posteriors. We train MHAT with an internal LM loss and a HAT loss to ensure
that its internal LM becomes a standalone neural LM that can be effectively
adapted to text. Moreover, text adaptation of MHAT fosters a much better LM
fusion than internal LM subtraction-based methods. On Google's large-scale
production data, a multi-domain MHAT adapted with 100B sentences achieves
relative WER reductions of up to 12.4% without LM fusion and 21.5% with LM
fusion, relative to a HAT baseline trained on 400K hours of speech.
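The factorization described in the abstract can be made concrete with a small numerical sketch. The Python snippet below is a minimal illustration, not the paper's implementation: the projection matrices W_am, W_ilm, and w_blank, the toy dimensions, and the random vectors standing in for encoder and decoder states are all assumptions; in MHAT these come from a trained acoustic encoder, label decoder, and blank decoder.

```python
# Minimal sketch of an MHAT-style label/blank factorization. The projections
# and state vectors here are random placeholders for learned components.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, D_enc, D_dec = 32, 8, 8              # toy vocab size, encoder dim, decoder dim

# Hypothetical projections standing in for learned output layers.
W_am    = rng.normal(size=(D_enc, V))           # acoustic encoder -> AM label scores
W_ilm   = rng.normal(size=(D_dec, V))           # label decoder    -> internal LM scores
w_blank = rng.normal(size=(D_enc + D_dec,))     # encoder + blank decoder -> blank logit

enc_t   = rng.normal(size=(D_enc,))     # acoustic encoder output at frame t
lab_u   = rng.normal(size=(D_dec,))     # label-decoder state given history y_1..y_u
blank_u = rng.normal(size=(D_dec,))     # blank-decoder state (a separate module in MHAT)

# AM and internal-LM scores are computed separately, added in log space, and
# normalized over labels only; the blank is modeled by a separate Bernoulli.
am_scores  = enc_t @ W_am               # acoustic evidence for each label
ilm_scores = lab_u @ W_ilm              # text-only evidence; this path is the adaptable ILM
p_label    = softmax(am_scores + ilm_scores)                        # P(label k | non-blank)
p_blank    = sigmoid(w_blank @ np.concatenate([enc_t, blank_u]))    # P(blank)

# Per-step posterior used inside the transducer lattice:
#   P(blank) = p_blank,  P(label k) = (1 - p_blank) * p_label[k]
print(p_blank, (1 - p_blank) * p_label[:5])
```

During training, the paper adds an internal LM loss (next-label prediction from the ilm_scores path alone) to the HAT loss; this is what lets the label decoder behave as a standalone neural LM that can later be adapted on text only.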
Related papers
- It's Never Too Late: Fusing Acoustic Information into Large Language
Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome the limitation that GER sees only text hypotheses by infusing acoustic information before generating the predicted transcription, through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- Internal Language Model Adaptation with Text-Only Data for End-to-End
Speech Recognition [80.32546870220979]
We propose an internal LM adaptation (ILMA) of the E2E model using text-only data.
ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost.
In experiments with transformer transducer models trained on 30K hours of speech, ILMA achieves up to a 34.9% relative word error rate reduction.
arXiv Detail & Related papers (2021-10-06T23:03:29Z)
- Investigating Methods to Improve Language Model Integration for
Attention-based Encoder-Decoder ASR Models [107.86965028729517]
Attention-based encoder-decoder (AED) models learn an implicit internal language model (ILM) from the training transcriptions.
We propose several novel methods to estimate the ILM directly from the AED model.
arXiv Detail & Related papers (2021-04-12T15:16:03Z)
- Librispeech Transducer Model with Internal Language Model Prior
Correction [58.579080710256704]
We study variants to include an external language model (LM) with shallow fusion and subtract an estimated internal LM.
The subtraction of the internal LM gives us over 14% relative improvement over normal shallow fusion.
Our transducer has a separate probability distribution for the non-blank labels.
arXiv Detail & Related papers (2021-04-07T09:18:56Z)
- Internal Language Model Training for Domain-Adaptive End-to-End Speech
Recognition [83.739317674302]
The internal language model estimation (ILME) method can be used to improve the integration of external language models with automatic speech recognition systems.
We propose an internal LM training (ILMT) method to minimize an additional internal LM loss.
ILMT encourages the E2E model to form a standalone LM inside its existing components, without sacrificing ASR accuracy.
arXiv Detail & Related papers (2021-02-02T08:15:02Z)
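Several of the papers above (ILME, ILMT, and the Librispeech internal-LM prior correction) combine an external LM with the end-to-end model via shallow fusion while subtracting an estimated internal LM, the approach that MHAT's text-only adaptation is contrasted against. The sketch below is a generic illustration of that log-linear rescoring; the weights lam_ext and lam_ilm and the per-hypothesis log-probabilities are placeholders, not values from any of the papers.

```python
# Minimal sketch of shallow fusion with internal-LM subtraction (ILME-style).
# The score arrays and weights are illustrative placeholders.
import numpy as np

def rescore(log_p_e2e, log_p_ext_lm, log_p_ilm, lam_ext=0.6, lam_ilm=0.3):
    """Combine per-hypothesis scores in log space.

    log_p_e2e    : log P(y | x) from the end-to-end model (summed over tokens)
    log_p_ext_lm : log P(y) from an external LM trained on target-domain text
    log_p_ilm    : estimated log P(y) from the E2E model's internal LM
    """
    # Plain shallow fusion adds the external LM; ILM subtraction additionally
    # removes the internal LM's source-domain prior so it is not counted twice.
    return log_p_e2e + lam_ext * log_p_ext_lm - lam_ilm * log_p_ilm

# Toy example: pick the best of three n-best hypotheses.
log_p_e2e    = np.array([-12.1, -12.4, -13.0])
log_p_ext_lm = np.array([-20.5, -18.2, -19.9])
log_p_ilm    = np.array([-17.0, -21.3, -18.8])
best = int(np.argmax(rescore(log_p_e2e, log_p_ext_lm, log_p_ilm)))
print("best hypothesis index:", best)
```

Because MHAT's internal LM is a structurally separate module, it can instead be adapted on text directly, rather than being estimated and subtracted at decoding time.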
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.