Factorized Neural Transducer for Efficient Language Model Adaptation
- URL: http://arxiv.org/abs/2110.01500v4
- Date: Thu, 7 Oct 2021 15:09:59 GMT
- Title: Factorized Neural Transducer for Efficient Language Model Adaptation
- Authors: Xie Chen, Zhong Meng, Sarangarajan Parthasarathy, Jinyu Li
- Abstract summary: We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
- Score: 51.81097243306204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, end-to-end (E2E) based automatic speech recognition (ASR)
systems have achieved great success due to their simplicity and promising
performance. Neural Transducer based models are increasingly popular in
streaming E2E based ASR systems and have been reported to outperform the
traditional hybrid system in some scenarios. However, the joint optimization of
the acoustic model, lexicon, and language model in the neural Transducer makes
it challenging to utilize pure text for language model adaptation. This
drawback might limit their potential applications in practice. In order to
address this issue, in this paper, we propose a novel model, factorized neural
Transducer, by factorizing the blank and vocabulary prediction, and adopting a
standalone language model for the vocabulary prediction. It is expected that
this factorization can transfer the improvement of the standalone language
model to the Transducer for speech recognition, which allows various language
model adaptation techniques to be applied. We demonstrate that the proposed
factorized neural Transducer yields 15% to 20% WER improvements when
out-of-domain text data is used for language model adaptation, at the cost of a
minor degradation in WER on a general test set.
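The factorization described above can be sketched as follows. This is our own illustrative NumPy sketch, not the paper's implementation: the function and parameter names (`factorized_joint`, `W_blank`, `W_vocab`) are assumptions. The idea it illustrates is that the blank score comes from a conventional joint combination of encoder and predictor, while the vocabulary scores add a standalone language model's log-probabilities to an acoustic projection, so that LM can be adapted (or replaced) with text-only data.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def factorized_joint(enc, pred_blank, lm_logits, W_blank, W_vocab):
    """Sketch of a factorized joint step for one (frame t, label step u) pair.

    enc:        encoder output at frame t, shape (d,)
    pred_blank: blank-predictor output at label step u, shape (d,)
    lm_logits:  standalone-LM logits over the vocabulary, shape (V,)
    W_blank:    blank scoring vector, shape (d,)
    W_vocab:    acoustic projection for vocabulary tokens, shape (V, d)
    Returns log-probabilities over [blank] + vocabulary, shape (V + 1,).
    """
    # Blank branch: encoder and blank predictor combined as in a
    # standard transducer joint network.
    blank_score = W_blank @ np.tanh(enc + pred_blank)        # scalar
    # Vocabulary branch: standalone LM log-probability plus an acoustic
    # score; only this LM needs retraining for text-only adaptation.
    vocab_scores = log_softmax(lm_logits) + W_vocab @ enc    # (V,)
    return log_softmax(np.concatenate([[blank_score], vocab_scores]))
```

Because the vocabulary branch consumes ordinary LM log-probabilities, any text-trained LM with the same vocabulary can in principle be dropped in, which is what makes conventional LM adaptation techniques applicable.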
Related papers
- COPAL: Continual Pruning in Large Language Generative Models [23.747878534962663]
COPAL is an algorithm developed for pruning large language generative models under a continual model adaptation setting.
Our empirical evaluation on LLMs of various sizes shows that COPAL outperforms baseline models.
arXiv Detail & Related papers (2024-05-02T18:24:41Z)
- Hybrid Attention-based Encoder-decoder Model for Efficient Language Model Adaptation [13.16188747098854]
We propose a novel hybrid attention-based encoder-decoder (HAED) speech recognition model.
Our model separates the acoustic and language models, allowing for the use of conventional text-based language model adaptation techniques.
We demonstrate that the proposed HAED model yields 23% relative Word Error Rate (WER) improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2023-09-14T01:07:36Z)
- Fast and accurate factorized neural transducer for text adaption of end-to-end speech recognition models [23.21666928497697]
The improved adaptation ability of the factorized neural transducer (FNT) on text-only adaptation data comes at the cost of lower accuracy compared to the standard neural transducer model.
A combination of these approaches results in a relative word-error-rate reduction of 9.48% from the standard FNT model.
arXiv Detail & Related papers (2022-12-05T02:52:21Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Dependency-based Mixture Language Models [53.152011258252315]
We introduce the Dependency-based Mixture Language Models.
In detail, we first train neural language models with a novel dependency modeling objective.
We then formulate the next-token probability by mixing the previous dependency modeling probability distributions with self-attention.
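The mixing step described in this summary can be sketched as a convex combination: attention-style weights over previous positions mix their per-position next-token distributions. This is our own illustrative sketch under that reading of the abstract; the names (`mixture_next_token`, `dep_dists`, `attn_scores`) are assumptions, not the paper's API.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def mixture_next_token(dep_dists, attn_scores):
    """Form the next-token distribution as an attention-weighted mixture
    of per-position dependency distributions.

    dep_dists:   (T, V) - p(next | position j) from a dependency-trained LM
    attn_scores: (T,)   - unnormalized self-attention scores over positions
    Returns a (V,) probability distribution over the vocabulary.
    """
    weights = softmax(attn_scores)   # (T,) non-negative, sums to 1
    return weights @ dep_dists       # (V,) convex mixture of distributions
```

Because each row of `dep_dists` is already a valid distribution and the weights are a softmax, the mixture is itself a valid distribution with no extra normalization.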
arXiv Detail & Related papers (2022-03-19T06:28:30Z)
- Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network [0.0]
We show that RNN-transducer models can be effectively adapted to new domains using only small amounts of textual data.
We show with multiple ASR evaluation tasks how this method can provide relative gains of 10-45% in target task WER.
arXiv Detail & Related papers (2021-04-22T15:21:41Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- DiscreTalk: Text-to-Speech as a Machine Translation Problem [52.33785857500754]
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
The proposed model consists of two components; a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model.
arXiv Detail & Related papers (2020-05-12T02:45:09Z)
- Hybrid Autoregressive Transducer (hat) [11.70833387055716]
This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model.
It is a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems.
We evaluate our proposed model on a large-scale voice search task.
arXiv Detail & Related papers (2020-03-12T20:47:06Z)
- Rnn-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition [58.105818353866354]
We propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate the problem.
We use language identities to bias the model toward predicting the code-switching (CS) points.
This encourages the model to learn language identity information directly from the transcriptions, so no additional language identification (LID) model is needed.
arXiv Detail & Related papers (2020-02-19T12:01:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.