Hybrid Autoregressive Transducer (HAT)
- URL: http://arxiv.org/abs/2003.07705v1
- Date: Thu, 12 Mar 2020 20:47:06 GMT
- Title: Hybrid Autoregressive Transducer (HAT)
- Authors: Ehsan Variani, David Rybach, Cyril Allauzen, Michael Riley
- Abstract summary: This paper proposes and evaluates the hybrid autoregressive transducer (HAT) model.
It is a time-synchronous encoder-decoder model that preserves the modularity of conventional automatic speech recognition systems.
We evaluate our proposed model on a large-scale voice search task.
- Score: 11.70833387055716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes and evaluates the hybrid autoregressive transducer (HAT)
model, a time-synchronous encoder-decoder model that preserves the modularity of
conventional automatic speech recognition systems. The HAT model provides a way
to measure the quality of the internal language model that can be used to
decide whether inference with an external language model is beneficial or not.
This article also presents a finite context version of the HAT model that
addresses the exposure bias problem and significantly simplifies the overall
training and inference. We evaluate our proposed model on a large-scale voice
search task. Our experiments show significant improvements in WER compared to
the state-of-the-art approaches.
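The core mechanism the abstract alludes to is a factorization of the local output distribution: blank is modeled by its own Bernoulli, and labels by a separate softmax, which makes the decoder's internal language model measurable. Below is a minimal NumPy sketch of that idea; the additive joiner, the tanh nonlinearity, and all parameter names are our assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def hat_local_posterior(f_t, g_u, w_blank, W_label):
    """HAT-style factorization at time t and label position u.

    f_t     : acoustic encoder output (vector)
    g_u     : label decoder output (vector)
    w_blank : blank-branch weights (vector)
    W_label : label-branch weights (vocab_size x dim)
    """
    h = np.tanh(f_t + g_u)                  # additive joiner (an assumption)
    p_blank = sigmoid(w_blank @ h)          # separate Bernoulli for blank
    p_label = (1.0 - p_blank) * softmax(W_label @ h)
    return p_blank, p_label

def internal_lm_logprobs(g_u, W_label):
    """Internal LM estimate: score labels from the decoder state alone,
    i.e. with the acoustic contribution removed, then renormalize."""
    return np.log(softmax(W_label @ np.tanh(g_u)))
```

With the internal LM exposed this way, inference can discount it and add an external LM, roughly score = log P(y|x) - lambda * log P_ILM(y) + mu * log P_ELM(y), with the weights tuned on held-out data.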
Related papers
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Hybrid Attention-based Encoder-decoder Model for Efficient Language Model Adaptation [13.16188747098854]
We propose a novel hybrid attention-based encoder-decoder (HAED) speech recognition model.
Our model separates the acoustic and language models, allowing for the use of conventional text-based language model adaptation techniques.
We demonstrate that the proposed HAED model yields a 23% relative Word Error Rate (WER) improvement when out-of-domain text data is used for language model adaptation.
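At decoding time, such a separation is typically exploited density-ratio style: discount the recognizer's implicit internal LM and add an LM adapted on target-domain text. A minimal sketch, with assumed weights and names rather than the HAED paper's exact recipe:

```python
def adapted_fusion_score(log_p_asr, log_p_internal_lm, log_p_adapted_lm,
                         lam=0.3, mu=0.5):
    """Discount the internal LM and add an external LM fine-tuned on
    target-domain text; lam and mu are assumed tuning weights."""
    return log_p_asr - lam * log_p_internal_lm + mu * log_p_adapted_lm
```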
arXiv Detail & Related papers (2023-09-14T01:07:36Z)
- Incorporating Causal Analysis into Diversified and Logical Response Generation [14.4586344491264]
The conditional variational autoencoder (CVAE) model can generate more diversified responses than the traditional Seq2Seq model.
We propose to predict mediators that preserve relevant information and to incorporate the mediators auto-regressively into the generation process.
arXiv Detail & Related papers (2022-09-20T05:51:11Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, the factorized neural transducer, which factorizes the blank and vocabulary predictions.
This factorization is expected to transfer improvements in the standalone language model to the transducer used for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
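A minimal sketch of that factorization under our own naming: the blank branch is scored on its own, while the vocabulary branch behaves like an ordinary neural LM, so it can be fine-tuned on text alone and swapped back in.

```python
import numpy as np

class FactorizedTransducerStep:
    """Illustrative factorized joiner: blank and vocabulary predictions come
    from separate branches (shapes and interfaces are assumptions)."""

    def __init__(self, w_blank, vocab_lm):
        self.w_blank = w_blank    # blank-branch weights (vector)
        self.vocab_lm = vocab_lm  # any callable: token history -> log-probs

    def score(self, f_t, g_u, history):
        blank_logit = float(self.w_blank @ np.tanh(f_t + g_u))
        vocab_logprobs = self.vocab_lm(history)  # adaptable on text only
        return blank_logit, vocab_logprobs
```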
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- Non-autoregressive Transformer-based End-to-end ASR using BERT [13.07939371864781]
This paper presents a transformer-based end-to-end automatic speech recognition (ASR) model built on BERT.
A series of experiments conducted on the AISHELL-1 dataset demonstrates competitive or superior results.
arXiv Detail & Related papers (2021-04-10T16:22:17Z)
- Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture in which the prediction of phoneme-level prosody features is conditioned on word-level prosody features.
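As a toy illustration of that hierarchy (our construction, not the paper's network), each phoneme-level prediction takes its parent word's prosody feature as an extra input:

```python
import numpy as np

def phoneme_prosody(word_prosody, phone_states, W):
    """Condition each phoneme-level prosody prediction on the word-level
    prosody feature by concatenation; W is an illustrative projection."""
    return [W @ np.concatenate([word_prosody, h]) for h in phone_states]
```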
arXiv Detail & Related papers (2020-11-12T16:16:41Z)
- Early Stage LM Integration Using Local and Global Log-Linear Combination [46.91755970827846]
Sequence-to-sequence models with an implicit alignment mechanism (e.g., attention) are closing the performance gap with traditional hybrid hidden Markov model (HMM) systems.
One important factor to improve word error rate in both cases is the use of an external language model (LM) trained on large text-only corpora.
We present a novel method for language model integration into implicit-alignment based sequence-to-sequence models.
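The "local" variant of the title's log-linear combination can be sketched per decoding step: weight and sum the seq2seq and LM log-probabilities, then renormalize over the vocabulary; applying the same weighting to complete hypotheses instead gives the "global" variant. The weight below is an assumed tuning parameter.

```python
import math

def local_log_linear(log_p_s2s, log_p_lm, lam=0.4):
    """Per-token log-linear fusion with renormalization over the vocabulary."""
    fused = [s + lam * l for s, l in zip(log_p_s2s, log_p_lm)]
    log_z = math.log(sum(math.exp(v) for v in fused))
    return [v - log_z for v in fused]
```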
arXiv Detail & Related papers (2020-05-20T13:49:55Z)
- DiscreTalk: Text-to-Speech as a Machine Translation Problem [52.33785857500754]
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
The proposed model consists of two components: a non-autoregressive vector-quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model.
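In pseudocode form, that two-stage pipeline reads as follows; the method names on the two components are placeholders, not a published API.

```python
def discretalk_synthesize(text, nmt_model, vqvae):
    """Stage 1: the autoregressive Transformer-NMT model 'translates' text
    into discrete VQ-VAE codes. Stage 2: the non-autoregressive VQ-VAE
    decoder reconstructs a waveform from those codes (interfaces assumed)."""
    codes = nmt_model.translate(text)   # text -> discrete speech tokens
    return vqvae.decode(codes)          # tokens -> audio samples
```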
arXiv Detail & Related papers (2020-05-12T02:45:09Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
When paired with a strong auto-regressive decoder, VAEs tend to ignore their latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)