Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment
- URL: http://arxiv.org/abs/2010.14233v1
- Date: Sat, 24 Oct 2020 09:35:37 GMT
- Title: Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment
- Authors: Ethan A. Chi, Julian Salazar, and Katrin Kirchhoff
- Abstract summary: Infilling and iterative refinement models make up some of this gap by editing the outputs of a non-autoregressive model.
We propose iterative realignment, where refinements occur over latent alignments rather than output sequence space.
- Score: 18.487842656780728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-autoregressive models greatly improve decoding speed over typical
sequence-to-sequence models, but suffer from degraded performance. Infilling
and iterative refinement models make up some of this gap by editing the outputs
of a non-autoregressive model, but are constrained in the edits that they can
make. We propose iterative realignment, where refinements occur over latent
alignments rather than output sequence space. We demonstrate this in speech
recognition with Align-Refine, an end-to-end Transformer-based model which
refines connectionist temporal classification (CTC) alignments to allow
length-changing insertions and deletions. Align-Refine outperforms Imputer and
Mask-CTC, matching an autoregressive baseline on WSJ at 1/14th the real-time
factor and attaining a LibriSpeech test-other WER of 9.0% without an LM. Our
model is strong even in one iteration with a shallower decoder.
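To make the mechanism concrete, below is a minimal Python sketch of refinement over CTC alignments. The `refiner` object, its methods, and the decode loop are hypothetical stand-ins for the model described in the abstract, not the authors' implementation; only the CTC collapse rule is standard.

```python
# Sketch: iterative realignment over frame-level CTC alignments.
# Editing frames (including blanks) and re-collapsing lets the output
# sequence grow or shrink, i.e., insertions and deletions are possible.

BLANK = "_"  # CTC blank symbol

def ctc_collapse(alignment):
    """Collapse a frame-level CTC alignment: merge repeats, drop blanks."""
    out, prev = [], None
    for tok in alignment:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out

def align_refine_decode(audio_features, refiner, num_iterations=3):
    """Hypothetical decode loop: each pass re-predicts every frame label
    in parallel, conditioned on the audio and the previous alignment."""
    alignment = refiner.initial_alignment(audio_features)  # e.g., greedy CTC pass
    for _ in range(num_iterations):
        alignment = refiner.realign(audio_features, alignment)
    return ctc_collapse(alignment)

# Changing a single frame of the alignment changes the output length.
assert ctc_collapse(["_", "c", "c", "a", "_", "t"]) == ["c", "a", "t"]
assert ctc_collapse(["_", "c", "c", "_", "_", "t"]) == ["c", "t"]  # deletion via a blank
```

Because every refinement step rewrites the full alignment, length-changing edits fall out of the CTC collapse rather than requiring explicit insert/delete operations on the output sequence.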
Related papers
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to make such iterative refinement more efficient.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- Non-autoregressive Sequence-to-Sequence Vision-Language Models [63.77614880533488]
We propose a parallel decoding sequence-to-sequence vision-language model that marginalizes over multiple inference paths in the decoder.
The model achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time.
arXiv Detail & Related papers (2024-03-04T17:34:59Z)
- Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), have known limitations: GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations.
We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)
- Latent Autoregressive Source Separation [5.871054749661012]
This paper introduces vector-quantized Latent Autoregressive Source Separation (i.e., de-mixing an input signal into its constituent sources) without requiring additional gradient-based optimization or modifications of existing models.
Our separation method relies on the Bayesian formulation in which the autoregressive models are the priors, and a discrete (non-parametric) likelihood function is constructed by performing frequency counts over latent sums of addend tokens.
arXiv Detail & Related papers (2023-01-09T17:32:00Z)
- Deliberation of Streaming RNN-Transducer by Non-autoregressive Decoding [21.978994865937786]
The method performs a few refinement steps, where each step shares a transformer decoder that attends to both text features and audio features.
We show that, conditioned on hypothesis alignments of a streaming RNN-T model, our method obtains significantly more accurate recognition results than the first-pass RNN-T.
arXiv Detail & Related papers (2021-12-01T01:34:28Z)
- Cascaded Text Generation with Markov Transformers [122.76100449018061]
Two dominant approaches to neural text generation are fully autoregressive models, using serial beam search decoding, and non-autoregressive models, using parallel decoding with no output dependencies.
This work proposes an autoregressive model with sub-linear parallel time generation. Noting that conditional random fields with bounded context can be decoded in parallel, we propose an efficient cascaded decoding approach for generating high-quality output.
This approach requires only a small modification from standard autoregressive training, while showing competitive accuracy/speed tradeoff compared to existing methods on five machine translation datasets.
arXiv Detail & Related papers (2020-06-01T17:52:15Z)
- Imputer: Sequence Modelling via Imputation and Dynamic Programming [101.5705527605346]
Imputer is an iterative generative model, requiring only a constant number of generation steps independent of the number of input or output tokens.
We present a tractable dynamic programming training algorithm, which yields a lower bound on the log marginal likelihood.
arXiv Detail & Related papers (2020-02-20T18:21:30Z)