A CTC Alignment-based Non-autoregressive Transformer for End-to-end
Automatic Speech Recognition
- URL: http://arxiv.org/abs/2304.07611v1
- Date: Sat, 15 Apr 2023 18:34:29 GMT
- Title: A CTC Alignment-based Non-autoregressive Transformer for End-to-end
Automatic Speech Recognition
- Authors: Ruchao Fan, Wei Chu, Peng Chang, and Abeer Alwan
- Abstract summary: We present a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR.
In CASS-NAT, word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAE) that are extracted from encoder outputs.
We find that CASS-NAT has a WER that is close to AT on various ASR tasks, while providing a 24x inference speedup.
- Score: 26.79184118279807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, end-to-end models have been widely used in automatic speech
recognition (ASR) systems. Two of the most representative approaches are
connectionist temporal classification (CTC) and attention-based encoder-decoder
(AED) models. Autoregressive transformers, variants of AED, adopt an
autoregressive mechanism for token generation and thus are relatively slow
during inference. In this paper, we present a comprehensive study of a CTC
Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for
end-to-end ASR. In CASS-NAT, word embeddings in the autoregressive transformer
(AT) are substituted with token-level acoustic embeddings (TAE) that are
extracted from encoder outputs with the acoustical boundary information offered
by the CTC alignment. TAE can be obtained in parallel, resulting in a parallel
generation of output tokens. During training, Viterbi-alignment is used for TAE
generation, and multiple training strategies are further explored to improve
the word error rate (WER) performance. During inference, an error-based
alignment sampling method is investigated in depth to reduce the alignment
mismatch in the training and testing processes. Experimental results show that
the CASS-NAT has a WER that is close to AT on various ASR tasks, while
providing a ~24x inference speedup. With and without self-supervised learning,
we achieve new state-of-the-art results for non-autoregressive models on
several datasets. We also analyze the behavior of the CASS-NAT decoder to
explain why it can perform similarly to AT. We find that TAEs have similar
functionality to word embeddings for grammatical structures, which might
indicate the possibility of learning some semantic information from TAEs
without a language model.
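The abstract's central idea, using CTC alignment boundaries to pull one acoustic embedding per output token out of the encoder frames, can be illustrated with a small sketch. This is a simplified stand-in, not the paper's extractor: it mean-pools the encoder frames inside each aligned segment, and the blank id `0`, the function names, and the toy data are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed blank token id (hypothetical; depends on the vocabulary)


def alignment_to_segments(alignment, blank=BLANK):
    """Turn a frame-level CTC alignment into (token, start, end) segments.

    `end` is exclusive. Runs of identical non-blank frames form one segment;
    blank frames are skipped, so a repeated token separated by a blank
    yields two separate segments.
    """
    segments = []
    frame = 0
    for token, run in groupby(alignment):
        length = len(list(run))
        if token != blank:
            segments.append((token, frame, frame + length))
        frame += length
    return segments


def mean_pool_embeddings(encoder_outputs, segments):
    """Average the encoder frames of each segment (a simplification).

    Each segment is processed independently of the others, which is why
    all token-level embeddings could be computed in parallel.
    """
    pooled = []
    for _, start, end in segments:
        frames = encoder_outputs[start:end]
        dim = len(frames[0])
        pooled.append([sum(f[d] for f in frames) / len(frames) for d in range(dim)])
    return pooled


# Toy example: 8 encoder frames of dimension 2, aligned to the tokens 3, 5, 3.
alignment = [0, 3, 3, 0, 5, 0, 0, 3]
encoder_outputs = [[float(t), 2.0 * t] for t in range(8)]

segments = alignment_to_segments(alignment)
embeddings = mean_pool_embeddings(encoder_outputs, segments)
```

Because no segment depends on a previously generated token, the number of output tokens and their embeddings are fixed as soon as the alignment is known, which is what allows single-step decoding.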
Related papers
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
- Unimodal Aggregation for CTC-based Speech Recognition [7.6112706449833505]
A unimodal aggregation (UMA) is proposed to segment and integrate the feature frames that belong to the same text token.
UMA learns better feature representations and shortens the sequence length, resulting in lower recognition error and computational complexity.
arXiv Detail & Related papers (2023-09-15T04:34:40Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
arXiv Detail & Related papers (2023-05-23T04:20:13Z)
- Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels.
We demonstrate that transducer-based ASR with CTC-like lattice achieves better results compared to standard RNN-T.
arXiv Detail & Related papers (2021-11-01T21:51:42Z)
- An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition [28.06475768075206]
Non-autoregressive mechanisms can significantly decrease inference time for speech transformers.
Previous work on the CTC alignment-based single step non-autoregressive transformer (CASS-NAT) has shown a large real time factor (RTF) improvement over autoregressive transformers (AT).
We propose several methods to improve the accuracy of the end-to-end CASS-NAT, followed by performance analyses.
arXiv Detail & Related papers (2021-06-18T02:58:30Z)
- N-Best ASR Transformer: Enhancing SLU Performance using Multiple ASR Hypotheses [0.0]
Spoken Language Understanding (SLU) parses speech into semantic structures like dialog acts and slots.
We show that our approach significantly outperforms the prior state-of-the-art when subjected to the low data regime.
arXiv Detail & Related papers (2021-06-11T17:29:00Z)
- Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition [46.69852287267763]
This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems.
The proposed method significantly reduces recognition errors and emission latency simultaneously.
The best MoChA system shows performance comparable to that of RNN-transducer (RNN-T).
arXiv Detail & Related papers (2021-02-28T08:17:38Z)
- Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour on the learned representations and also the consequences of fixing it by introducing a notion of self consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z)
- A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings [50.524054820564395]
We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation.
The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages.
arXiv Detail & Related papers (2020-12-03T19:24:42Z)