CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer
for Speech Recognition
- URL: http://arxiv.org/abs/2010.14725v2
- Date: Thu, 11 Feb 2021 22:40:07 GMT
- Title: CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer
for Speech Recognition
- Authors: Ruchao Fan, Wei Chu, Peng Chang, Jing Xiao
- Abstract summary: We propose a CTC alignment-based single step non-autoregressive transformer (CASS-NAT) for speech recognition.
During inference, an error-based alignment sampling method is applied to the CTC output space, reducing the WER while retaining parallelism.
- Score: 29.55887842348706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a CTC alignment-based single step non-autoregressive transformer
(CASS-NAT) for speech recognition. Specifically, the CTC alignment contains the
information of (a) the number of tokens for decoder input, and (b) the time
span of acoustics for each token. This information is used to extract an acoustic
representation for each token in parallel, referred to as token-level acoustic
embedding, which substitutes the word embedding in the autoregressive transformer
(AT) to achieve parallel generation in the decoder. During inference, an
error-based alignment sampling method is proposed to be applied to the CTC
output space, reducing the WER and retaining the parallelism as well.
Experimental results show that the proposed method achieves WERs of 3.8%/9.1%
on the LibriSpeech test-clean/test-other sets without an external LM, and a CER of
5.8% on the Aishell1 Mandarin corpus. Compared to the AT baseline,
CASS-NAT shows a degradation in WER, but is 51.2x faster in terms
of RTF. When decoding with an oracle CTC alignment, the lower bound of WER
without LM reaches 2.3% on the test-clean set, indicating the potential of the
proposed method.
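The two pieces of information the abstract attributes to the CTC alignment, (a) the number of tokens and (b) the frame span of each token, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `alignment_to_spans`, the blank id of 0, and the choice to count only non-blank frames in each span are assumptions made for illustration. The abstract does not say how blank frames are assigned to tokens.

```python
from typing import List, Tuple

BLANK = 0  # assumed id of the CTC blank symbol (hypothetical choice)


def alignment_to_spans(
    alignment: List[int], blank: int = BLANK
) -> List[Tuple[int, int, int]]:
    """Collapse a frame-level CTC alignment into per-token frame spans.

    Returns one (token_id, start_frame, end_frame_exclusive) tuple per
    output token, following the usual CTC rule: consecutive repeats are
    merged, and a blank between two identical labels separates them.
    """
    spans: List[List[int]] = []
    prev = blank
    for t, label in enumerate(alignment):
        if label != blank:
            if label != prev:
                # A new token starts at this frame.
                spans.append([label, t, t + 1])
            else:
                # Same token continues; extend its span by one frame.
                spans[-1][2] = t + 1
        prev = label
    return [(tok, start, end) for tok, start, end in spans]


# Example: 8 frames, tokens 3, 3, 5 (the two 3s are separated by a blank).
spans = alignment_to_spans([0, 3, 3, 0, 3, 5, 5, 0])
print(len(spans))  # number of tokens for the decoder input: 3
print(spans)       # [(3, 1, 3), (3, 4, 5), (5, 5, 7)]
```

Each span could then be used to pool the encoder outputs of its frames into one vector per token, which is the role the abstract gives to the token-level acoustic embedding.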
Related papers
- A CTC Alignment-based Non-autoregressive Transformer for End-to-end
Automatic Speech Recognition [26.79184118279807]
We present a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR.
Word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAE) that are extracted from encoder outputs.
We find that CASS-NAT has a WER that is close to AT on various ASR tasks, while providing a 24x inference speedup.
arXiv Detail & Related papers (2023-04-15T18:34:29Z)
- Iterative pseudo-forced alignment by acoustic CTC loss for
self-supervised ASR domain adaptation [80.12316877964558]
High-quality data labeling for specific domains is costly and time-consuming for human annotators.
We propose a self-supervised domain adaptation method, based upon an iterative pseudo-forced alignment algorithm.
arXiv Detail & Related papers (2022-10-27T07:23:08Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with
Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast multi-decoder (MD) model that generates hidden intermediates (HI) by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding than the naive MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- An Improved Single Step Non-autoregressive Transformer for Automatic
Speech Recognition [28.06475768075206]
Non-autoregressive mechanisms can significantly decrease inference time for speech transformers.
Previous work on the CTC alignment-based single step non-autoregressive transformer (CASS-NAT) has shown a large real time factor (RTF) improvement over autoregressive transformers (AT).
We propose several methods to improve the accuracy of the end-to-end CASS-NAT, followed by performance analyses.
arXiv Detail & Related papers (2021-06-18T02:58:30Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and
Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
- FSR: Accelerating the Inference Process of Transducer-Based Models by
Applying Fast-Skip Regularization [72.9385528828306]
A typical transducer model decodes the output sequence conditioned on the current acoustic state.
The number of blank tokens in the prediction results accounts for nearly 90% of all tokens.
We propose a method named fast-skip regularization, which tries to align the blank position predicted by a transducer with that predicted by a CTC model.
arXiv Detail & Related papers (2021-04-07T03:15:10Z)
- Alignment Knowledge Distillation for Online Streaming Attention-based
Speech Recognition [46.69852287267763]
This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems.
The proposed method significantly reduces recognition errors and emission latency simultaneously.
The best MoChA system shows performance comparable to that of RNN-transducer (RNN-T).
arXiv Detail & Related papers (2021-02-28T08:17:38Z)
- Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus.
arXiv Detail & Related papers (2021-02-05T15:01:03Z)
- Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input [54.82369261350497]
We propose a CTC-enhanced NAR transformer, which generates the target sequence by refining predictions of the CTC module.
Experimental results show that our method outperforms all previous NAR counterparts and achieves 50x faster decoding speed than a strong AR baseline with only 0.0 to 0.3 absolute CER degradation on Aishell-1 and Aishell-2 datasets.
arXiv Detail & Related papers (2020-10-28T15:00:09Z)
- CTC-synchronous Training for Monotonic Attention Model [43.0382262234792]
Backward probabilities cannot be leveraged in the alignment process during training due to the left-to-right dependency in the decoder.
We propose CTC-synchronous training (CTC-ST), in which MoChA uses CTC alignments to learn optimal monotonic alignments.
The entire model is jointly optimized so that the expected boundaries from MoChA are synchronized with the alignments.
arXiv Detail & Related papers (2020-05-10T16:48:23Z)