Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input
- URL: http://arxiv.org/abs/2010.15025v2
- Date: Fri, 16 Apr 2021 03:42:42 GMT
- Title: Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input
- Authors: Xingchen Song, Zhiyong Wu, Yiheng Huang, Chao Weng, Dan Su, Helen Meng
- Abstract summary: We propose a CTC-enhanced NAR transformer, which generates target sequence by refining predictions of the CTC module.
Experimental results show that our method outperforms all previous NAR counterparts and achieves 50x faster decoding speed than a strong AR baseline with only 0.0 ~ 0.3 absolute CER degradation on Aishell-1 and Aishell-2 datasets.
- Score: 54.82369261350497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-autoregressive (NAR) transformer models have achieved significant
inference speedup but at the cost of inferior accuracy compared to
autoregressive (AR) models in automatic speech recognition (ASR). Most NAR
transformers take either a fixed-length sequence filled with MASK tokens or a
redundant sequence copied from encoder states as decoder input; such inputs
cannot provide efficient target-side information, leading to accuracy degradation.
To address this problem, we propose a CTC-enhanced NAR transformer, which
generates the target sequence by refining the predictions of the CTC module.
Experimental results show that our method outperforms all previous NAR
counterparts and achieves 50x faster decoding speed than a strong AR baseline
with only 0.0 ~ 0.3 absolute CER degradation on Aishell-1 and Aishell-2
datasets.
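The refinement idea can be illustrated with a minimal sketch: a CTC head over the encoder produces frame-level predictions, greedy collapse turns them into an initial token sequence, and that sequence (rather than MASK tokens or copied encoder states) is embedded and fed to the decoder for a single parallel refinement pass. The module sizes, blank id, and layer counts below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

BLANK = 0  # assumed CTC blank id


def ctc_greedy_collapse(log_probs):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    ids = log_probs.argmax(dim=-1).tolist()          # frame-level predictions
    out, prev = [], None
    for t in ids:
        if t != prev and t != BLANK:
            out.append(t)
        prev = t
    return out


class CTCEnhancedNAR(nn.Module):
    """Toy sketch: encoder + CTC head; the decoder refines the CTC hypothesis."""

    def __init__(self, vocab=1000, d_model=256):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.ctc_head = nn.Linear(d_model, vocab)
        self.embed = nn.Embedding(vocab, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.out = nn.Linear(d_model, vocab)

    @torch.no_grad()
    def recognize(self, feats):                      # feats: (1, T, d_model)
        enc = self.encoder(feats)
        ctc_logp = self.ctc_head(enc).log_softmax(-1)
        hyp = ctc_greedy_collapse(ctc_logp[0])       # initial token sequence from CTC
        if not hyp:
            return []
        dec_in = self.embed(torch.tensor([hyp]))     # CTC hypothesis as decoder input
        dec = self.decoder(dec_in, enc)              # single parallel refinement pass
        return self.out(dec).argmax(-1)[0].tolist()


# usage (toy): CTCEnhancedNAR().recognize(torch.randn(1, 80, 256))
```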
Related papers
- ASR Error Correction with Constrained Decoding on Operation Prediction [8.701142327932484]
We propose an ASR error correction method utilizing the predictions of correction operations.
Experiments on three public datasets demonstrate the effectiveness of the proposed approach in reducing the latency of the decoding process.
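A hedged sketch of the general idea of operation-constrained correction (hypothetical operation labels and interface, not the authors' exact method): predict an edit operation per recognized token and only invoke generation where a change is predicted, so most tokens are copied through without decoding.

```python
KEEP, DELETE, CHANGE = "K", "D", "C"  # hypothetical operation labels


def apply_operations(tokens, operations, generate_replacement):
    """Constrain correction by predicted operations: copy KEEP tokens, drop
    DELETE tokens, and only call the (expensive) generator for CHANGE tokens."""
    corrected = []
    for tok, op in zip(tokens, operations):
        if op == KEEP:
            corrected.append(tok)
        elif op == CHANGE:
            corrected.extend(generate_replacement(tok, corrected))
        # DELETE: skip the token entirely
    return corrected


# usage (toy): apply_operations(["i", "eat", "a", "appel"], ["K", "K", "K", "C"],
#                               lambda tok, ctx: ["apple"])
```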
arXiv Detail & Related papers (2022-08-09T09:59:30Z)
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
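The key mechanism in this summary is a predictor that estimates how many tokens to emit and produces token-level hidden variables from frame-level encoder outputs. A continuous-integrate-and-fire style accumulator is one way to realize this; the sketch below is illustrative and is not the paper's exact predictor.

```python
import numpy as np


def cif_like_integrate(frame_hidden, frame_weights, threshold=1.0):
    """Illustrative integrate-and-fire accumulator: sum per-frame weights and
    emit ("fire") one token-level hidden vector each time the sum crosses the
    threshold. The number of fired vectors is the predicted output length.
    Assumes each per-frame weight is below the threshold; the trailing partial
    accumulation is discarded."""
    acc, buf, tokens = 0.0, np.zeros_like(frame_hidden[0]), []
    for h, w in zip(frame_hidden, frame_weights):
        if acc + w < threshold:
            acc += w
            buf = buf + w * h
        else:
            spill = threshold - acc              # portion of this frame that completes a token
            tokens.append(buf + spill * h)
            acc = w - spill                      # remainder starts the next token
            buf = acc * h
    return np.stack(tokens) if tokens else np.empty((0, frame_hidden.shape[1]))


# usage (toy): cif_like_integrate(np.random.randn(6, 4),
#                                 np.array([0.4, 0.5, 0.3, 0.6, 0.4, 0.5]))
```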
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- D^2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention [27.354159713970322]
We propose a decoder-only detector called D2ETR.
In the absence of an encoder, the decoder directly attends to the fine-fused feature maps generated by the Transformer backbone.
D2ETR demonstrates low computational complexity and high detection accuracy in evaluations on the COCO benchmark.
arXiv Detail & Related papers (2022-03-02T04:21:12Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates hidden intermediates (HI) by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs, followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
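A hedged sketch of the pipeline this summary describes (the stage names are assumed callables, not the paper's code): CTC over the speech encoder yields a non-autoregressive transcript hypothesis, the ASR decoder consumes it in one parallel pass to produce hidden intermediates, and those intermediates feed the translation decoder.

```python
def fast_md_translate(speech_feats, encoder, ctc_collapse, asr_decoder, mt_decoder):
    """Compose the stages non-autoregressively; all arguments are assumed callables."""
    enc_states = encoder(speech_feats)                        # frame-level acoustic encoding
    ctc_hyp = ctc_collapse(enc_states)                        # NAR transcript hypothesis
    hidden_intermediates = asr_decoder(ctc_hyp, enc_states)   # one parallel pass over the hypothesis
    return mt_decoder(hidden_intermediates)                   # target-language output
```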
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring [83.32560748324667]
This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models.
We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of the shared encoder.
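A minimal sketch of the candidate-rescoring idea (hypothetical interface, not the paper's implementation): the NAR decoder emits several candidate sequences, e.g. for different predicted lengths, and the auxiliary shallow AR decoder, sharing the same encoder states, scores each candidate with teacher forcing so the best one can be selected.

```python
import math


def rescore_nar_candidates(candidates, ar_token_logprob):
    """Pick the NAR candidate with the highest length-normalized AR score.

    candidates: list of token-id sequences produced by the NAR decoder.
    ar_token_logprob(prefix, token): log-probability the shallow AR decoder
    assigns to `token` after `prefix` (with teacher forcing, all positions of a
    candidate can be scored in one parallel pass in practice)."""
    best, best_score = None, -math.inf
    for cand in candidates:
        score = sum(ar_token_logprob(cand[:i], tok) for i, tok in enumerate(cand))
        score /= max(len(cand), 1)                   # length normalization
        if score > best_score:
            best, best_score = cand, score
    return best
```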
arXiv Detail & Related papers (2021-09-09T16:50:16Z)
- An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition [28.06475768075206]
Non-autoregressive mechanisms can significantly decrease inference time for speech transformers.
Previous work on the CTC alignment-based single step non-autoregressive transformer (CASS-NAT) has shown a large real time factor (RTF) improvement over autoregressive transformers (AT).
We propose several methods to improve the accuracy of the end-to-end CASS-NAT, followed by performance analyses.
arXiv Detail & Related papers (2021-06-18T02:58:30Z)
- Instantaneous Grammatical Error Correction with Shallow Aggressive Decoding [57.08875260900373]
We propose Shallow Aggressive Decoding (SAD) to improve the online inference efficiency of the Transformer for instantaneous Grammatical Error Correction (GEC).
SAD aggressively decodes as many tokens as possible in parallel, instead of always decoding only one token per step, to improve computational parallelism.
Experiments on both English and Chinese GEC benchmarks show that aggressive decoding yields the same predictions with a significant speedup for online inference.
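Below is a minimal sketch of the aggressive-decoding loop, under the assumption of a `predict_all` interface that returns, in one parallel forward pass, the model's predicted token at every position of a given decoder input; the re-alignment after a mismatch is deliberately simplified relative to the paper.

```python
def shallow_aggressive_decode(predict_all, source, eos_id, max_len=200):
    """Sketch of aggressive decoding for GEC (assumed interface, not the authors' code).

    predict_all(decoder_input) -> list of predicted tokens, where position i is
    the model's prediction conditioned on decoder_input[:i] (one parallel pass).
    The source tokens serve as the draft, since GEC output mostly copies the input."""
    output = []
    draft = list(source)
    while draft and len(output) < max_len:
        preds = predict_all(output + draft)          # single parallel forward pass
        preds = preds[len(output):]                  # predictions aligned to draft positions
        k = 0
        while k < len(draft) and preds[k] == draft[k]:
            k += 1                                   # accept tokens that match the draft
        accepted = draft[:k]
        if k < len(preds):
            accepted.append(preds[k])                # keep the first diverging prediction
        output.extend(accepted)
        if accepted and accepted[-1] == eos_id:
            break
        draft = draft[k + 1:]                        # simplified re-alignment after the edit
    return output
```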
arXiv Detail & Related papers (2021-06-09T10:30:59Z)
- FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization [72.9385528828306]
A typical transducer model decodes the output sequence conditioned on the current acoustic state.
The number of blank tokens in the prediction results accounts for nearly 90% of all tokens.
We propose a method named fast-skip regularization, which tries to align the blank position predicted by a transducer with that predicted by a CTC model.
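Because the regularization aligns transducer blanks with CTC blanks, frames that the CTC head confidently marks as blank can be skipped at inference time. A hedged sketch with an assumed per-frame interface (and a simplified greedy loop that emits at most one token per frame), not the paper's actual code:

```python
def fast_skip_greedy_decode(ctc_blank_prob, transducer_emit, blank_id=0, skip_threshold=0.95):
    """Greedy transducer decoding that skips frames the CTC head marks as blank.

    ctc_blank_prob: per-frame CTC blank probabilities (list of floats).
    transducer_emit(frame_index, prev_token): token emitted by the transducer at
    this frame given the previous non-blank token (assumed interface)."""
    hypothesis, prev = [], None
    for t, p_blank in enumerate(ctc_blank_prob):
        if p_blank >= skip_threshold:
            continue                                  # skip: both models would emit blank here
        token = transducer_emit(t, prev)
        if token != blank_id:
            hypothesis.append(token)
            prev = token
    return hypothesis
```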
arXiv Detail & Related papers (2021-04-07T03:15:10Z)