Paraformer: Fast and Accurate Parallel Transformer for
Non-autoregressive End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2206.08317v3
- Date: Thu, 30 Mar 2023 07:00:38 GMT
- Title: Paraformer: Fast and Accurate Parallel Transformer for
Non-autoregressive End-to-End Speech Recognition
- Authors: Zhifu Gao, Shiliang Zhang, Ian McLoughlin, Zhijie Yan
- Abstract summary: We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
- Score: 62.83832841523525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have recently dominated the ASR field. Although able to yield
good performance, they involve an autoregressive (AR) decoder to generate
tokens one by one, which is computationally inefficient. To speed up inference,
non-autoregressive (NAR) methods, e.g. single-step NAR, were designed, to
enable parallel generation. However, due to an independence assumption within
the output tokens, performance of single-step NAR is inferior to that of AR
models, especially with a large-scale corpus. There are two challenges in
improving single-step NAR: firstly, to accurately predict the number of output
tokens and extract hidden variables; secondly, to enhance the modeling of
interdependence between output tokens. To tackle both challenges, we propose a
fast and accurate parallel transformer, termed Paraformer. This utilizes a
continuous integrate-and-fire based predictor to predict the number of tokens
and generate hidden variables. A glancing language model (GLM) sampler then
generates semantic embeddings to enhance the NAR decoder's ability to model
context interdependence. Finally, we design a strategy to generate negative
samples for minimum word error rate training to further improve performance.
Experiments on the public AISHELL-1 and AISHELL-2 benchmarks and an
industrial-level 20,000-hour task demonstrate that the proposed Paraformer can
attain performance comparable to the state-of-the-art AR transformer, with more
than 10x speedup.
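For intuition, the following Python (PyTorch) sketch illustrates the two mechanisms the abstract describes: a continuous integrate-and-fire (CIF) style predictor that integrates per-frame weights and fires one hidden variable per predicted token, and a glancing-style sampler that replaces part of those hidden variables with ground-truth target embeddings during training. All names, shapes, the firing threshold, and the glance ratio are illustrative assumptions, not the authors' implementation; in the paper the per-frame weights come from a learned predictor head whose sum also serves as the token-count estimate, while the sketch simply takes them as given.

import torch


def cif_predict(encoder_out: torch.Tensor,
                weights: torch.Tensor,
                threshold: float = 1.0) -> torch.Tensor:
    """Integrate per-frame weights and 'fire' one embedding each time the
    accumulated weight crosses the threshold (a CIF-style predictor).

    encoder_out: (T, D) acoustic encoder outputs for one utterance.
    weights:     (T,)   non-negative per-frame weights, assumed in [0, 1]
                        (e.g. from a small conv + sigmoid head).
    returns:     (N, D) one integrated embedding per predicted token.
    """
    fired = []
    acc = 0.0                                   # accumulated weight so far
    frame_acc = torch.zeros_like(encoder_out[0])  # weighted sum of frames so far
    for h, w in zip(encoder_out, weights):
        w = float(w)
        if acc + w < threshold:                 # keep integrating this token
            acc += w
            frame_acc = frame_acc + w * h
        else:                                   # fire a token boundary
            spill = acc + w - threshold         # weight left over for next token
            fired.append(frame_acc + (threshold - acc) * h)
            acc, frame_acc = spill, spill * h   # carry the remainder over
    if not fired:
        return encoder_out.new_zeros(0, encoder_out.size(1))
    return torch.stack(fired)


def glancing_mix(hidden: torch.Tensor,
                 target_emb: torch.Tensor,
                 num_errors: int,
                 glance_ratio: float = 0.5) -> torch.Tensor:
    """Replace a fraction of the predicted hidden variables with ground-truth
    target embeddings, proportional to the errors of a first-pass decode, so
    the NAR decoder is exposed to inter-token context during training.

    hidden, target_emb: (N, D); num_errors: token errors of the first pass.
    """
    n = hidden.size(0)
    num_glance = min(n, int(glance_ratio * num_errors))
    idx = torch.randperm(n)[:num_glance]        # random positions to reveal
    mixed = hidden.clone()
    mixed[idx] = target_emb[idx]
    return mixed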
Related papers
- TAPIR: Learning Adaptive Revision for Incremental Natural Language
Understanding with a Two-Pass Model [14.846377138993645]
Recent neural network-based approaches for incremental processing mainly use RNNs or Transformers.
A restart-incremental interface that repeatedly passes longer input prefixes can be used to obtain partial outputs, while providing the ability to revise.
We propose the Two-pass model for AdaPtIve Revision (TAPIR) and introduce a method to obtain an incremental supervision signal for learning an adaptive revision policy.
arXiv Detail & Related papers (2023-05-18T09:58:19Z)
- TSNAT: Two-Step Non-Autoregressvie Transformer Models for Speech Recognition [69.68154370877615]
Non-autoregressive (NAR) models remove the temporal dependency between output tokens and can predict the entire output sequence in as little as one step.
To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT).
The results show that the TSNAT can achieve a competitive performance with the AR model and outperform many complicated NAR models.
arXiv Detail & Related papers (2021-04-04T02:34:55Z)
- Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input [54.82369261350497]
We propose a CTC-enhanced NAR transformer, which generates the target sequence by refining the predictions of the CTC module.
Experimental results show that our method outperforms all previous NAR counterparts and achieves 50x faster decoding than a strong AR baseline, with only 0.0 to 0.3 absolute CER degradation on the Aishell-1 and Aishell-2 datasets.
arXiv Detail & Related papers (2020-10-28T15:00:09Z)
- An EM Approach to Non-autoregressive Conditional Sequence Generation [49.11858479436565]
Autoregressive (AR) models have been the dominating approach to conditional sequence generation.
Non-autoregressive (NAR) models have been recently proposed to reduce the latency by generating all output tokens in parallel.
This paper proposes a new approach that jointly optimizes both AR and NAR models in a unified Expectation-Maximization framework.
arXiv Detail & Related papers (2020-06-29T20:58:57Z)
- Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition [66.47000813920617]
We propose a spike-triggered non-autoregressive transformer model for end-to-end speech recognition.
The proposed model can accurately predict the length of the target sequence and achieve a competitive performance.
The model even achieves a real-time factor of 0.0056, faster than all mainstream speech recognition models.
arXiv Detail & Related papers (2020-05-16T08:27:20Z)
- A Study of Non-autoregressive Model for Sequence Generation [147.89525760170923]
Non-autoregressive (NAR) models generate all the tokens of a sequence in parallel.
We propose knowledge distillation and source-target alignment to bridge the gap between AR and NAR models.
arXiv Detail & Related papers (2020-04-22T09:16:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.