Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR
- URL: http://arxiv.org/abs/2410.02597v1
- Date: Thu, 3 Oct 2024 15:38:20 GMT
- Title: Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR
- Authors: Hainan Xu, Travis M. Bartley, Vladimir Bataev, Boris Ginsburg
- Abstract summary: We present Hybrid-Autoregressive INference TrANsducers (HAINAN), a novel architecture for speech recognition.
HAINAN supports both autoregressive inference with all network components and non-autoregressive inference without the predictor.
- Score: 17.950722198543897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Hybrid-Autoregressive INference TrANsducers (HAINAN), a novel architecture for speech recognition that extends the Token-and-Duration Transducer (TDT) model. Trained with randomly masked predictor network outputs, HAINAN supports both autoregressive inference with all network components and non-autoregressive inference without the predictor. Additionally, we propose a novel semi-autoregressive inference paradigm that first generates an initial hypothesis using non-autoregressive inference, followed by refinement steps where each token prediction is regenerated using parallelized autoregression on the initial hypothesis. Experiments on multiple datasets across different languages demonstrate that HAINAN achieves efficiency parity with CTC in non-autoregressive mode and with TDT in autoregressive mode. In terms of accuracy, autoregressive HAINAN outperforms TDT and RNN-T, while non-autoregressive HAINAN significantly outperforms CTC. Semi-autoregressive inference further enhances the model's accuracy with minimal computational overhead, and even outperforms TDT results in some cases. These results highlight HAINAN's flexibility in balancing accuracy and speed, positioning it as a strong candidate for real-world speech recognition applications.
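The semi-autoregressive paradigm described in the abstract can be illustrated in a few lines. The sketch below is ours, not the paper's implementation: the callables (predictor, joint_argmax) and the reuse of the first pass's token-to-frame alignment during refinement are assumptions made only for illustration.

```python
from typing import Callable, List, Sequence


def semi_autoregressive_refine(
    aligned_enc: Sequence,                          # encoder output at the frame aligned to each token
    initial_hypothesis: List[int],                  # tokens from the non-autoregressive pass (no predictor)
    predictor: Callable[[List[int]], object],       # predictor network applied to a token prefix
    joint_argmax: Callable[[object, object], int],  # joint network -> most likely token id
    num_refinement_steps: int = 1,
) -> List[int]:
    """Refine an initial hypothesis with 'parallelized autoregression':
    every position is re-predicted conditioned on the preceding tokens of the
    current (fixed) hypothesis, so all positions can be recomputed at once."""
    hypothesis = list(initial_hypothesis)
    for _ in range(num_refinement_steps):
        # Each position depends only on the frozen current hypothesis, not on
        # tokens produced earlier in this loop, so these calls are batchable.
        hypothesis = [
            joint_argmax(aligned_enc[t], predictor(hypothesis[:t]))
            for t in range(len(hypothesis))
        ]
    return hypothesis
```

Because each refined position conditions only on the frozen current hypothesis rather than on tokens produced earlier in the same pass, the per-position predictor and joint calls are independent and can be batched, which is what keeps the refinement overhead small.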
Related papers
- Utilizing Multiple Inputs Autoregressive Models for Bearing Remaining Useful Life Prediction [3.448070371030467]
We introduce a novel multi-input autoregressive model for remaining useful life (RUL) prediction of bearings.
Through autoregressive iterations, the model attains a global receptive field, effectively overcoming limitations in generalization.
Empirical evaluation on the PHM2012 dataset demonstrates that our model achieves significantly lower Root Mean Square Error (RMSE) and Score than other backbone networks using similar autoregressive approaches.
arXiv Detail & Related papers (2023-11-26T09:50:32Z)
- Instance-based Learning with Prototype Reduction for Real-Time Proportional Myocontrol: A Randomized User Study Demonstrating Accuracy-preserving Data Reduction for Prosthetic Embedded Systems [0.0]
This work presents the design, implementation and validation of learning techniques based on the kNN scheme for gesture detection in prosthetic control.
The influence of parameterization and varying proportionality schemes is analyzed, utilizing an eight-channel sEMG armband.
arXiv Detail & Related papers (2023-08-21T20:15:35Z)
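As a generic illustration of a kNN gesture classifier over eight-channel sEMG features (not the paper's feature pipeline or its prototype-reduction method; the data and class count below are toy placeholders):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-ins: 200 feature vectors (one value per channel of an
# eight-channel sEMG armband) and their gesture labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))      # 8 features = 8 sEMG channels
y_train = rng.integers(0, 4, size=200)   # 4 hypothetical gesture classes

# Plain kNN classifier; prototype reduction (the paper's focus) would shrink
# X_train to a small set of representative samples before fitting, so the
# model fits the memory budget of a prosthetic embedded system.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

X_new = rng.normal(size=(1, 8))          # one new window of sEMG features
print(clf.predict(X_new))                # predicted gesture class
```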
- Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
arXiv Detail & Related papers (2023-05-23T04:20:13Z)
- Directed Acyclic Transformer Pre-training for High-quality Non-autoregressive Text Generation [98.37871690400766]
Non-AutoRegressive (NAR) text generation models have drawn much attention because of their significantly faster decoding speed and good generation quality in machine translation.
Existing NAR models lack proper pre-training, making them still far behind the pre-trained autoregressive models.
We propose the Pre-trained Directed Acyclic Transformer to promote prediction consistency in NAR generation.
arXiv Detail & Related papers (2023-04-24T02:30:33Z)
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- Non-Autoregressive Text Generation with Pre-trained Language Models [40.50508206201288]
We show that BERT can be employed as the backbone of a non-autoregressive generation (NAG) model to greatly improve performance.
We devise mechanisms to alleviate the two common problems of vanilla NAG models.
We propose a new decoding strategy, ratio-first, for applications where the output lengths can be approximately estimated beforehand.
arXiv Detail & Related papers (2021-02-16T15:30:33Z)
- Enriching Non-Autoregressive Transformer with Syntactic and Semantic Structures for Neural Machine Translation [54.864148836486166]
We propose to incorporate the explicit syntactic and semantic structures of languages into a non-autoregressive Transformer.
Compared with several state-of-the-art non-autoregressive models, our model is significantly faster while maintaining translation quality.
arXiv Detail & Related papers (2021-01-22T04:12:17Z)
- On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer [40.63693071222628]
We study the minimum word error rate (MWER) training of the Hybrid Autoregressive Transducer (HAT).
From experiments with around 30,000 hours of training data, we show that MWER training can improve the accuracy of HAT models.
arXiv Detail & Related papers (2020-10-23T21:16:30Z)
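For context, MWER training is commonly formulated as minimizing the expected number of word errors over an N-best list, with the mean error subtracted as a baseline for variance reduction; the notation below is this standard form in our own symbols and may differ from the paper's exact variant.

```latex
\mathcal{L}_{\mathrm{MWER}}
  = \sum_{y_i \in \mathrm{NBest}(x)} \hat{P}(y_i \mid x)\,
    \bigl( W(y_i, y^{*}) - \overline{W} \bigr),
\qquad
\hat{P}(y_i \mid x)
  = \frac{P(y_i \mid x)}{\sum_{y_j \in \mathrm{NBest}(x)} P(y_j \mid x)}
```

Here W(y_i, y*) counts the word errors of hypothesis y_i against the reference y*, and the mean error count over the N-best list serves as the baseline.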
- BERT Loses Patience: Fast and Robust Inference with Early Exit [91.26199404912019]
We propose Patience-based Early Exit as a plug-and-play technique to improve the efficiency and robustness of a pretrained language model.
Our approach improves inference efficiency as it allows the model to make a prediction with fewer layers.
arXiv Detail & Related papers (2020-06-07T13:38:32Z)
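The patience mechanism is simple enough to sketch. The snippet below is a minimal variant under assumed interfaces (a hypothetical internal classifier attached to every layer): inference stops once a fixed number of consecutive classifiers agree on the prediction.

```python
from typing import Callable, Sequence


def patience_based_early_exit(
    hidden: object,
    layers: Sequence[Callable[[object], object]],     # stacked encoder layers
    classifiers: Sequence[Callable[[object], int]],   # one internal classifier per layer
    patience: int = 3,
) -> int:
    """Return a prediction as soon as `patience` consecutive internal
    classifiers agree; otherwise fall through to the last layer's prediction."""
    prediction, last_prediction, streak = None, None, 0
    for layer, classifier in zip(layers, classifiers):
        hidden = layer(hidden)
        prediction = classifier(hidden)
        streak = streak + 1 if prediction == last_prediction else 1
        last_prediction = prediction
        if streak >= patience:
            break                     # early exit: remaining layers are skipped
    return prediction
```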
- Aligned Cross Entropy for Non-Autoregressive Machine Translation [120.15069387374717]
We propose aligned cross entropy (AXE) as an alternative loss function for training of non-autoregressive models.
AXE-based training of conditional masked language models (CMLMs) substantially improves performance on major WMT benchmarks.
arXiv Detail & Related papers (2020-04-03T16:24:47Z)
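Schematically, aligned cross entropy replaces position-wise cross entropy with the best score over monotonic alignments between target tokens and prediction positions, with unaligned positions absorbed by a special blank token; the expression below is a paraphrase in our own notation, not the paper's exact formulation.

```latex
\mathcal{L}_{\mathrm{AXE}}(Y, P)
  = \min_{\alpha\,\text{monotonic}}
    \Bigl[ -\sum_{i=1}^{|Y|} \log P_{\alpha(i)}(y_i)
           \;-\; \sum_{k \notin \operatorname{range}(\alpha)} \log P_{k}(\varepsilon) \Bigr]
```

Here P_k(·) is the model's distribution at output position k, ε is the blank token assigned to unaligned positions, and the minimization over monotonic alignments α is computed with dynamic programming.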