Listen Attentively, and Spell Once: Whole Sentence Generation via a
Non-Autoregressive Architecture for Low-Latency Speech Recognition
- URL: http://arxiv.org/abs/2005.04862v4
- Date: Thu, 6 Aug 2020 01:26:15 GMT
- Title: Listen Attentively, and Spell Once: Whole Sentence Generation via a
Non-Autoregressive Architecture for Low-Latency Speech Recognition
- Authors: Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, Shuai
Zhang
- Abstract summary: We propose a non-autoregressive end-to-end speech recognition system called LASO.
Because of its non-autoregressive property, LASO predicts each textual token in the sequence without depending on the other tokens.
We conduct experiments on the publicly available Chinese dataset AISHELL-1.
- Score: 66.47000813920619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although attention-based end-to-end models have achieved promising
performance in speech recognition, the multiple forward passes required by
beam search increase inference time, which limits their practical application.
To address this issue, we propose a non-autoregressive end-to-end speech
recognition system called LASO (listen attentively, and spell once). Because of
its non-autoregressive property, LASO predicts each textual token in the
sequence without depending on the other tokens. Without beam search, the single
forward pass greatly reduces LASO's inference time. And because the model is
built on an attention-based feedforward structure, the computation can be
parallelized efficiently. We conduct experiments on the publicly available
Chinese dataset AISHELL-1. LASO achieves a character error rate of 6.4%, which
outperforms the state-of-the-art autoregressive transformer model (6.7%). The
average inference latency is 21 ms, which is 1/50 of that of the autoregressive
transformer model.
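To make the decoding behaviour concrete, below is a minimal sketch of the general non-autoregressive idea (an illustration, not the paper's released code; the shapes, the single attention head, and the learned position queries are simplifying assumptions). Each output position attends to the encoder states independently, so the whole sequence is produced in one forward pass with a single argmax and no beam search.

```python
# Minimal non-autoregressive decoding sketch (illustrative only; not the exact
# LASO architecture). Every output position is decoded independently.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_autoregressive_decode(enc_states, pos_queries, w_out):
    """Predict all output tokens in a single forward pass.

    enc_states:  (T, d) acoustic encoder outputs ("listen attentively")
    pos_queries: (L, d) one learned query per output position
    w_out:       (d, V) projection to the vocabulary ("spell once")
    """
    scores = pos_queries @ enc_states.T / np.sqrt(enc_states.shape[1])  # (L, T)
    context = softmax(scores) @ enc_states                              # (L, d)
    logits = context @ w_out                                            # (L, V)
    # One argmax per position, no beam search and no dependence on other tokens.
    return logits.argmax(axis=-1)

# Toy usage: 50 encoder frames, hidden size 8, max length 10, vocab size 20.
rng = np.random.default_rng(0)
tokens = non_autoregressive_decode(
    rng.normal(size=(50, 8)), rng.normal(size=(10, 8)), rng.normal(size=(8, 20))
)
print(tokens)  # 10 token ids emitted in parallel
```

In an autoregressive decoder the loop over output positions runs sequentially, each step conditioning on the previously emitted token; here the whole sequence is emitted at once, which is what removes the multi-pass beam-search cost described in the abstract.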
Related papers
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- LM-assisted keyword biasing with Aho-Corasick algorithm for Transducer-based ASR [3.841280537264271]
We propose a light on-the-fly method to improve automatic speech recognition performance.
We combine a bias list of named entities with a word-level n-gram language model, using a shallow-fusion approach based on the Aho-Corasick string matching algorithm (a toy sketch of the matching step appears after this list).
We achieve up to 21.6% relative improvement in the general word error rate with no practical difference in the inverse real-time factor.
arXiv Detail & Related papers (2024-09-20T13:53:37Z)
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- Prediction of speech intelligibility with DNN-based performance measures [9.883633991083789]
This paper presents a speech intelligibility model based on automatic speech recognition (ASR).
It combines phoneme probabilities from deep neural networks (DNN) and a performance measure that estimates the word error rate from these probabilities.
The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models.
arXiv Detail & Related papers (2022-03-17T08:05:38Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Dissecting User-Perceived Latency of On-Device E2E Speech Recognition [34.645194215436966]
We show that factors affecting token emission latency and endpointing behavior significantly impact user-perceived latency (UPL).
We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing, and using the recently proposed alignment regularization.
arXiv Detail & Related papers (2021-04-06T00:55:11Z)
- Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies [91.92060221982064]
We propose Non-Autoregressive Predictive Coding (NPC), a self-supervised method to learn a speech representation in a non-autoregressive manner.
NPC has a conceptually simple objective and can be implemented easily with the introduced Masked Convolution Blocks.
We show that the NPC representation is comparable to other methods in speech experiments on phonetic and speaker classification while being more efficient.
arXiv Detail & Related papers (2020-11-01T02:48:37Z)
- Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection [15.525314212209562]
We propose three latency reduction techniques for chunk-based incremental inference.
We show that our approach is also applicable to low-latency speech translation.
arXiv Detail & Related papers (2020-05-22T13:42:54Z)
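As referenced in the LM-assisted keyword biasing entry above, the following is a toy sketch of the Aho-Corasick matching step that such shallow-fusion biasing builds on (the helper names and example bias list are illustrative assumptions, not code from the cited paper): the automaton scans a decoded hypothesis once and reports every bias-list entry it contains, so the matched entries' scores could then be boosted during fusion.

```python
# Toy Aho-Corasick matcher (illustrative sketch only). It spots bias-list
# entries in a decoded hypothesis in a single left-to-right scan.
from collections import deque

def build_automaton(keywords):
    """Build the goto, failure, and output tables of an Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [set()]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto[state][ch] = len(goto)
                goto.append({})
                fail.append(0)
                out.append(set())
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass to fill in failure links and merge output sets.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            queue.append(nxt)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def find_keywords(text, goto, fail, out):
    """Return (end_index, keyword) pairs for every bias-list hit in text."""
    hits, state = [], 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend((i, word) for word in out[state])
    return hits

# Usage: flag named entities in an ASR hypothesis (example bias list is made up).
goto, fail, out = build_automaton(["aho", "corasick", "zurich"])
print(find_keywords("the aho corasick pass found zurich", goto, fail, out))
```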