Listen Attentively, and Spell Once: Whole Sentence Generation via a
Non-Autoregressive Architecture for Low-Latency Speech Recognition
- URL: http://arxiv.org/abs/2005.04862v4
- Date: Thu, 6 Aug 2020 01:26:15 GMT
- Title: Listen Attentively, and Spell Once: Whole Sentence Generation via a
Non-Autoregressive Architecture for Low-Latency Speech Recognition
- Authors: Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, Shuai
Zhang
- Abstract summary: We propose a non-autoregressive end-to-end speech recognition system called LASO.
Because of the non-autoregressive property, LASO predicts each textual token in the sequence without depending on other tokens.
We conduct experiments on the publicly available Chinese dataset AISHELL-1.
- Score: 66.47000813920619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although attention-based end-to-end models have achieved promising
performance in speech recognition, the multi-pass forward computation in
beam search increases inference time cost, which limits their practical
applications. To address this issue, we propose a non-autoregressive end-to-end
speech recognition system called LASO (listen attentively, and spell once).
Because of the non-autoregressive property, LASO predicts each textual token in
the sequence without depending on other tokens. Without beam search, the
single-pass forward computation greatly reduces LASO's inference time. And
because the model is based on an attention-based feedforward structure, the
computation can be parallelized efficiently. We conduct experiments on the
publicly available Chinese dataset AISHELL-1. LASO achieves a character error rate of
6.4%, which outperforms the state-of-the-art autoregressive transformer model
(6.7%). The average inference latency is 21 ms, which is 1/50 of the
autoregressive transformer model.
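The key idea in the abstract, decoding every output position in one parallel forward pass instead of an autoregressive beam-search loop, can be illustrated with a minimal NumPy sketch. All dimensions, weights, and the single-layer attention decoder below are toy assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the paper): 8 acoustic frames,
# hidden size 16, vocabulary of 10 characters, max output length 6.
T, H, V, L = 8, 16, 10, 6

encoder_states = rng.standard_normal((T, H))  # stand-in for the "listen" encoder output
queries = rng.standard_normal((L, H))         # one learned query per output position
proj = rng.standard_normal((H, V))            # projection to vocabulary logits

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Non-autoregressive "spell once": every output position attends to the
# encoder states and is decoded in a single parallel pass. No token depends
# on a previously emitted token, so there is no beam-search loop at all.
attn = softmax(queries @ encoder_states.T)    # (L, T) attention weights
context = attn @ encoder_states               # (L, H) per-position context
logits = context @ proj                       # (L, V) vocabulary scores
tokens = logits.argmax(axis=-1)               # all L tokens in one shot

print(tokens.shape)  # (6,): the whole sentence in one forward pass
```

An autoregressive decoder would instead run this computation once per emitted token, feeding each prediction back in, which is why a single parallel pass can cut latency by roughly the output length.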
Related papers
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- Speeding Up Speech Synthesis In Diffusion Models By Reducing Data Distribution Recovery Steps Via Content Transfer [3.2634122554914002]
Diffusion based vocoders have been criticised for being slow due to the many steps required during sampling.
We propose a setup where the targets are the different outputs of forward process time steps.
We show through extensive evaluation that the proposed technique generates high-fidelity speech in competitive time.
arXiv Detail & Related papers (2023-09-18T10:35:27Z)
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- Prediction of speech intelligibility with DNN-based performance measures [9.883633991083789]
This paper presents a speech intelligibility model based on automatic speech recognition (ASR).
It combines phoneme probabilities from deep neural networks (DNN) and a performance measure that estimates the word error rate from these probabilities.
The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models.
arXiv Detail & Related papers (2022-03-17T08:05:38Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Dissecting User-Perceived Latency of On-Device E2E Speech Recognition [34.645194215436966]
We show that factors affecting token emission latency and endpointing behavior significantly impact user-perceived latency (UPL).
We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing, and using the recently proposed alignment regularization.
arXiv Detail & Related papers (2021-04-06T00:55:11Z)
- Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies [91.92060221982064]
We propose Non-Autoregressive Predictive Coding (NPC), a self-supervised method to learn a speech representation in a non-autoregressive manner.
NPC has a conceptually simple objective and can be implemented easily with the introduced Masked Convolution Blocks.
We show that the NPC representation is comparable to other methods in speech experiments on phonetic and speaker classification while being more efficient.
arXiv Detail & Related papers (2020-11-01T02:48:37Z)
- Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection [15.525314212209562]
We propose three latency reduction techniques for chunk-based incremental inference.
We show that our approach is also applicable to low-latency speech translation.
arXiv Detail & Related papers (2020-05-22T13:42:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.