Dissecting User-Perceived Latency of On-Device E2E Speech Recognition
- URL: http://arxiv.org/abs/2104.02207v1
- Date: Tue, 6 Apr 2021 00:55:11 GMT
- Title: Dissecting User-Perceived Latency of On-Device E2E Speech Recognition
- Authors: Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang
Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen,
Michael L. Seltzer
- Abstract summary: We show that factors affecting token emission latency and endpointing behavior significantly impact user-perceived latency (UPL).
We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing and using the recently proposed alignment regularization.
- Score: 34.645194215436966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As speech-enabled devices such as smartphones and smart speakers become
increasingly ubiquitous, there is growing interest in building automatic speech
recognition (ASR) systems that can run directly on-device; end-to-end (E2E)
speech recognition models such as recurrent neural network transducers and
their variants have recently emerged as prime candidates for this task. Apart
from being accurate and compact, such systems need to decode speech with low
user-perceived latency (UPL), producing words as soon as they are spoken. This
work examines the impact of various techniques -- model architectures, training
criteria, decoding hyperparameters, and endpointer parameters -- on UPL. Our
analyses suggest that measures of model size (parameters, input chunk sizes),
or measures of computation (e.g., FLOPS, RTF) that reflect the model's ability
to process input frames are not always strongly correlated with observed UPL.
Thus, conventional algorithmic latency measurements might be inadequate in
accurately capturing latency observed when models are deployed on embedded
devices. Instead, we find that factors affecting token emission latency and
endpointing behavior significantly impact UPL. We achieve the best trade-off
between latency and word error rate when performing ASR jointly with
endpointing and using the recently proposed alignment regularization.
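To make the latency decomposition concrete, here is a minimal sketch (all names and timestamps are hypothetical, not from the paper's tooling) of how per-utterance UPL might be measured: the gap between the end of speech and the moment the transcript is final, which is governed by the later of the last token emission and the endpointer decision rather than by raw compute.

```python
from dataclasses import dataclass

@dataclass
class UtteranceTrace:
    """Timestamps (seconds) logged for one utterance; names are illustrative."""
    speech_end: float          # when the user actually stopped speaking
    token_emit_times: list     # wall-clock time each token was emitted
    endpoint_time: float       # when the endpointer declared end-of-query

def user_perceived_latency(trace: UtteranceTrace) -> float:
    """UPL: delay from end of speech until the final result is available.
    The transcript is final only once the last token has been emitted AND
    the endpointer has fired, so UPL is set by the slower of the two."""
    final_result_time = max(trace.token_emit_times[-1], trace.endpoint_time)
    return final_result_time - trace.speech_end

# Hypothetical utterance: speech ends at t = 2.0 s; the last token is emitted
# 150 ms later, and the endpointer waits a further 250 ms of trailing silence.
trace = UtteranceTrace(
    speech_end=2.00,
    token_emit_times=[0.55, 0.90, 1.40, 2.15],
    endpoint_time=2.40,
)
print(f"UPL = {user_perceived_latency(trace):.2f} s")  # UPL = 0.40 s
```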
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300 ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
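As a rough illustration of the cross-attention fusion described above (the paper's actual architecture and prediction heads differ; every name and weight here is made up), linguistic token states can attend over acoustic frame states, and a regression head on the fused state of the latest token can estimate the remaining time to EOU:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: linguistic queries over acoustic keys/values."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ values

# Hypothetical shapes: 5 decoded tokens (linguistic side), 40 acoustic frames.
d = 16
token_states = rng.normal(size=(5, d))    # stand-in for decoder/LM states
frame_states = rng.normal(size=(40, d))   # stand-in for encoder outputs

fused = cross_attention(token_states, frame_states, frame_states)

# A linear head on the most recent token's fused state could then regress
# the remaining time (in ms) until end-of-utterance.
w, b = rng.normal(size=d), 0.0
remaining_ms = float(fused[-1] @ w + b)
print(f"predicted time to EOU: {remaining_ms:.1f} ms (untrained, illustrative)")
```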
- Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems [17.160006765475988]
We propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) model.
We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model.
This results in a single E2E model that can be used during inference to perform frame filtering at low cost.
arXiv Detail & Related papers (2022-11-01T23:43:15Z)
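A minimal sketch of the "switch" connection idea above, assuming a toy EP head and stand-in encoder (none of this is the paper's code): the endpointer is wired so it can consume either raw feature frames or low-level ASR latents, so one model covers both cheap frame filtering and joint endpointing.

```python
import numpy as np

rng = np.random.default_rng(1)

def asr_low_level_encoder(frames):
    """Stand-in for the first ASR encoder layers; returns latent features."""
    W = rng.normal(size=(frames.shape[-1], 32))
    return np.tanh(frames @ W)

def endpointer(features, threshold=0.5):
    """Toy EP head: per-frame endpoint decision from whatever it is fed."""
    w = rng.normal(size=features.shape[-1])
    prob_endpoint = 1.0 / (1.0 + np.exp(-(features @ w)))   # sigmoid
    return prob_endpoint > threshold

def run_ep(frames, use_asr_latents: bool):
    """The 'switch' connection: route either raw frames or ASR latents to the EP."""
    ep_input = asr_low_level_encoder(frames) if use_asr_latents else frames
    return endpointer(ep_input)

frames = rng.normal(size=(100, 80))               # 100 hypothetical log-Mel frames
print(run_ep(frames, use_asr_latents=False)[:5])  # cheap standalone EP route
print(run_ep(frames, use_asr_latents=True)[:5])   # EP sharing ASR computation
```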
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
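The blockwise streaming recipe can be sketched as follows (toy encoder, toy label set, greedy CTC-style collapse; an illustration of the general idea, not the paper's model): each fixed-size block is scored non-autoregressively as soon as it arrives, so algorithmic latency is bounded by the block size rather than the utterance length.

```python
import numpy as np

rng = np.random.default_rng(2)

def score_block(block):
    """Stand-in for a blockwise encoder + NAR decoder: per-frame label logits."""
    W = rng.normal(size=(block.shape[-1], 30))  # 30 labels, id 0 = CTC blank
    return block @ W

def streaming_nar_decode(frames, block_size=16, left_context=4):
    """Decode block by block: all frames in a block are scored in parallel
    (non-autoregressively) the moment the block arrives."""
    hypothesis, prev = [], -1
    for start in range(0, len(frames), block_size):
        ctx = max(0, start - left_context)             # small lookback, no lookahead
        logits = score_block(frames[ctx:start + block_size])
        labels = logits.argmax(axis=-1)[start - ctx:]  # drop the context frames
        for lab in labels:                             # greedy CTC collapse:
            if lab != prev and lab != 0:               # merge repeats, drop blanks
                hypothesis.append(int(lab))
            prev = lab
    return hypothesis

audio = rng.normal(size=(64, 40))  # 64 hypothetical feature frames
print(streaming_nar_decode(audio))
```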
- A review of on-device fully neural end-to-end automatic speech recognition algorithms [20.469868150587075]
We review various end-to-end automatic speech recognition algorithms and their optimization techniques for on-device applications.
A variety of fully neural end-to-end speech recognition algorithms have been proposed.
We extensively discuss their structures, performance, and advantages compared to conventional algorithms.
arXiv Detail & Related papers (2020-12-14T22:18:08Z)
- Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints.
Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT (heuristic error assignment training) achieves better accuracy than PIT (permutation invariant training).
arXiv Detail & Related papers (2020-11-26T06:28:04Z)
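The HEAT-versus-PIT comparison above reduces to how reference utterances are assigned to output channels. A minimal sketch with a hypothetical 2-speaker loss matrix (values invented for illustration):

```python
from itertools import permutations

def pit_loss(losses):
    """Permutation-invariant training: try every reference-to-output assignment
    and keep the cheapest. losses[i][j] is the loss of output channel i
    against reference utterance j."""
    n = len(losses)
    return min(sum(losses[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

def heat_loss(losses):
    """Heuristic error assignment training (HEAT): fix the assignment up front
    (e.g., channel i takes the i-th reference in start-time order), avoiding
    the factorial search over permutations."""
    return sum(losses[i][i] for i in range(len(losses)))

# Hypothetical 2-speaker loss matrix (rows: output channels, cols: references).
losses = [[0.25, 1.75],
          [2.00, 0.50]]
print("PIT :", pit_loss(losses))   # 0.75 (identity assignment wins the search)
print("HEAT:", heat_loss(losses))  # 0.75 (the heuristic happens to match here)
```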
- Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition [66.47000813920619]
We propose a non-autoregressive end-to-end speech recognition system called LASO.
Because of the non-autoregressive property, LASO predicts each textual token in the sequence without depending on the other tokens.
We conduct experiments on the publicly available Chinese dataset AISHELL-1.
arXiv Detail & Related papers (2020-05-11T04:45:02Z)
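The non-autoregressive property can be illustrated with a toy decoder head (random weights, purely illustrative; not LASO's architecture): NAR decoding fills every position in one parallel step, whereas autoregressive decoding needs one serial step per token, which is the latency LASO is designed to remove.

```python
import numpy as np

rng = np.random.default_rng(3)
VOCAB, L, D = 50, 8, 16
W = rng.normal(size=(D, VOCAB))                  # shared toy output projection

def token_logits(state):
    """Stand-in output head: vocabulary logits from a state vector."""
    return state @ W

# Speech-conditioned position queries (what an encoder-attention stage yields).
position_states = rng.normal(size=(L, D))

# NAR decoding: all L positions are predicted in a single parallel step,
# because no token conditions on previously emitted tokens.
tokens_nar = (position_states @ W).argmax(axis=-1)
print("NAR, 1 parallel step:", tokens_nar)

# Autoregressive decoding, by contrast, takes L sequential steps, each one
# waiting on the previous token.
tokens_ar, carry = [], np.zeros(D)
for t in range(L):
    tok = int(token_logits(position_states[t] + carry).argmax())
    tokens_ar.append(tok)
    carry = 0.01 * tok * np.ones(D)              # toy dependence on the last token
print("AR, %d serial steps :" % L, tokens_ar)
```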
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance in controlled conditions.
Speaker verification on short utterances in uncontrolled, noisy environments is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- Scaling Up Online Speech Recognition Using ConvNets [33.75588539732141]
We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC).
We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy.
The system has almost three times the throughput of a well-tuned hybrid ASR baseline while also having lower latency and a better word error rate.
arXiv Detail & Related papers (2020-01-27T12:55:02Z)
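The future-context limit above can be illustrated with a plain 1-D convolution (the actual TDS blocks are 2-D time-depth separable convolutions; this sketch only shows how asymmetric padding bounds lookahead, and hence latency):

```python
import numpy as np

def conv1d_limited_lookahead(x, kernel, lookahead=1):
    """1-D convolution whose receptive field sees at most `lookahead` future
    frames: pad the past heavily and the future lightly, so output frame t
    depends only on x[t - (k - 1 - lookahead) : t + lookahead]."""
    k = len(kernel)
    past = k - 1 - lookahead
    padded = np.pad(x, (past, lookahead))
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

x = np.arange(10, dtype=float)        # toy 10-frame signal
kernel = np.array([0.25, 0.5, 0.25])  # k = 3

y = conv1d_limited_lookahead(x, kernel, lookahead=1)
print(y)         # each y[t] uses x[t-1..t+1]: one frame of algorithmic latency

y_causal = conv1d_limited_lookahead(x, kernel, lookahead=0)
print(y_causal)  # fully causal variant: zero future context, zero latency
```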
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER on the "clean" and "other" test sets of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
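Time-restricted self-attention comes down to a banded attention mask. A minimal sketch (window sizes are illustrative, not the paper's settings):

```python
import numpy as np

def time_restricted_mask(n_frames, left=8, right=2):
    """Attention mask for streaming: frame t may attend only to frames in
    [t - left, t + right]. A small `right` bounds the algorithmic latency,
    since frame t can be finalized once `right` future frames have arrived."""
    t = np.arange(n_frames)
    rel = t[None, :] - t[:, None]            # rel[i, j] = j - i
    return (rel >= -left) & (rel <= right)   # True = attention allowed

mask = time_restricted_mask(6, left=2, right=1)
print(mask.astype(int))
# Row t shows which frames t can see, e.g. row 3 -> frames 1..4 only.
# In a transformer encoder, disallowed positions get -inf before the softmax.
```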
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)