Low-Latency Sequence-to-Sequence Speech Recognition and Translation by
Partial Hypothesis Selection
- URL: http://arxiv.org/abs/2005.11185v2
- Date: Tue, 13 Oct 2020 16:18:44 GMT
- Title: Low-Latency Sequence-to-Sequence Speech Recognition and Translation by
Partial Hypothesis Selection
- Authors: Danni Liu, Gerasimos Spanakis, Jan Niehues
- Abstract summary: We propose three latency reduction techniques for chunk-based incremental inference.
We show that our approach is also applicable to low-latency speech translation.
- Score: 15.525314212209562
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Encoder-decoder models provide a generic architecture for
sequence-to-sequence tasks such as speech recognition and translation. While
offline systems are often evaluated on quality metrics like word error rates
(WER) and BLEU, latency is also a crucial factor in many practical use-cases.
We propose three latency reduction techniques for chunk-based incremental
inference and evaluate their efficiency in terms of accuracy-latency trade-off.
On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 second by
sacrificing 1% WER (6% rel.) compared to offline transcription. Although our
experiments use the Transformer, the hypothesis selection strategies are
applicable to other encoder-decoder models. To avoid expensive re-computation,
we use a unidirectionally-attending encoder. After an adaptation procedure to
partial sequences, the unidirectional model performs on-par with the original
model. We further show that our approach is also applicable to low-latency
speech translation. On How2 English-Portuguese speech translation, we reduce
latency to 0.7 second (-84% rel.) while incurring a loss of 2.4 BLEU points (5%
rel.) compared to the offline system.
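The abstract describes chunk-based incremental inference, where the audio received so far is re-decoded after each new chunk and only a stable part of the current hypothesis is committed. The sketch below (plain Python, with a toy decode_so_far stub standing in for the real encoder-decoder) illustrates one such selection rule: committing the prefix on which two consecutive chunk hypotheses agree. It is an illustration of the general idea under these assumptions, not the paper's exact strategies.
```python
# Minimal sketch of chunk-based incremental inference with partial hypothesis
# selection. `decode_so_far` stands in for a full encoder-decoder re-decode of
# all audio received up to the current chunk; here it is a toy stub.

def common_prefix(a, b):
    """Return the longest shared prefix of two token lists."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


def incremental_decode(chunks, decode_so_far):
    """Emit tokens chunk by chunk, committing only the stable prefix.

    A token is committed (and never retracted) once two consecutive chunk
    hypotheses agree on it -- one possible partial hypothesis selection rule.
    """
    committed = []          # tokens already shown to the user
    previous_hyp = []       # hypothesis after the previous chunk
    audio_so_far = []

    for chunk in chunks:
        audio_so_far.append(chunk)
        current_hyp = decode_so_far(audio_so_far)
        stable = common_prefix(previous_hyp, current_hyp)
        # Only output the newly stabilised tokens beyond what was committed.
        new_tokens = stable[len(committed):]
        committed.extend(new_tokens)
        if new_tokens:
            print("commit:", " ".join(new_tokens))
        previous_hyp = current_hyp

    # At the end of the stream, flush the rest of the final hypothesis.
    committed.extend(previous_hyp[len(committed):])
    return committed


if __name__ == "__main__":
    # Toy "decoder": hypotheses for growing audio prefixes (purely illustrative).
    fake_hyps = [
        ["the", "cat"],
        ["the", "cat", "sat", "on"],
        ["the", "cat", "sat", "on", "the", "mat"],
    ]
    decode_stub = lambda audio: fake_hyps[len(audio) - 1]
    print(incremental_decode([b"chunk1", b"chunk2", b"chunk3"], decode_stub))
```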
Related papers
- Optimizing Multi-Stuttered Speech Classification: Leveraging Whisper's Encoder for Efficient Parameter Reduction in Automated Assessment [0.14999444543328289]
This study highlights the contribution of the last encoder layer to the identification of disfluencies in stuttered speech.
It leads to a computationally efficient approach with 83.7% fewer parameters to train, making the proposed approach more adaptable to various dialects and languages.
arXiv Detail & Related papers (2024-06-09T13:42:51Z) - Extreme Encoder Output Frame Rate Reduction: Improving Computational
Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
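As a rough illustration of encoder output frame rate reduction, the numpy sketch below stacks consecutive encoder output frames into one wider frame; a real system would follow this with a learned projection and choose the stacking factor to hit a target output rate. This is an assumption-laden toy, not the paper's actual reduction layers.
```python
import numpy as np

def reduce_frame_rate(encoder_out, stack=8):
    """Stack every `stack` consecutive encoder frames into one wider frame.

    encoder_out: (num_frames, dim) array of encoder outputs.
    Returns an array of shape (ceil(num_frames / stack), dim * stack).
    A learned projection back to `dim` would normally follow; omitted here.
    """
    num_frames, dim = encoder_out.shape
    pad = (-num_frames) % stack                      # pad so frames divide evenly
    if pad:
        encoder_out = np.concatenate(
            [encoder_out, np.zeros((pad, dim), dtype=encoder_out.dtype)], axis=0
        )
    return encoder_out.reshape(-1, dim * stack)

# Example: 100 encoder frames at 40 ms each (4 s of speech) -> 13 stacked frames.
frames = np.random.randn(100, 256).astype(np.float32)
print(reduce_frame_rate(frames, stack=8).shape)  # (13, 2048)
```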
arXiv Detail & Related papers (2024-02-27T03:40:44Z) - Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which in contrast optimizes task-specific decoder networks on the output side.
By gradient-based optimization, DecT can be trained within several seconds and requires only one PTM query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a 200× speed-up.
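To make the "one PTM query per sample" idea concrete, the numpy sketch below caches output representations once and then fits a small task-specific decoder, here just a linear softmax classifier trained by plain gradient descent, on those cached features. The decoder architecture and the random data are placeholders, not DecT's actual design.
```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are output representations from a single frozen PTM query per
# training sample (cached once, so the PTM is never queried again during tuning).
n, dim, n_classes = 200, 64, 3
features = rng.normal(size=(n, dim))
labels = rng.integers(0, n_classes, size=n)

# Lightweight task-specific "decoder": a linear softmax classifier.
W = np.zeros((dim, n_classes))
b = np.zeros(n_classes)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lr = 0.1
for step in range(200):                      # trains in well under a second
    probs = softmax(features @ W + b)
    probs[np.arange(n), labels] -= 1.0       # gradient of cross-entropy wrt logits
    grad_W = features.T @ probs / n
    grad_b = probs.mean(axis=0)
    W -= lr * grad_W
    b -= lr * grad_b

preds = (features @ W + b).argmax(axis=1)
print("training accuracy:", (preds == labels).mean())
```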
arXiv Detail & Related papers (2022-12-16T11:15:39Z) - Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition
With Emformer [0.4588028371034407]
A frame-level model using an efficient augmented-memory Transformer block and a dynamic latency training method is employed for streaming automatic speech recognition.
With an average latency of 640 ms, our model achieves a relative WER reduction of 6.4% on test-clean and 3.0% on test-other compared with the truncated chunk-wise Transformer.
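Dynamic latency training is often realised by varying the attention chunk size during training; the numpy sketch below builds a chunk-wise self-attention mask for a chunk size sampled per batch. The mask layout, left-context rule, and chunk sizes are assumptions for illustration, not Emformer's exact scheme.
```python
import numpy as np

def chunkwise_attention_mask(num_frames, chunk_size, left_context_chunks=1):
    """Boolean mask (True = may attend) for chunk-wise streaming self-attention.

    Each frame may attend to all frames in its own chunk plus a limited number
    of preceding chunks; larger chunks give more context but higher latency.
    """
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for i in range(num_frames):
        chunk = i // chunk_size
        start = max(0, (chunk - left_context_chunks) * chunk_size)
        end = min(num_frames, (chunk + 1) * chunk_size)
        mask[i, start:end] = True
    return mask

# Dynamic-latency-style training loop: sample a different chunk size per batch
# so one model can later be run at several latency operating points.
rng = np.random.default_rng(0)
for batch in range(3):
    chunk_size = int(rng.choice([4, 8, 16]))        # e.g. 160 ms / 320 ms / 640 ms
    mask = chunkwise_attention_mask(num_frames=32, chunk_size=chunk_size)
    print(f"batch {batch}: chunk_size={chunk_size}, "
          f"mean attendable frames per row = {mask.sum(axis=1).mean():.1f}")
```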
arXiv Detail & Related papers (2022-03-29T14:31:06Z) - Latency Adjustable Transformer Encoder for Language Understanding [0.8287206589886879]
This paper proposes an efficient Transformer architecture that adjusts the inference computational cost adaptively with a desired inference latency speedup.
The proposed method detects less important hidden sequence elements (word-vectors) and eliminates them in each encoder layer using a proposed Attention Context Contribution (ACC) metric.
The proposed method is shown, both mathematically and experimentally, to improve the inference latency of BERT_base and GPT-2 by up to 4.8× and 3.72×, with less than a 0.75% accuracy drop and passable perplexity on average.
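The sketch below illustrates the general token-elimination idea: score each hidden word-vector by how much attention it receives and keep only the strongest ones before the next layer. The scoring rule here is an illustrative stand-in, not the paper's ACC metric.
```python
import numpy as np

def prune_tokens(hidden, attention, keep_ratio=0.75):
    """Drop the least-attended tokens before the next encoder layer.

    hidden:    (seq_len, dim) hidden word-vectors.
    attention: (seq_len, seq_len) row-stochastic attention weights.
    The per-token score (total attention received) is an illustrative
    stand-in for a contribution metric such as ACC.
    """
    scores = attention.sum(axis=0)                    # attention each token receives
    keep = max(1, int(round(keep_ratio * hidden.shape[0])))
    kept_idx = np.sort(np.argsort(scores)[-keep:])    # keep top tokens, preserve order
    return hidden[kept_idx], kept_idx

rng = np.random.default_rng(0)
seq_len, dim = 12, 16
hidden = rng.normal(size=(seq_len, dim))
attn = rng.random((seq_len, seq_len))
attn /= attn.sum(axis=1, keepdims=True)               # normalise rows

pruned, kept = prune_tokens(hidden, attn, keep_ratio=0.75)
print("kept token positions:", kept, "new length:", pruned.shape[0])
```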
arXiv Detail & Related papers (2022-01-10T13:04:39Z) - Transformer Based Deliberation for Two-Pass Speech Recognition [46.86118010771703]
Speech recognition systems must generate words quickly while also producing accurate results.
Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words, and a second-pass decoder that requires more context but is more accurate.
Previous work has established that a deliberation network can be an effective second-pass model.
arXiv Detail & Related papers (2021-01-27T18:05:22Z) - FastEmit: Low-latency Streaming ASR with Sequence-level Emission
Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z) - Listen Attentively, and Spell Once: Whole Sentence Generation via a
Non-Autoregressive Architecture for Low-Latency Speech Recognition [66.47000813920619]
We propose a non-autoregressive end-to-end speech recognition system called LASO.
Because of its non-autoregressive property, LASO predicts each textual token in the sequence without depending on the other tokens.
We conduct experiments on the publicly available Chinese dataset AISHELL-1.
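The toy numpy comparison below contrasts an autoregressive decoder, which needs one sequential step per output token, with a non-autoregressive decoder that predicts all positions in parallel from the encoder output. The scoring functions and dimensions are placeholders, not LASO's architecture.
```python
import numpy as np

rng = np.random.default_rng(0)
vocab, max_len, dim = 10, 6, 8
enc = rng.normal(size=(20, dim))              # stand-in for encoder output
proj = rng.normal(size=(dim, vocab))          # stand-in for an output projection

def autoregressive_decode():
    """One forward pass per output token: latency grows with sequence length."""
    tokens = []
    state = enc.mean(axis=0)
    for _ in range(max_len):
        logits = state @ proj                 # toy decoder step conditioned on state
        tok = int(logits.argmax())
        tokens.append(tok)
        state = state + enc[tok % len(enc)]   # toy dependence on the emitted token
    return tokens

def non_autoregressive_decode():
    """All positions predicted in parallel, independently of each other."""
    positions = enc[:max_len]                 # toy per-position queries
    logits = positions @ proj                 # (max_len, vocab) in a single pass
    return logits.argmax(axis=1).tolist()

print("AR :", autoregressive_decode())
print("NAR:", non_autoregressive_decode())
```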
arXiv Detail & Related papers (2020-05-11T04:45:02Z) - Low Latency ASR for Simultaneous Speech Translation [27.213294097841853]
We have worked on several techniques for reducing the latency of both components: the automatic speech recognition module and the speech translation module.
We combine run-on decoding with a technique for identifying stable partial hypotheses during stream decoding and a protocol for dynamic output updates.
This combination reduces the word-level latency, i.e., the time until a word is final and will never be updated again, from 18.1 s to 1.1 s without sacrificing performance.
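One way to read the quoted word-level latency is as the time from when a word is spoken until it becomes final under the dynamic output update protocol; the toy script below computes that quantity for a made-up sequence of partial outputs. All timestamps and hypotheses are fabricated for illustration.
```python
# Toy measurement of word-level latency under a dynamic output update protocol:
# partial outputs may still change, and a word's latency is the time from its
# (assumed) utterance until it is final, i.e. never revised by a later update.

updates = [
    # (time in seconds, current partial hypothesis)
    (1.0, ["the"]),
    (2.0, ["the", "cut"]),
    (3.0, ["the", "cat", "sat"]),
    (4.0, ["the", "cat", "sat", "on", "the", "mat"]),
]
spoken_time = {0: 0.5, 1: 1.2, 2: 1.9, 3: 2.4, 4: 2.8, 5: 3.3}  # assumed per word

final_hyp = updates[-1][1]
latencies = []
for pos, word in enumerate(final_hyp):
    # A word is final at the first update after which position `pos` never changes.
    final_time = None
    for t, hyp in updates:
        if len(hyp) > pos and all(
            len(h) > pos and h[pos] == hyp[pos] for ts, h in updates if ts >= t
        ):
            final_time = t
            break
    latencies.append(final_time - spoken_time[pos])

print("average word-level latency: %.2f s" % (sum(latencies) / len(latencies)))
```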
arXiv Detail & Related papers (2020-03-22T13:37:05Z) - Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)