Low Latency ASR for Simultaneous Speech Translation
- URL: http://arxiv.org/abs/2003.09891v1
- Date: Sun, 22 Mar 2020 13:37:05 GMT
- Title: Low Latency ASR for Simultaneous Speech Translation
- Authors: Thai Son Nguyen, Jan Niehues, Eunah Cho, Thanh-Le Ha, Kevin Kilgour,
Markus Müller, Matthias Sperber, Sebastian Stüker, Alex Waibel
- Abstract summary: We have worked on several techniques for reducing the latency for both components, the automatic speech recognition and the speech translation module.
We combined run-on decoding with a technique for identifying stable partial hypotheses during stream decoding and a protocol for dynamic output updates.
This combination reduces the word-level latency, i.e. the time until a word is final and will never be updated again, from 18.1s to 1.1s without sacrificing performance.
- Score: 27.213294097841853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: User studies have shown that reducing the latency of our simultaneous lecture
translation system should be the most important goal. We therefore have worked
on several techniques for reducing the latency for both components, the
automatic speech recognition and the speech translation module. Since the
commonly used commitment latency is not appropriate in our case of continuous
stream decoding, we focused on word latency. We used it to analyze the
performance of our current system and to identify opportunities for
improvements. In order to minimize the latency we combined run-on decoding with
a technique for identifying stable partial hypotheses when stream decoding and
a protocol for dynamic output update that allows revising the most recent
parts of the transcription. This combination reduces the latency at word level,
where the words are final and will never be updated again in the future, from
18.1s to 1.1s without sacrificing performance in terms of word error rate.
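The abstract names three ingredients: run-on decoding, a stability criterion for partial hypotheses, and an update protocol that may revise only the most recent output. The paper contains no code; the sketch below is a minimal illustration of one common way these pieces fit together, using prefix agreement between successive hypotheses as the stability heuristic. The decoder API in the comments is a hypothetical placeholder, and the paper's exact stability criterion may differ.

```python
# Minimal sketch: run-on decoding with stable-prefix detection and a
# revisable-output protocol. Prefix agreement is one common stability
# heuristic; the paper's exact criterion may differ.

def stable_prefix(prev_hyp, curr_hyp):
    """Words that agree between two successive partial hypotheses are
    treated as stable; the disagreeing suffix may still be revised."""
    stable = []
    for prev_word, curr_word in zip(prev_hyp, curr_hyp):
        if prev_word != curr_word:
            break
        stable.append(curr_word)
    return stable

class RevisableTranscript:
    """Dynamic output update protocol: the transcript is finalized words
    plus a mutable tail; updates may rewrite only the tail."""
    def __init__(self):
        self.final = []  # words that will never be updated again
        self.tail = []   # most recent words, still revisable

    def update(self, hypothesis, num_stable):
        # Protocol invariant (assumed): finalized words are never retracted.
        num_stable = max(num_stable, len(self.final))
        self.final = hypothesis[:num_stable]
        self.tail = hypothesis[num_stable:]

def word_latency(spoken_time, finalized_time):
    """Word latency as measured in the abstract: the delay until a word is
    final and will never change (the 18.1s -> 1.1s figure)."""
    return finalized_time - spoken_time

# Driving loop (decoder.decode_so_far is a hypothetical placeholder):
# prev, transcript = [], RevisableTranscript()
# for chunk in audio_stream:
#     curr = decoder.decode_so_far(chunk)  # run-on decoding, no segmentation
#     transcript.update(curr, len(stable_prefix(prev, curr)))
#     prev = curr
```

Waiting for agreement between hypotheses trades a little latency for stability; emitting each raw hypothesis immediately and revising later sits at the other end of the same trade-off.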
Related papers
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- Average Token Delay: A Duration-aware Latency Metric for Simultaneous Translation [16.954965417930254]
We propose a novel latency evaluation metric for simultaneous translation called Average Token Delay (ATD).
We demonstrate its effectiveness through analyses simulating user-side latency based on Ear-Voice Span (EVS). (A simplified sketch of the idea follows below.)
arXiv Detail & Related papers (2023-11-24T08:53:52Z)
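ATD's precise, duration-aware formulation is given in the paper linked above. As a rough illustration only: a token-delay metric averages, over output tokens, the gap between each output token's emission end time and the end time of an associated input token. The index-based pairing below is a deliberate simplification, not ATD's actual pairing rule.

```python
def simplified_token_delay(src_end_times, out_end_times):
    """Toy token-delay metric: pair the j-th output token with the j-th
    source token (clamped at the end), then average the end-time gaps.
    ATD's real pairing also accounts for token durations; see the paper."""
    delays = []
    for j, out_end in enumerate(out_end_times):
        src_end = src_end_times[min(j, len(src_end_times) - 1)]
        delays.append(out_end - src_end)
    return sum(delays) / len(delays)
```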
- Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff [49.75167556773752]
Blockwise self-attentional encoder models have emerged as one promising end-to-end approach to simultaneous speech translation.
We propose a modified incremental blockwise beam search incorporating local agreement or hold-n policies for quality-latency control. (The hold-n rule is sketched below.)
arXiv Detail & Related papers (2023-09-20T14:59:06Z)
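Of the two policies named above, local agreement is the same prefix-agreement rule as stable_prefix in the earlier sketch, applied to the best hypotheses of consecutive blocks. The hold-n rule is even simpler; a minimal sketch:

```python
def hold_n(hypothesis, n):
    """Hold-n policy: commit all but the last n tokens of the current best
    hypothesis; the withheld suffix is re-decoded with the next block."""
    return hypothesis[:max(0, len(hypothesis) - n)]

# With n=2, a five-token partial hypothesis commits only its first three:
# hold_n(["we", "propose", "a", "modified", "search"], 2)
# -> ["we", "propose", "a"]
```

Larger n lowers the chance of committing words that later change, at the cost of latency; that is the quality-latency knob in the title.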
- Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance. (A toy illustration of the token interleaving follows below.)
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
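One way to picture the serialized output in the entry above is a single target sequence that interleaves transcript and translation tokens according to a textual alignment. The tag strings and the external-aligner assumption below are illustrative, not the paper's exact recipe:

```python
def interleave_targets(asr_tokens, st_tokens, alignment):
    """Build one serialized target for a single decoder: after each ASR
    token, emit the ST tokens aligned to it, tagged so the two streams can
    be told apart. `alignment` is a list of (asr_index, st_index) pairs,
    assumed to come from an external word aligner."""
    st_by_asr = {}
    for a, s in alignment:
        st_by_asr.setdefault(a, []).append(s)
    out = []
    # Simplification: ST tokens aligned to no ASR token are dropped here.
    for i, tok in enumerate(asr_tokens):
        out.append("<asr>" + tok)
        for s in st_by_asr.get(i, []):
            out.append("<st>" + st_tokens[s])
    return out
```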
- Average Token Delay: A Latency Metric for Simultaneous Translation [21.142539715996673]
We propose a novel latency evaluation metric called Average Token Delay (ATD).
We discuss the advantage of ATD using simulated examples and also investigate the differences between ATD and Average Lagging with simultaneous translation experiments.
arXiv Detail & Related papers (2022-11-22T06:45:13Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low-latency conditions. (A generic sketch of blockwise NAR decoding with CTC follows below.)
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
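The cited system's details are in the paper; as generic background, blockwise NAR ASR systems typically build on CTC, which labels every frame in one parallel pass and then collapses the result. A sketch of that collapsing step and a blockwise driving loop (the encoder call is a hypothetical placeholder):

```python
BLANK = "<blank>"

def ctc_greedy_collapse(frame_labels):
    """Standard CTC rule: merge consecutive repeats, then drop blanks.
    Non-autoregressive in that all frames are labeled in one parallel pass."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return out

# Blockwise streaming loop (encoder.label_frames is hypothetical):
# for block in audio_blocks:
#     emit(ctc_greedy_collapse(encoder.label_frames(block)))
```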
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning [60.20205278845412]
Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised.
This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation.
We propose a reinforcement learning based framework to train an agent that decides when to begin synthesising. (A minimal sketch of such a READ/SPEAK loop follows below.)
arXiv Detail & Related papers (2020-08-07T11:48:05Z)
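The agent in the entry above decides when enough text has been read to start synthesising. A minimal sketch of such a READ/SPEAK loop; `policy` and `synthesize` are hypothetical stand-ins for the trained agent and the TTS model:

```python
def incremental_tts(tokens, policy, synthesize):
    """Interleave READ (buffer the next input token) with SPEAK (synthesise
    audio for the oldest unspoken buffered token)."""
    buffer, spoken = [], 0
    for tok in tokens:
        buffer.append(tok)  # READ action
        while spoken < len(buffer) and policy(buffer, spoken) == "SPEAK":
            synthesize(buffer[spoken])  # emit audio incrementally
            spoken += 1
    for tok in buffer[spoken:]:  # input exhausted: flush the rest
        synthesize(tok)

# A trivial stand-in policy that keeps two tokens of lookahead:
# policy = lambda buf, spoken: "SPEAK" if len(buf) - spoken > 2 else "READ"
```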
- Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection [15.525314212209562]
We propose three latency reduction techniques for chunk-based incremental inference.
We show that our approach is also applicable to low-latency speech translation.
arXiv Detail & Related papers (2020-05-22T13:42:54Z)
- Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR [44.229256049718316]
Streaming attention-based sequence-to-sequence (S2S) models have been proposed to perform online speech recognition with linear-time decoding complexity.
In these models, the decisions to generate tokens are delayed compared to the actual acoustic boundaries since their unidirectional encoders lack future information.
We propose several strategies during training by leveraging external hard alignments extracted from the hybrid model.
Experiments on the Cortana voice search task demonstrate that our proposed methods can significantly reduce decoder-side latency and, in certain cases, even improve recognition accuracy.
arXiv Detail & Related papers (2020-04-10T12:24:49Z)
- Scaling Up Online Speech Recognition Using ConvNets [33.75588539732141]
We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC).
We improve the core TDS architecture to limit the future context and hence reduce latency while maintaining accuracy. (A generic sketch of bounding future context in a convolution follows below.)
The system has almost three times the throughput of a well-tuned hybrid ASR baseline while also having lower latency and a better word error rate.
arXiv Detail & Related papers (2020-01-27T12:55:02Z)
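A generic way to bound how far ahead a convolution looks, as the TDS entry above describes, is asymmetric padding: shift the kernel so at most `future` frames of right context are visible. A sketch in PyTorch; the layer sizes are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn

class LimitedFutureConv1d(nn.Module):
    """1-D convolution that sees at most `future` frames ahead: pad
    (kernel_size - 1 - future) frames on the left and `future` on the
    right, so per-layer algorithmic latency is bounded by `future`."""
    def __init__(self, channels, kernel_size, future=0):
        super().__init__()
        assert 0 <= future < kernel_size
        self.pads = (kernel_size - 1 - future, future)  # (left, right)
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):
        # x: (batch, channels, time); output keeps the same time length
        return self.conv(nn.functional.pad(x, self.pads))

# future=0 is a fully causal convolution; a small future > 0 trades bounded
# look-ahead for accuracy, which is the latency knob described above.
```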