Streaming Models for Joint Speech Recognition and Translation
- URL: http://arxiv.org/abs/2101.09149v1
- Date: Fri, 22 Jan 2021 15:16:54 GMT
- Title: Streaming Models for Joint Speech Recognition and Translation
- Authors: Orion Weller, Matthias Sperber, Christian Gollan, and Joris Kluivers
- Abstract summary: We develop an end-to-end streaming ST model based on a re-translation approach and compare against standard cascading approaches.
We also introduce a novel inference method for the joint case, interleaving both transcript and translation in generation and removing the need to use separate decoders.
- Score: 11.657994715914748
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Using end-to-end models for speech translation (ST) has increasingly been the
focus of the ST community. These models condense the previously cascaded
systems by directly converting sound waves into translated text. However,
cascaded models have the advantage of including automatic speech recognition
output, useful for a variety of practical ST systems that often display
transcripts to the user alongside the translations. To bridge this gap, recent
work has made initial progress toward end-to-end models that produce both of
these outputs. However, all previous work has examined this problem only from
the consecutive (i.e., non-streaming) perspective, leaving it unclear whether
these approaches are effective in the more challenging streaming setting. We
develop an end-to-end streaming ST model based on a re-translation approach and
compare against standard cascading approaches. We also introduce a novel
inference method for the joint case, interleaving both transcript and
translation in generation and removing the need to use separate decoders. Our
evaluation across a range of metrics capturing accuracy, latency, and
consistency shows that our end-to-end models are statistically similar to
cascading models, while having half the number of parameters. We also find that
both systems provide strong translation quality at low latency, keeping 99% of
consecutive quality at a lag of just under a second.
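As a rough illustration of the two ideas described in the abstract (a re-translation streaming loop and a single decoder that interleaves transcript and translation tokens), here is a minimal Python sketch of the inference-side bookkeeping. The tag scheme, function names, and the `decode` interface are assumptions for illustration only, not the authors' implementation.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical tag scheme: the single joint decoder emits transcript tokens
# after an <asr> tag and translation tokens after an <st> tag; we
# demultiplex them before display.
SRC_TAG, TGT_TAG = "<asr>", "<st>"


def split_interleaved(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Demultiplex an interleaved token stream into (transcript, translation)."""
    transcript: List[str] = []
    translation: List[str] = []
    current = None
    for tok in tokens:
        if tok == SRC_TAG:
            current = transcript
        elif tok == TGT_TAG:
            current = translation
        elif current is not None:
            current.append(tok)
    return transcript, translation


def stream_retranslate(
    audio_chunks: Iterable[bytes],
    decode: Callable[[bytes], List[str]],
) -> Iterable[Tuple[str, str]]:
    """Re-translation loop: after each new audio chunk arrives, re-decode the
    entire audio prefix and overwrite the displayed transcript/translation,
    so earlier outputs may be revised as more context becomes available."""
    audio = b""
    for chunk in audio_chunks:
        audio += chunk
        tokens = decode(audio)  # one decoder produces both outputs, interleaved
        transcript, translation = split_interleaved(tokens)
        yield " ".join(transcript), " ".join(translation)
```

In practice, `decode` would wrap the streaming encoder-decoder model; the tag names and chunking granularity here are placeholders.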
Related papers
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel language modeling approach for text-to-speech synthesis (TTS) based on continuous-valued tokens.
MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff [49.75167556773752]
Blockwise self-attentional encoder models have emerged as one promising end-to-end approach to simultaneous speech translation.
We propose a modified incremental blockwise beam search incorporating local agreement or hold-$n$ policies for quality-latency control.
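For context on the hold-n and local-agreement policies named above, here is a minimal sketch of how such commit policies are commonly formulated in the simultaneous-translation literature; the function names are illustrative and this is not code from the paper.

```python
from typing import List


def hold_n(hypothesis: List[str], n: int) -> List[str]:
    """Hold-n policy: commit everything except the last n tokens of the
    current best hypothesis, since the tail is most likely to be revised."""
    return hypothesis[: max(len(hypothesis) - n, 0)]


def local_agreement(prev_hyp: List[str], curr_hyp: List[str]) -> List[str]:
    """Local agreement: commit only the longest common prefix of the
    hypotheses produced for two consecutive input blocks."""
    common: List[str] = []
    for a, b in zip(prev_hyp, curr_hyp):
        if a != b:
            break
        common.append(a)
    return common
```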
arXiv Detail & Related papers (2023-09-20T14:59:06Z)
- End-to-End Evaluation for Low-Latency Simultaneous Speech Translation [55.525125193856084]
We propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions.
This includes the segmentation of the audio as well as the run-time of the different components.
We also compare different approaches to low-latency speech translation using this framework.
arXiv Detail & Related papers (2023-08-07T09:06:20Z)
- Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech significantly improves inference latency, with speedups of up to 21.4x over the autoregressive baseline.
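The "repeatedly masks and predicts" step is in the spirit of mask-predict-style non-autoregressive decoding; the following is a generic, hypothetical sketch of such an iterative refinement loop over discrete units, not TranSpeech's actual implementation.

```python
from typing import Callable, List, Tuple

MASK = -1  # hypothetical id standing in for a masked discrete speech unit


def mask_predict(
    length: int,
    predict: Callable[[List[int]], Tuple[List[int], List[float]]],
    iterations: int = 4,
) -> List[int]:
    """Generic mask-predict loop: start fully masked, then repeatedly re-mask
    the least confident positions and let the model predict them again."""
    units = [MASK] * length
    for t in range(iterations):
        units, scores = predict(units)  # model fills every position with a confidence
        if t == iterations - 1:
            break
        # Linearly decaying schedule: re-mask fewer positions each iteration.
        n_mask = int(length * (iterations - 1 - t) / iterations)
        worst = sorted(range(length), key=lambda i: scores[i])[:n_mask]
        for i in worst:
            units[i] = MASK
    return units
```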
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks [8.651248939672769]
This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2022: low-resource and dialect speech translation.
We build an end-to-end model as our joint primary submission, and compare it against cascaded models that leverage a large fine-tuned wav2vec 2.0 model for ASR.
Our results highlight that self-supervised models trained on smaller sets of target data are more effective for low-resource end-to-end ST fine-tuning than large off-the-shelf models.
arXiv Detail & Related papers (2022-05-04T10:36:57Z)
- RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer [33.876412404781846]
RealTranS is an end-to-end model for simultaneous speech translation.
It maps speech features into text space with a weighted-shrinking operation and a semantic encoder.
Experiments show that RealTranS with the Wait-K-Stride-N strategy outperforms prior end-to-end models.
arXiv Detail & Related papers (2021-06-09T06:35:46Z)
- Tight Integrated End-to-End Training for Cascaded Speech Translation [40.76367623739673]
A cascaded speech translation model relies on discrete and non-differentiable transcription.
Direct speech translation is an alternative method to avoid error propagation.
This work explores the feasibility of collapsing the entire cascade components into a single end-to-end trainable model.
arXiv Detail & Related papers (2020-11-24T15:43:49Z)
- Consistent Transcription and Translation of Speech [13.652411093089947]
We explore the task of jointly transcribing and translating speech.
While high accuracy of transcript and translation are crucial, even highly accurate systems can suffer from inconsistencies between both outputs.
We find that direct models are poorly suited to the joint transcription/translation task, but that end-to-end models that feature a coupled inference procedure are able to achieve strong consistency.
arXiv Detail & Related papers (2020-07-24T19:17:26Z)
- Phone Features Improve Speech Translation [69.54616570679343]
End-to-end models for speech translation (ST) more tightly couple speech recognition (ASR) and machine translation (MT) than cascaded systems do.
We compare cascaded and end-to-end models across high, medium, and low-resource conditions, and show that cascades remain stronger baselines.
We show that phone features improve both architectures, closing the gap between end-to-end models and cascades and outperforming previous academic work by up to 9 BLEU in our low-resource setting.
arXiv Detail & Related papers (2020-05-27T22:05:10Z)