Learning When to Speak: Latency and Quality Trade-offs for Simultaneous
Speech-to-Speech Translation with Offline Models
- URL: http://arxiv.org/abs/2306.01201v1
- Date: Thu, 1 Jun 2023 23:29:23 GMT
- Title: Learning When to Speak: Latency and Quality Trade-offs for Simultaneous
Speech-to-Speech Translation with Offline Models
- Authors: Liam Dugan, Anshul Wadhawan, Kyle Spence, Chris Callison-Burch, Morgan
McGuire, Victor Zordan
- Abstract summary: We introduce a system for simultaneous S2ST targeting real-world use cases.
Our system supports translation from 57 languages to English with tunable parameters for dynamically adjusting the latency of the output.
We show that these policies achieve offline-level accuracy with minimal increases in latency over a Greedy (wait-$k$) baseline.
- Score: 18.34485337755259
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work in speech-to-speech translation (S2ST) has focused primarily on
offline settings, where the full input utterance is available before any output
is given. This, however, is not reasonable in many real-world scenarios. In
latency-sensitive applications, rather than waiting for the full utterance,
translations should be spoken as soon as the information in the input is
present. In this work, we introduce a system for simultaneous S2ST targeting
real-world use cases. Our system supports translation from 57 languages to
English with tunable parameters for dynamically adjusting the latency of the
output -- including four policies for determining when to speak an output
sequence. We show that these policies achieve offline-level accuracy with
minimal increases in latency over a Greedy (wait-$k$) baseline. We open-source
our evaluation code and interactive test script to aid future SimulS2ST
research and application development.
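The Greedy (wait-$k$) baseline mentioned above follows a well-known recipe (Ma et al., 2019): read $k$ source segments before speaking, then alternate between emitting one output token and reading one more segment. Below is a minimal sketch of that loop, assuming hypothetical `source_stream`, `translate_prefix`, and `speak` helpers; these names are illustrative and not the paper's actual API.
```python
def wait_k_policy(source_stream, translate_prefix, speak, k=3):
    """Greedy wait-k: stay exactly k source segments ahead of the output.

    source_stream:    iterator over incoming source segments (e.g. speech chunks)
    translate_prefix: fn(source_prefix, target_prefix) -> next target token
    speak:            fn(token) -> None, emits (synthesizes) one output token
    """
    source_prefix, target_prefix = [], []
    exhausted = False
    while True:
        # READ: ingest source until we are k segments ahead of the
        # target, or until the stream ends
        while not exhausted and len(source_prefix) < len(target_prefix) + k:
            try:
                source_prefix.append(next(source_stream))
            except StopIteration:
                exhausted = True
        # WRITE: emit one target token conditioned on the prefix read so far
        token = translate_prefix(source_prefix, target_prefix)
        if token == "<eos>":
            break
        target_prefix.append(token)
        speak(token)

# Toy usage with a dummy "translator" that echoes source segments
src = iter(["hallo", "welt", "wie", "gehts"])
dummy = lambda s, t: s[len(t)] if len(t) < len(s) else "<eos>"
wait_k_policy(src, dummy, print, k=2)
```
Larger $k$ raises latency but gives the model more source context before each output, illustrating the kind of latency-quality knob the abstract describes.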
Related papers
- Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice [52.747242157396315]
Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry. We introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities.
arXiv Detail & Related papers (2025-07-23T14:07:41Z)
- CMU's IWSLT 2025 Simultaneous Speech Translation System [10.40867923457809]
This paper presents CMU's submission to the IWSLT 2025 Simultaneous Speech Translation task. Our end-to-end speech-to-text system integrates a chunkwise causal Wav2Vec 2.0 speech encoder, an adapter, and Qwen2.5-7B-Instruct as the decoder. Experimental results demonstrate that our system achieves 44.3 BLEU for English-to-Chinese and 25.1 BLEU for English-to-German.
arXiv Detail & Related papers (2025-06-16T06:56:21Z)
- SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation [14.57248739077317]
This paper proposes SimulS2S-LLM, which trains speech LLMs offline and employs a test-time policy to guide simultaneous inference.
SimulS2S-LLM achieves simultaneous speech-to-speech translation (Simul-S2ST) by predicting discrete output speech tokens and then synthesising output speech using a pre-trained vocoder.
arXiv Detail & Related papers (2025-04-22T01:05:32Z)
- FASST: Fast LLM-based Simultaneous Speech Translation [9.65638081954595]
Simultaneous speech translation (SST) takes streaming speech input and generates text translations on the fly.
We propose FASST, a fast large language model (LLM)-based method for streaming speech translation.
Experimental results show that FASST achieves the best quality-latency trade-off.
arXiv Detail & Related papers (2024-08-18T10:12:39Z)
- PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems [7.326036800127981]
Multimodal language models that process both text and speech have potential for applications in spoken dialogue systems.
However, generating a spoken response requires the prior generation of a written response, and speech sequences are significantly longer than text sequences.
This study addresses these issues by extending the input and output sequences of the language model to support the parallel generation of text and speech.
arXiv Detail & Related papers (2024-06-18T09:23:54Z)
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Towards Real-World Streaming Speech Translation for Code-Switched Speech [7.81154319203032]
Code-switching (CS) is a common phenomenon in communication and can be challenging in many Natural Language Processing (NLP) settings.
We focus on two essential yet unexplored areas for real-world CS speech translation: streaming settings and translation to a third language.
arXiv Detail & Related papers (2023-10-19T11:15:02Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming speech translation (ST) and speaker diarization (SD) solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vectors.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another.
We present an end-to-end solution, from training data collection and modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Code-Switching without Switching: Language Agnostic End-to-End Speech Translation [68.8204255655161]
We treat speech recognition and translation as one unified end-to-end speech translation problem.
By training LAST, our language-agnostic end-to-end speech translation model, with both input languages, we decode speech into one target language, regardless of the input language.
arXiv Detail & Related papers (2022-10-04T10:34:25Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedups of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer [33.876412404781846]
RealTranS is an end-to-end model for simultaneous speech translation.
It maps speech features into text space with a weighted-shrinking operation and a semantic encoder.
Experiments show that RealTranS with the Wait-K-Stride-N strategy outperforms prior end-to-end models.
arXiv Detail & Related papers (2021-06-09T06:35:46Z)
- SimulEval: An Evaluation Toolkit for Simultaneous Translation [59.02724214432792]
Simultaneous translation, for both text and speech, targets real-time, low-latency scenarios.
SimulEval is an easy-to-use and general evaluation toolkit for both simultaneous text and speech translation.
arXiv Detail & Related papers (2020-07-31T17:44:41Z)
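Latency in simultaneous translation is commonly reported with Average Lagging (AL) from Ma et al. (2019), one of the metrics toolkits such as SimulEval compute. The following is a simplified, text-level sketch of AL, not SimulEval's actual implementation.
```python
def average_lagging(g, src_len, tgt_len):
    """AL for one sentence.

    g[t]: number of source tokens that had been read when target
          token t+1 was emitted (0-indexed), with len(g) == tgt_len.
    """
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau: 1-based index of the first target token emitted after the
    # full source was read; fall back to tgt_len if never reached
    tau = next((t + 1 for t, gt in enumerate(g) if gt >= src_len), tgt_len)
    # average gap between tokens read and an ideal, fully-synchronous policy
    return sum(g[t] - t / gamma for t in range(tau)) / tau

# Example: wait-3 on a 6-token source and 6-token target
# g = [3, 4, 5, 6, 6, 6] gives AL = 3.0 (the policy lags by 3 tokens)
print(average_lagging([3, 4, 5, 6, 6, 6], 6, 6))
```
Quality (e.g. BLEU or ASR-BLEU) plotted against AL is the standard way to visualize the latency-quality trade-off that the main paper and several of the related papers above report.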