SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation
- URL: http://arxiv.org/abs/2504.15509v1
- Date: Tue, 22 Apr 2025 01:05:32 GMT
- Title: SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation
- Authors: Keqi Deng, Wenxi Chen, Xie Chen, Philip C. Woodland
- Abstract summary: This paper proposes SimulS2S-LLM, which trains speech LLMs offline and employs a test-time policy to guide simultaneous inference. SimulS2S-LLM achieves simultaneous speech-to-speech translation (Simul-S2ST) by predicting discrete output speech tokens and then synthesising output speech using a pre-trained vocoder.
- Score: 14.57248739077317
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Simultaneous speech translation (SST) outputs translations in parallel with streaming speech input, balancing translation quality and latency. While large language models (LLMs) have been extended to handle the speech modality, streaming remains challenging as speech is prepended as a prompt for the entire generation process. To unlock LLM streaming capability, this paper proposes SimulS2S-LLM, which trains speech LLMs offline and employs a test-time policy to guide simultaneous inference. SimulS2S-LLM alleviates the mismatch between training and inference by extracting boundary-aware speech prompts that allow the speech input to be better matched with text input data. SimulS2S-LLM achieves simultaneous speech-to-speech translation (Simul-S2ST) by predicting discrete output speech tokens and then synthesising output speech using a pre-trained vocoder. An incremental beam search is designed to expand the search space of speech token prediction without increasing latency. Experiments on the CVSS speech data show that SimulS2S-LLM offers a better translation quality-latency trade-off than existing methods that use the same training data, such as improving ASR-BLEU scores by 3 points at similar latency.
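The sketch below illustrates the general idea described in the abstract: an offline-trained speech LLM driven at test time by a read/write policy over detected speech boundaries, with an incremental beam search expanding the discrete speech-token hypotheses inside each write step before the tokens are handed to a vocoder. All interfaces (SpeechLLM scoring, boundary detection, the wait-k threshold) are hypothetical stand-ins, not the authors' released implementation.

```python
# Hedged sketch of wait-k-style simultaneous inference with an offline-trained
# speech LLM. detect_boundaries() and llm_next_token_scores() are hypothetical
# placeholders standing in for the real boundary-aware prompt extraction and
# speech-LLM scoring; they are NOT the SimulS2S-LLM API.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Hypothesis:
    tokens: List[int] = field(default_factory=list)  # discrete output speech tokens
    score: float = 0.0

def detect_boundaries(speech_chunks: List[List[float]]) -> int:
    """Hypothetical boundary detector: number of complete boundaries seen so far."""
    return len(speech_chunks)  # stub: one boundary per streamed chunk

def llm_next_token_scores(chunks: List[List[float]], tokens: List[int]) -> Dict[int, float]:
    """Hypothetical speech-LLM call: log-probs for the next discrete speech token
    given the boundary-aware speech prompt and the tokens generated so far."""
    return {0: -0.1, 1: -0.5, 2: -1.2}  # stub scores

def incremental_beam_search(chunks, beams, beam_size=3, steps=4):
    """Expand each beam by a few tokens within one WRITE action and keep the top
    beams, so the search space grows without adding extra READ latency."""
    for _ in range(steps):
        candidates = []
        for hyp in beams:
            for tok, logp in llm_next_token_scores(chunks, hyp.tokens).items():
                candidates.append(Hypothesis(hyp.tokens + [tok], hyp.score + logp))
        beams = sorted(candidates, key=lambda h: h.score, reverse=True)[:beam_size]
    return beams

def simul_s2st(speech_stream, wait_k=3):
    """READ chunks until wait_k boundaries are available, then alternate READ/WRITE;
    each WRITE emits speech tokens that a pre-trained vocoder would synthesise."""
    chunks, beams, emitted = [], [Hypothesis()], []
    for chunk in speech_stream:
        chunks.append(chunk)  # READ action: consume more streaming speech
        if detect_boundaries(chunks) >= wait_k + len(emitted):
            beams = incremental_beam_search(chunks, beams)  # WRITE action
            emitted.append(beams[0].tokens[-4:])  # new tokens passed to the vocoder
    return emitted

if __name__ == "__main__":
    dummy_stream = [[0.0] * 160 for _ in range(6)]  # six dummy speech chunks
    print(simul_s2st(dummy_stream))
```

In this toy loop the policy is a simple wait-k rule over boundary counts; the paper's test-time policy and boundary-aware prompt extraction are more involved, but the READ/WRITE structure and the per-write incremental beam expansion follow the same pattern.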
Related papers
- Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture [14.056534007451763]
Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input. Existing LLM-based SimulST approaches incur significant computational overhead due to repeated re-encoding with a bidirectional speech encoder. We introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST) with a fully unidirectional architecture.
arXiv Detail & Related papers (2025-04-16T06:46:15Z) - SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer [68.78023656892319]
This paper presents a dual-stream text-to-speech (TTS) model, SyncSpeech, capable of receiving streaming text input from upstream models while simultaneously generating streaming speech. SyncSpeech has the following advantages: low latency, as it begins generating streaming speech upon receiving the second text token; and high efficiency, as it decodes all speech tokens corresponding to each arriving text token in one step.
arXiv Detail & Related papers (2025-02-16T12:14:17Z) - FASST: Fast LLM-based Simultaneous Speech Translation [9.65638081954595]
Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly.
We propose FASST, a fast large language model based method for streaming speech translation.
Experiment results show that FASST achieves the best quality-latency trade-off.
arXiv Detail & Related papers (2024-08-18T10:12:39Z) - A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z) - StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning [48.84039953531356]
StreamSpeech is a direct Simul-S2ST model that jointly learns translation and a simultaneous policy.
Experiments on CVSS benchmark demonstrate that StreamSpeech achieves state-of-the-art performance in both offline S2ST and Simul-S2ST tasks.
arXiv Detail & Related papers (2024-06-05T08:24:22Z) - BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing [35.31866559807704]
Modality alignment between speech and text remains an open problem.
We propose the BLSP approach that bootstraps Language-Speech Pre-training via behavior alignment of continuation writing.
We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
arXiv Detail & Related papers (2023-09-02T11:46:05Z) - SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)