StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model
- URL: http://arxiv.org/abs/2507.07803v2
- Date: Sun, 13 Jul 2025 01:40:13 GMT
- Title: StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model
- Authors: Shoutao Guo, Xiang Li, Mengge Liu, Wei Chen, Yang Feng
- Abstract summary: Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations. Existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). We propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM).
- Score: 20.978001644716063
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Streaming speech translation (StreamST) requires determining the appropriate timing, known as a policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, a setting referred to as simultaneous speech translation (SimulST). In practice, they must collaborate with segmentation models to accomplish StreamST, and the truncated speech segments constrain SimulST models to making policy decisions and generating translations from limited contextual information. Moreover, SimulST models struggle to learn effective policies due to the complexity of speech inputs and cross-lingual generation. To address these challenges, we propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM). Specifically, StreamUni incorporates speech Chain-of-Thought (CoT) to guide the LSLM in generating multi-stage outputs. Leveraging these multi-stage outputs, StreamUni simultaneously accomplishes speech segmentation, policy decision, and translation generation, completing StreamST without massive policy-specific training. Additionally, we propose a streaming CoT training method that enhances low-latency policy decisions and generation capabilities using limited CoT data. Experiments demonstrate that our approach achieves state-of-the-art performance on StreamST tasks.
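To make the multi-stage idea concrete, here is a minimal, self-contained sketch, not the authors' implementation: `lslm_multistage` is a hypothetical stand-in for the unified LSLM's speech-CoT prompting (transcription stage, then translation stage), and the stable-prefix commit rule is a simple proxy for the policy decision that StreamUni derives from the multi-stage outputs.

```python
# Minimal sketch of a StreamUni-style streaming loop (assumptions, not the
# paper's code). The model emits multi-stage CoT output for the audio prefix
# seen so far; the loop commits only the translation prefix that has
# stabilised across successive hypotheses.

def lslm_multistage(chunks):
    """Hypothetical stand-in for the unified LSLM: maps an audio prefix to
    (transcription, translation) via a speech-CoT prompt."""
    text = " ".join(chunks)        # pretend the ASR stage is perfect
    return text, text.upper()      # pretend "translation" = upper-casing

def common_prefix(a, b):
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def stream_translate(audio_stream):
    seen, committed, prev = [], "", ""
    for chunk in audio_stream:                 # READ one speech chunk
        seen.append(chunk)
        _, hyp = lslm_multistage(seen)
        stable = common_prefix(prev, hyp)      # policy: commit stable prefix
        if len(stable) > len(committed):       # WRITE the newly stable part
            yield stable[len(committed):]
            committed = stable
        prev = hyp
    if len(prev) > len(committed):             # source ended: flush the rest
        yield prev[len(committed):]

print("".join(stream_translate(["streaming", "speech", "translation"])))
```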
Related papers
- Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model. The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z)
- SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation [14.57248739077317]
This paper proposes SimulS2S-LLM, which trains speech LLMs offline and employs a test-time policy to guide simultaneous inference. SimulS2S-LLM achieves simultaneous speech-to-speech translation (Simul-S2ST) by predicting discrete output speech tokens and then synthesising output speech with a pre-trained vocoder.
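A rough sketch of what such a test-time policy could look like, with hypothetical stubs (the actual SimulS2S-LLM policy is more involved): read input chunks, apply a wait-k rule before emitting discrete speech tokens, and hand the finished token stream to a pretrained vocoder.

```python
# Hedged sketch: test-time simultaneous inference for an offline-trained
# speech LLM. `speech_llm_step` is a toy stub; the wait-k rule stands in
# for the paper's test-time policy.

def speech_llm_step(src_chunks, out_tokens):
    """Toy stub: next discrete speech token id given the source prefix
    and the tokens emitted so far."""
    return (len(src_chunks) * 31 + len(out_tokens)) % 1024

def simul_s2st(chunk_stream, k=3, tokens_per_read=2):
    src, out = [], []
    for chunk in chunk_stream:
        src.append(chunk)                  # READ one source chunk
        if len(src) < k:                   # wait-k: stay silent at first
            continue
        for _ in range(tokens_per_read):   # WRITE a few speech tokens
            out.append(speech_llm_step(src, out))
    return out                             # vocoder(out) would synthesise audio

print(simul_s2st(range(6)))
```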
arXiv Detail & Related papers (2025-04-22T01:05:32Z)
- Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture [14.056534007451763]
Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input. Existing LLM-based SimulST approaches incur significant computational overhead because a bidirectional speech encoder must repeatedly re-encode the growing input. We introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST), which uses a fully unidirectional architecture.
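The efficiency argument can be made concrete with a small sketch (assumed mechanics, not EASiST's code): with causal attention, states for already-seen frames never change, so each new chunk is encoded once and cached, whereas a bidirectional encoder must re-encode the whole prefix at every step.

```python
class CausalEncoderCache:
    """Incremental encoding with a unidirectional (causal) encoder: past
    states are reused, so cost per step is O(new frames) rather than
    O(total prefix) as with bidirectional re-encoding."""

    def __init__(self, encode_step):
        self.encode_step = encode_step   # (frame, past_states) -> state
        self.states = []

    def extend(self, new_frames):
        for f in new_frames:
            self.states.append(self.encode_step(f, self.states))
        return self.states

# toy step: a state is just the running frame count
enc = CausalEncoderCache(lambda f, past: len(past) + 1)
enc.extend([0.1, 0.2])
print(enc.extend([0.3]))   # [1, 2, 3]
```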
arXiv Detail & Related papers (2025-04-16T06:46:15Z)
- StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning [48.84039953531356]
StreamSpeech is a direct Simul-S2ST model that jointly learns translation and simultaneous policy.
Experiments on the CVSS benchmark demonstrate that StreamSpeech achieves state-of-the-art performance in both offline S2ST and Simul-S2ST tasks.
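Jointly learning translation and policy typically comes down to a weighted multi-task objective over shared representations; the sketch below shows the generic shape (task names and weights are assumptions, not the paper's exact recipe).

```python
# Generic multi-task objective of the kind joint Simul-S2ST models use.
def multitask_loss(losses, weights):
    """losses/weights: dicts keyed by task, e.g. ASR, text ST, speech units."""
    return sum(weights[t] * losses[t] for t in weights)

print(multitask_loss({"asr": 2.0, "st": 1.5, "unit": 3.0},
                     {"asr": 0.3, "st": 1.0, "unit": 1.0}))  # 5.1
```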
arXiv Detail & Related papers (2024-06-05T08:24:22Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
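As a loose illustration of the dual-encoder idea (names and fusion are hypothetical, not TransVIP's actual architecture): one embedding carries the speaker's voice characteristics, another the source isochrony, and both condition generation.

```python
# Hypothetical fusion of the two conditioning signals.
def fuse(decoder_state, voice_emb, isochrony_emb):
    """Additive fusion as a stand-in; a real model would learn a projection."""
    return [s + v + i for s, v, i in zip(decoder_state, voice_emb, isochrony_emb)]

print(fuse([0.5, 0.5], [0.1, 0.2], [0.0, 0.3]))  # [0.6, 1.0]
```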
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Speech Translation with Large Language Models: An Industrial Practice [64.5419534101104]
We introduce LLM-ST, a novel and effective speech translation model built upon a pre-trained large language model (LLM).
By integrating the LLM with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations.
Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST.
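A plausible instruction format for this kind of multi-task tuning is sketched below; the exact prompts and tags used by LLM-ST are not given here, so every token should be treated as an assumption.

```python
# Illustrative prompt/output shape for timestamped transcription + translation.
PROMPT = (
    "<|speech|>{audio_tokens}<|/speech|>\n"
    "Task: transcribe English with timestamps, then translate to Chinese.\n"
)
EXPECTED = (
    "[0.00-1.20] Hello everyone [1.20-2.05] welcome back\n"
    "大家好，欢迎回来\n"
)
print(PROMPT.format(audio_tokens="<tok_17><tok_242>..."), EXPECTED, sep="")
```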
arXiv Detail & Related papers (2023-12-21T05:32:49Z)
- Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation [51.399695200838586]
We propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder.
Experiments on {it, es, de}->en demonstrate the effectiveness of our approach, enabling the generation of one-to-many joint outputs with a single decoder for the first time.
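Serialized joint output usually means the single decoder emits transcript and translation in one tagged stream; the format below is an assumption used only to make the idea concrete.

```python
import re

# One serialized target stream for joint recognition + translation
# (tag names and timestamp placement are illustrative).
serialized = "<ts:0.0><asr:de> Guten Morgen <ts:1.1><st:en> Good morning"

# Splitting on tags recovers the per-task outputs.
print(re.findall(r"<(asr|st):(\w+)>([^<]*)", serialized))
```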
arXiv Detail & Related papers (2023-10-23T11:00:27Z)
- Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
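At the token level, the textual alignment determines when each translation token may be emitted; here is a toy interleaving under an assumed alignment convention.

```python
def interleave(asr_tokens, st_tokens, alignment):
    """alignment[j] = index of the last ASR token st_tokens[j] depends on
    (an assumed convention). Emit each ST token right after that ASR token."""
    out, j = [], 0
    for i, a in enumerate(asr_tokens):
        out.append(("asr", a))
        while j < len(st_tokens) and alignment[j] <= i:
            out.append(("st", st_tokens[j]))
            j += 1
    return out

print(interleave(["guten", "morgen"], ["good", "morning"], [0, 1]))
```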
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
- Adapting Offline Speech Translation Models for Streaming with Future-Aware Distillation and Inference [34.50987690518264]
A popular approach to streaming speech translation is to employ a single offline model with a wait-k policy to support different latency requirements.
However, a mismatch arises when a model trained on complete utterances is used for streaming inference on partial input.
We propose a new approach called Future-Aware Streaming Translation (FAST) that adapts an offline ST model for streaming input.
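The wait-k policy referred to above has a standard, well-known form: read k source units before writing anything, then alternate one write per read.

```python
def wait_k_action(num_read, num_written, k, source_done):
    """Standard wait-k rule (not FAST's full method): READ until the source
    lead over the target reaches k, then alternate WRITE/READ."""
    if source_done:
        return "WRITE"   # source exhausted: just finish the output
    return "WRITE" if num_read - num_written >= k else "READ"

# with k=2 the schedule is: READ READ WRITE READ WRITE READ WRITE ...
```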
arXiv Detail & Related papers (2023-03-14T13:56:36Z)
- UniST: Unified End-to-end Model for Streaming and Non-streaming Speech Translation [12.63410397982031]
We develop a unified model (UniST) which supports streaming and non-streaming speech translation.
Experiments on the most popular speech-to-text translation benchmark dataset, MuST-C, show that UniST achieves significant improvement for non-streaming ST.
arXiv Detail & Related papers (2021-09-15T15:22:10Z)
- RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer [33.876412404781846]
RealTranS is an end-to-end model for simultaneous speech translation.
It maps speech features into text space with a weighted-shrinking operation and a semantic encoder.
Experiments show that RealTranS with the Wait-K-Stride-N strategy outperforms prior end-to-end models.
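Wait-K-Stride-N generalises wait-k: after an initial wait of K chunks, N target tokens are emitted per additional chunk read. The sketch follows the common definition and may differ in detail from RealTranS's exact schedule.

```python
def wait_k_stride_n_action(num_read, num_written, k, n, source_done):
    """WRITE while the quota (num_read - k + 1) * n allows; n = 1 recovers
    plain wait-k."""
    if source_done:
        return "WRITE"
    allowed = max(0, num_read - k + 1) * n
    return "WRITE" if num_written < allowed else "READ"
```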
arXiv Detail & Related papers (2021-06-09T06:35:46Z)