Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice
- URL: http://arxiv.org/abs/2507.17527v3
- Date: Sun, 27 Jul 2025 05:17:25 GMT
- Title: Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice
- Authors: Shanbo Cheng, Yu Bao, Zhichao Huang, Yu Lu, Ningxin Peng, Lu Xu, Runsheng Yu, Rong Cao, Yujiao Du, Ting Han, Yuxiang Hu, Zeyang Li, Sitong Liu, Shengtao Ma, Shiguang Pan, Jiongchen Xiao, Nuo Xu, Meng Yang, Rong Ye, Yiming Yu, Jun Zhang, Ruofei Zhang, Wanyi Zhang, Wenhao Zhu, Liehao Zou, Lu Lu, Yuxuan Wang, Yonghui Wu
- Abstract summary: Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry. We introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities.
- Score: 52.747242157396315
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while slashing the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, a reduction of roughly 70% that drastically enhances practical usability.
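The abstract reports a tunable balance between translation accuracy and latency but does not spell out the decoding policy. As a purely illustrative sketch (not Seed-LiveInterpret's method), the wait-k style read/write loop below shows the shape of that trade-off: the system buffers k source chunks before emitting translated speech, so a larger k buys context at the cost of delay. All functions here are hypothetical stand-ins; an end-to-end duplex model would fold recognition, translation, and synthesis into a single network.

```python
from typing import Iterator, List

# --- hypothetical stubs; a real system would use streaming ASR/MT/TTS models ---
def recognize_chunk(chunk: bytes) -> str:
    return chunk.decode()                      # pretend each chunk is one word

def translate_prefix(prefix: List[str]) -> List[str]:
    return [w.upper() for w in prefix]         # pretend translation is upper-casing

def synthesize(tokens: List[str]) -> bytes:
    return " ".join(tokens).encode()           # pretend TTS returns audio bytes

def simultaneous_interpret(audio_chunks: Iterator[bytes], k: int = 3) -> Iterator[bytes]:
    """Wait-k read/write policy: buffer k chunks, then emit after every chunk.
    Larger k improves translation quality; smaller k lowers latency."""
    source_prefix: List[str] = []
    emitted = 0                                # target tokens already spoken
    for i, chunk in enumerate(audio_chunks):
        source_prefix.append(recognize_chunk(chunk))
        if i + 1 < k:
            continue                           # still buffering source context
        target = translate_prefix(source_prefix)
        if len(target) > emitted:              # speak only the newly decided tokens
            yield synthesize(target[emitted:])
            emitted = len(target)

if __name__ == "__main__":
    stream = (w.encode() for w in "guten morgen meine damen und herren".split())
    for speech in simultaneous_interpret(stream, k=2):
        print(speech)
```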
Related papers
- Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing [15.134076873312809]
A cross-lingual dubbing system translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. We propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech based on the predicted units and source identity with a conditional flow matching model.
arXiv Detail & Related papers (2025-05-27T08:43:28Z)
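Dub-S2ST's final stage synthesizes speech from the predicted units with a conditional flow matching model. In general, flow-matching sampling integrates a learned velocity field from noise (t = 0) toward data (t = 1); the toy Euler sampler below illustrates only that mechanic, with `toy_velocity` as a hypothetical stand-in for the trained network, and is not the paper's implementation.

```python
import numpy as np

def toy_velocity(x: np.ndarray, t: float, cond: np.ndarray) -> np.ndarray:
    # A trained model would predict the probability-flow velocity given the
    # current sample, time, and conditioning; this toy field simply drifts
    # the sample toward the conditioning vector.
    return cond - x

def cfm_sample(cond: np.ndarray, steps: int = 50, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(cond.shape)              # Gaussian noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * toy_velocity(x, i * dt, cond)   # Euler step along the flow
    return x                                         # drifted toward the condition

print(cfm_sample(np.ones(4)))
```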
- Direct Speech to Speech Translation: A Review [0.0]
Speech-to-speech translation (S2ST) is a transformative technology that bridges global communication gaps. Traditional cascade models that rely on automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) components suffer from error propagation, increased latency, and loss of prosody. Direct S2ST models retain speaker identity, reduce latency, and improve translation naturalness by preserving vocal characteristics and prosody.
arXiv Detail & Related papers (2025-03-03T06:48:22Z)
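The review's central contrast is between cascaded and direct systems. The toy sketch below (all components are stand-ins, not real library calls) makes the error-propagation point concrete: in a cascade each stage consumes the previous stage's possibly-wrong text output, while a direct model maps speech to speech with no text bottleneck.

```python
# --- hypothetical stubs so the contrast runs end to end ---
def asr(audio: bytes) -> str:          return audio.decode()   # recognition
def mt(text: str) -> str:              return text[::-1]       # toy "translation"
def tts(text: str) -> bytes:           return text.encode()    # synthesis
def s2st_model(audio: bytes) -> bytes: return audio[::-1]      # one-step toy model

def cascaded_s2st(audio: bytes) -> bytes:
    text = asr(audio)            # stage 1: any recognition error starts here
    translated = mt(text)        # stage 2: errors compound through translation
    return tts(translated)       # stage 3: source prosody and voice are lost

def direct_s2st(audio: bytes) -> bytes:
    # A single model maps source speech to target speech, so there is no
    # intermediate text bottleneck and vocal characteristics can be preserved.
    return s2st_model(audio)

print(cascaded_s2st(b"hello"), direct_s2st(b"hello"))
```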
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose NAST-S2X, a novel non-autoregressive generation framework for simultaneous speech translation.
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
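The 28-times decoding speedup reported for NAST-S2X comes from non-autoregressive generation. The toy sketch below, assuming a fixed output length T and a hypothetical `toy_model` standing in for one decoder forward pass, shows why: autoregressive decoding needs T sequentially dependent calls, while a non-autoregressive decoder's predictions are independent and run as one batched pass.

```python
import random

def toy_model(inputs: list) -> int:
    """Hypothetical stand-in for one decoder forward pass."""
    random.seed(sum(inputs) + len(inputs))     # deterministic toy output
    return random.randrange(100)

def autoregressive_decode(T: int) -> list:
    out = []
    for _ in range(T):
        out.append(toy_model(out))   # each call depends on the previous outputs,
    return out                       # so the T calls must run one after another

def non_autoregressive_decode(T: int) -> list:
    # Each position depends only on its index, so all T predictions are
    # independent; a real model computes them in a single parallel pass.
    return [toy_model([i]) for i in range(T)]

print(autoregressive_decode(8))
print(non_autoregressive_decode(8))
```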
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
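A minimal sketch of the dual-encoder idea stated above: one embedding for voice characteristics and one for timing (isochrony), both conditioning the target-speech decoder. All modules and dimensions are hypothetical stand-ins, not TransVIP's actual architecture.

```python
import torch
import torch.nn as nn

class DualConditionDecoder(nn.Module):
    """Decode target speech frames conditioned on voice and timing embeddings."""
    def __init__(self, d_content: int = 256, d_voice: int = 128,
                 d_iso: int = 16, d_out: int = 80):
        super().__init__()
        self.proj = nn.Linear(d_content + d_voice + d_iso, d_out)

    def forward(self, content, voice_emb, iso_emb):
        T = content.size(1)
        cond = torch.cat([voice_emb, iso_emb], dim=-1)       # (B, d_voice + d_iso)
        cond = cond.unsqueeze(1).expand(-1, T, -1)           # broadcast over time
        return self.proj(torch.cat([content, cond], dim=-1)) # (B, T, d_out)

# Stub encoders; real ones would be trained to separate these two factors.
voice_encoder = nn.Sequential(nn.Linear(80, 128), nn.ReLU())
iso_encoder = nn.Linear(80, 16)

audio = torch.randn(2, 100, 80)                     # (batch, frames, mel bins)
voice_emb = voice_encoder(audio).mean(dim=1)        # utterance-level voice embedding
iso_emb = iso_encoder(audio).mean(dim=1)            # utterance-level timing cue
decoder = DualConditionDecoder()
frames = decoder(torch.randn(2, 120, 256), voice_emb, iso_emb)
print(frames.shape)                                 # torch.Size([2, 120, 80])
```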
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
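A minimal sketch of the composite pattern described above: a pretrained speech encoder bridged into a pretrained language model by a small adapter. The stub modules and dimensions below are hypothetical placeholders, not ComSL's actual components.

```python
import torch
import torch.nn as nn

class CompositeS2T(nn.Module):
    """Glue a speech-only encoder to a language-only model via an adapter."""
    def __init__(self, speech_encoder: nn.Module, language_model: nn.Module,
                 d_speech: int = 512, d_text: int = 768):
        super().__init__()
        self.speech_encoder = speech_encoder        # would be pretrained
        self.adapter = nn.Linear(d_speech, d_text)  # trainable bridge between spaces
        self.language_model = language_model        # would be pretrained

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        h = self.speech_encoder(audio_feats)        # (B, T, d_speech)
        h = self.adapter(h)                         # (B, T, d_text)
        return self.language_model(h)               # (B, T, vocab) text logits

# Stub components; a real system would load pretrained checkpoints here.
speech_encoder = nn.Sequential(nn.Linear(80, 512), nn.ReLU())
language_model = nn.Linear(768, 32000)
model = CompositeS2T(speech_encoder, language_model)
print(model(torch.randn(2, 100, 80)).shape)         # torch.Size([2, 100, 32000])
```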
- Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from data scarcity because corpora pairing source-language speech with target-language speech are very rare.
This paper proposes Speech2S, a model jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks.
arXiv Detail & Related papers (2022-10-31T02:55:51Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
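TranSpeech's "repeatedly masks and predicts unit choices" step is in the spirit of iterative mask-predict decoding. The sketch below is a generic mask-predict loop over discrete units, not TranSpeech's code; `toy_predict` is a hypothetical stand-in for a real unit-prediction model.

```python
import numpy as np

MASK = -1

def toy_predict(units: np.ndarray):
    """Hypothetical stand-in: returns (predicted tokens, confidences) per slot."""
    rng = np.random.default_rng(0)                 # deterministic toy predictions
    return rng.integers(0, 1000, units.shape), rng.random(units.shape)

def mask_predict(predict_fn, T: int, iterations: int = 4) -> np.ndarray:
    units = np.full(T, MASK)                       # start with everything masked
    conf = np.zeros(T)
    for it in range(iterations):
        tokens, probs = predict_fn(units)          # parallel prediction, all slots
        masked = units == MASK
        units[masked] = tokens[masked]             # fill the masked slots
        conf[masked] = probs[masked]
        n_mask = T * (iterations - it - 1) // iterations   # linearly decay masking
        if n_mask == 0:
            break
        worst = np.argsort(conf)[:n_mask]          # re-mask least confident units
        units[worst] = MASK
        conf[worst] = 0.0
    return units

print(mask_predict(toy_predict, T=8))
```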