SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation
- URL: http://arxiv.org/abs/2509.01200v2
- Date: Wed, 29 Oct 2025 17:02:41 GMT
- Title: SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation
- Authors: Chenyang Le, Bing Han, Jinshun Li, Songyong Chen, Yanmin Qian
- Abstract summary: SimulST enables real-time cross-lingual communication by jointly optimizing speech recognition and machine translation under strict latency constraints. We present SimulMEGA, an unsupervised policy learning framework that combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions.
- Score: 41.64909735021069
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Simultaneous Speech Translation (SimulST) enables real-time cross-lingual communication by jointly optimizing speech recognition and machine translation under strict latency constraints. Existing systems struggle to balance translation quality, latency, and semantic coherence, particularly in multilingual many-to-many scenarios where divergent read and write policies hinder unified strategy learning. In this paper, we present SimulMEGA (Simultaneous Generation by Mixture-of-Experts Gating), an unsupervised policy learning framework that combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions in an implicit manner, without adding inference-time overhead. Our design requires only minimal modifications to standard transformer architectures and generalizes across both speech-to-text and text-to-speech streaming tasks. Through comprehensive evaluation on six language pairs, our 500M parameter speech-to-text model outperforms the Seamless baseline, achieving under 7 percent BLEU degradation at 1.5 seconds average lag and under 3 percent at 3 seconds. We further demonstrate the versatility of SimulMEGA by extending it to streaming TTS with a unidirectional backbone, yielding superior latency quality tradeoffs.
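The abstract describes an MoE router whose gating outputs implicitly drive read/write decisions, but gives no architectural details. As a loose, hypothetical illustration of the general idea (a gate over experts feeding a scalar read/write policy head), here is a minimal NumPy sketch; all names, shapes, and the thresholding rule are assumptions, not SimulMEGA's actual method:

```python
import numpy as np

rng = np.random.default_rng(0)

D, E = 16, 4                      # hidden size, number of experts (toy values)
W_gate = rng.normal(size=(D, E))  # router / gating weights
w_policy = rng.normal(size=E)     # maps the expert mixture to a WRITE score

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def route(h, threshold=0.5):
    """Toy router: gate over experts, then a scalar WRITE probability.

    Returns (gates, action): action is "WRITE" if the policy head's
    sigmoid output exceeds `threshold`, else "READ" (wait for more input).
    """
    gates = softmax(h @ W_gate)                          # mixture weights over experts
    write_prob = 1.0 / (1.0 + np.exp(-(gates @ w_policy)))  # sigmoid policy head
    return gates, ("WRITE" if write_prob > threshold else "READ")

h = rng.normal(size=D)  # hidden state computed from a partial speech prefix
gates, action = route(h)
```

In a real simultaneous system this decision would be made at every step over the streaming encoder states; the sketch only shows how a single gating pass can yield both expert weights and a policy signal without a separate policy network at inference time.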
Related papers
- Simultaneous Speech-to-Speech Translation Without Aligned Data [52.467808474293605]
Simultaneous speech translation requires translating source speech into a target language in real-time. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks.
arXiv Detail & Related papers (2026-02-11T17:41:01Z) - Redefining Machine Simultaneous Interpretation: From Incremental Translation to Human-Like Strategies [6.010207559477024]
Simultaneous Machine Translation (SiMT) requires high-quality translations under strict real-time constraints. We extend the action space of SiMT with four adaptive actions: Sentence_Cut, Drop, Partial_Summarization and Pronominalization. We adapt these actions in a large language model (LLM) framework and construct training references through action-aware prompting.
arXiv Detail & Related papers (2026-01-16T05:26:16Z) - Direct Simultaneous Translation Activation for Large Audio-Language Models [58.03785696031301]
Simultaneous speech-to-text translation (Simul-S2TT) aims to translate speech into target text in real time. We introduce Simultaneous Self-Augmentation (SimulSA), a strategy that utilizes LALMs' inherent capabilities to obtain simultaneous data.
arXiv Detail & Related papers (2025-09-19T07:12:18Z) - Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice [52.747242157396315]
Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry. We introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities.
arXiv Detail & Related papers (2025-07-23T14:07:41Z) - MLLP-VRAIN UPV system for the IWSLT 2025 Simultaneous Speech Translation task [7.247809853198223]
This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2025 Simultaneous Speech Translation track. Our submission addresses the unique challenges of real-time translation of long-form speech by developing a modular cascade system.
arXiv Detail & Related papers (2025-06-23T16:44:01Z) - Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture [14.056534007451763]
Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input. Existing LLM-based SimulST approaches incur significant computational overhead due to repeated encoding by the bidirectional speech encoder. We introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST) with a fully unidirectional architecture.
arXiv Detail & Related papers (2025-04-16T06:46:15Z) - A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z) - Shiftable Context: Addressing Training-Inference Context Mismatch in Simultaneous Speech Translation [0.17188280334580192]
Transformer models using segment-based processing have been an effective architecture for simultaneous speech translation.
We propose Shiftable Context to ensure consistent segment and context sizes are maintained throughout training and inference.
arXiv Detail & Related papers (2023-07-03T22:11:51Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Decision Attentive Regularization to Improve Simultaneous Speech Translation Systems [12.152208198444182]
Simultaneous Speech-to-text Translation (SimulST) systems translate source speech in tandem with the speaker using partial input.
Recent works have tried to leverage the text translation task to improve the performance of Speech Translation (ST) in the offline domain.
Motivated by these improvements, we propose to add Decision Attentive Regularization (DAR) to Monotonic Multihead Attention (MMA) based SimulST systems.
arXiv Detail & Related papers (2021-10-13T08:33:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.