SimulSense: Sense-Driven Interpreting for Efficient Simultaneous Speech Translation
- URL: http://arxiv.org/abs/2509.21932v1
- Date: Fri, 26 Sep 2025 06:18:10 GMT
- Title: SimulSense: Sense-Driven Interpreting for Efficient Simultaneous Speech Translation
- Authors: Haotian Tan, Hiroki Ouchi, Sakriani Sakti
- Abstract summary: Current state-of-the-art systems formulate SimulST as a multi-turn dialogue task. We propose SimulSense, a novel framework for SimulST that mimics human interpreters by continuously reading input speech. Our proposed method achieves a superior quality-latency tradeoff and substantially improved real-time efficiency.
- Score: 18.064708420260228
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How to make human-interpreter-like read/write decisions for simultaneous speech translation (SimulST) systems? Current state-of-the-art systems formulate SimulST as a multi-turn dialogue task, requiring specialized interleaved training data and relying on computationally expensive large language model (LLM) inference for decision-making. In this paper, we propose SimulSense, a novel framework for SimulST that mimics human interpreters by continuously reading input speech and triggering write decisions to produce translation when a new sense unit is perceived. Experiments against two state-of-the-art baseline systems demonstrate that our proposed method achieves a superior quality-latency tradeoff and substantially improved real-time efficiency, where its decision-making is up to 9.6x faster than the baselines.
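The read/write behaviour described in the abstract can be pictured with a minimal Python sketch. It is not the authors' implementation: the function `simulate_policy`, the boundary test, and the "translation" step are all hypothetical stand-ins, and text tokens stand in for speech chunks. The sketch only shows the control flow of reading continuously and writing once a sense unit is judged complete.

```python
from typing import Callable, Iterable, List, Tuple


def simulate_policy(
    chunks: Iterable[str],
    is_sense_boundary: Callable[[List[str]], bool],
    translate: Callable[[List[str]], str],
) -> List[Tuple[str, str]]:
    """Return a log of ("READ", chunk) and ("WRITE", output) actions."""
    log: List[Tuple[str, str]] = []
    pending: List[str] = []  # input read since the last WRITE
    for chunk in chunks:
        pending.append(chunk)
        log.append(("READ", chunk))
        # Hypothetical detector: decides whether the buffered input
        # now forms a complete sense unit.
        if is_sense_boundary(pending):
            log.append(("WRITE", translate(pending)))
            pending = []
    if pending:  # flush whatever remains when the input ends
        log.append(("WRITE", translate(pending)))
    return log


# Toy stand-ins (assumptions): a comma or period closes a sense unit,
# and "translation" simply upper-cases the buffered words.
def toy_boundary(pending: List[str]) -> bool:
    return pending[-1].endswith((",", "."))


def toy_translate(pending: List[str]) -> str:
    return " ".join(pending).upper()


words = "when the meeting ends, we will send the report.".split()
actions = simulate_policy(words, toy_boundary, toy_translate)
writes = [text for kind, text in actions if kind == "WRITE"]
```

In this toy setup the write decision is a cheap check on the buffered input, separate from the translation call. A real system would replace both stand-ins with learned models.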
Related papers
- Vision-Grounded Machine Interpreting: Improving the Translation Process through Visual Cues [0.0]
Vision-Grounded Interpreting (VGI) is a novel approach designed to address the limitations of unimodal machine interpreting. We present a prototype system that integrates a vision-language model to process both speech and visual input from a webcam. To evaluate the effectiveness of this approach, we constructed a hand-crafted diagnostic corpus targeting three types of ambiguity.
arXiv Detail & Related papers (2025-09-28T16:25:33Z)
- Direct Simultaneous Translation Activation for Large Audio-Language Models [58.03785696031301]
Simultaneous speech-to-text translation (Simul-S2TT) aims to translate speech into target text in real time. We introduce Simultaneous Self-Augmentation (SimulSA), a strategy that utilizes LALMs' inherent capabilities to obtain simultaneous data.
arXiv Detail & Related papers (2025-09-19T07:12:18Z)
- SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation [41.64909735021069]
SimulST enables real-time cross-lingual communication by jointly optimizing speech recognition and machine translation under strict latency constraints. We present SimulMEGA, an unsupervised policy learning framework that combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions.
arXiv Detail & Related papers (2025-09-01T07:34:50Z)
- Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture [14.056534007451763]
Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input. Existing LLM-based SimulST approaches incur significant computational overhead due to repeated encoding of bidirectional speech encoder. We introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST) with fully unidirectional architecture.
arXiv Detail & Related papers (2025-04-16T06:46:15Z)
- Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases [79.07111754406841]
This work proposes using contrastive evaluation to measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role.
Our results clearly demonstrate the value of direct translation systems over cascade translation models.
arXiv Detail & Related papers (2024-02-01T14:46:35Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming ST and SD solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- Continuous Rating as Reliable Human Evaluation of Simultaneous Speech Translation [1.3654846342364308]
We compare Continuous Rating with factual questionnaires on judges with different levels of source language knowledge.
Our results show that Continuous Rating is an easy and reliable method of SST quality assessment if the judges have at least limited knowledge of the source language.
arXiv Detail & Related papers (2022-03-04T17:41:39Z)
- Decision Attentive Regularization to Improve Simultaneous Speech Translation Systems [12.152208198444182]
Simultaneous Speech-to-text Translation (SimulST) systems translate source speech in tandem with the speaker using partial input.
Recent works have tried to leverage the text translation task to improve the performance of Speech Translation (ST) in the offline domain.
Motivated by these improvements, we propose to add Decision Attentive Regularization (DAR) to Monotonic Multihead Attention (MMA) based SimulST systems.
arXiv Detail & Related papers (2021-10-13T08:33:31Z)
- It is Not as Good as You Think! Evaluating Simultaneous Machine Translation on Interpretation Data [58.105938143865906]
We argue that SiMT systems should be trained and tested on real interpretation data.
Our results highlight a difference of up to 13.83 BLEU points when SiMT models are evaluated on translation vs interpretation data.
arXiv Detail & Related papers (2021-10-11T12:27:07Z)
- Towards the evaluation of simultaneous speech translation from a communicative perspective [0.0]
We present the results of an experiment aimed at evaluating the quality of a simultaneous speech translation engine.
We found better performance for the human interpreters in terms of intelligibility, while the machine performs slightly better in terms of informativeness.
arXiv Detail & Related papers (2021-03-15T13:09:00Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- SimulEval: An Evaluation Toolkit for Simultaneous Translation [59.02724214432792]
Simultaneous translation on both text and speech focuses on a real-time and low-latency scenario.
SimulEval is an easy-to-use and general evaluation toolkit for both simultaneous text and speech translation.
arXiv Detail & Related papers (2020-07-31T17:44:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.