Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation
- URL: http://arxiv.org/abs/2509.17349v1
- Date: Mon, 22 Sep 2025 04:21:19 GMT
- Title: Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation
- Authors: Peter Polák, Sara Papi, Luisa Bentivogli, Ondřej Bojar
- Abstract summary: Simultaneous speech-to-text translation (SimulST) systems have to balance translation quality with latency. Existing metrics often produce inconsistent or misleading results. We present the first comprehensive analysis of SimulST latency metrics across language pairs, systems, and both short- and long-form regimes.
- Score: 13.949286462892212
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Simultaneous speech-to-text translation (SimulST) systems have to balance translation quality with latency--the delay between speech input and the translated output. While quality evaluation is well established, accurate latency measurement remains a challenge. Existing metrics often produce inconsistent or misleading results, especially in the widely used short-form setting, where speech is artificially presegmented. In this paper, we present the first comprehensive analysis of SimulST latency metrics across language pairs, systems, and both short- and long-form regimes. We uncover a structural bias in current metrics related to segmentation that undermines fair and meaningful comparisons. To address this, we introduce YAAL (Yet Another Average Lagging), a refined latency metric that delivers more accurate evaluations in the short-form regime. We extend YAAL to LongYAAL for unsegmented audio and propose SoftSegmenter, a novel resegmentation tool based on word-level alignment. Our experiments show that YAAL and LongYAAL outperform popular latency metrics, while SoftSegmenter enhances alignment quality in long-form evaluation, together enabling more reliable assessments of SimulST systems.
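As background for the latency discussion, the widely used Average Lagging (AL) metric, which YAAL refines (the exact YAAL definition is given in the paper), can be sketched in Python. The function below follows the common SimulEval-style formulation for speech input; the variable names are illustrative, not taken from the paper.

```python
def average_lagging(delays_ms, src_duration_ms, ref_len):
    """Sketch of Average Lagging (AL) for speech-to-text SimulST.

    delays_ms[i]    -- source audio (ms) consumed when hypothesis token i was emitted
    src_duration_ms -- total duration of the source speech segment
    ref_len         -- number of tokens in the reference translation
    """
    # Oracle policy: emit reference tokens evenly across the source duration.
    step = src_duration_ms / ref_len
    # tau: 1-based index of the first token emitted after the full source was read
    # (or the hypothesis length if no token waits for the whole source).
    tau = len(delays_ms)
    for i, d in enumerate(delays_ms, start=1):
        if d >= src_duration_ms:
            tau = i
            break
    # Average lag of the first tau tokens behind the oracle policy.
    lags = [delays_ms[i] - i * step for i in range(tau)]
    return sum(lags) / tau

# A system that emits 3 tokens at 1.0s, 1.5s, 2.0s of a 3s segment
# lags the even-emission oracle by 500 ms on average:
al = average_lagging([1000, 1500, 2000], 3000, ref_len=3)  # 500.0
```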
Related papers
- Simultaneous Speech-to-Speech Translation Without Aligned Data [52.467808474293605]
Simultaneous speech translation requires translating source speech into a target language in real-time. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks.
arXiv Detail & Related papers (2026-02-11T17:41:01Z) - Redefining Machine Simultaneous Interpretation: From Incremental Translation to Human-Like Strategies [6.010207559477024]
Simultaneous Machine Translation (SiMT) requires high-quality translations under strict real-time constraints. We extend the action space of SiMT with four adaptive actions: Sentence_Cut, Drop, Partial_Summarization and Pronominalization. We adapt these actions in a large language model (LLM) framework and construct training references through action-aware prompting.
arXiv Detail & Related papers (2026-01-16T05:26:16Z) - Redefining Machine Simultaneous Interpretation: From Incremental Translation to Human-Like Strategies [4.487634497356904]
Simultaneous Machine Translation (SiMT) requires high-quality translations under strict real-time constraints. We extend the action space of SiMT with four adaptive actions: SENTENCE_CUT, DROP, PARTIAL_SUMMARIZATION and PRONOMINALIZATION. We implement these actions in a decoder-only large language model (LLM) framework and construct training references through action-aware prompting.
arXiv Detail & Related papers (2025-09-26T02:57:36Z) - CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation [17.473263201972483]
Simultaneous speech translation (SimulST) systems must balance translation quality with response time.
There has been a longstanding belief that current metrics yield unrealistically high latency measurements in unsegmented streaming settings.
arXiv Detail & Related papers (2024-10-21T13:42:19Z) - Average Token Delay: A Duration-aware Latency Metric for Simultaneous Translation [16.954965417930254]
We propose a novel latency evaluation metric for simultaneous translation called Average Token Delay (ATD).
We demonstrate its effectiveness through analyses simulating user-side latency based on Ear-Voice Span (EVS).
arXiv Detail & Related papers (2023-11-24T08:53:52Z) - DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming speech translation (ST) and speaker diarization (SD) solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z) - End-to-End Evaluation for Low-Latency Simultaneous Speech Translation [55.525125193856084]
We propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. This includes the segmentation of the audio as well as the run-time of the different components. We also compare different approaches to low-latency speech translation using this framework.
arXiv Detail & Related papers (2023-08-07T09:06:20Z) - Average Token Delay: A Latency Metric for Simultaneous Translation [21.142539715996673]
We propose a novel latency evaluation metric called Average Token Delay (ATD).
We discuss the advantage of ATD using simulated examples and also investigate the differences between ATD and Average Lagging with simultaneous translation experiments.
arXiv Detail & Related papers (2022-11-22T06:45:13Z) - SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z) - Over-Generation Cannot Be Rewarded: Length-Adaptive Average Lagging for Simultaneous Speech Translation [17.305879157385675]
Simultaneous speech translation (SimulST) systems aim at generating their output with the lowest possible latency.
Average Lagging (AL) provides underestimated scores for systems that generate longer predictions compared to the corresponding references.
We show that this problem has practical relevance, as recent SimulST systems have indeed a tendency to over-generate.
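Length-Adaptive Average Lagging (LAAL) addresses this over-generation bias with what amounts to a one-line change to AL: the oracle emission rate is computed over the longer of the reference and hypothesis lengths, so emitting extra tokens no longer drives the score down. A minimal, self-contained illustration (the function and variable names are ours, not the paper's):

```python
def lagging(delays_ms, src_duration_ms, ref_len, length_adaptive=False):
    """Sketch of AL and Length-Adaptive AL (LAAL) for SimulST.

    delays_ms[i] is the source audio (ms) consumed when hypothesis token i
    was emitted. With length_adaptive=True, the oracle rate uses the longer
    of the reference and hypothesis lengths (the LAAL modification).
    """
    hyp_len = len(delays_ms)
    denom = max(ref_len, hyp_len) if length_adaptive else ref_len
    step = src_duration_ms / denom  # oracle: even emission over the source
    # tau: 1-based index of the first token emitted after the whole source
    # was consumed (or the hypothesis length if none waited that long).
    tau = next((i + 1 for i, d in enumerate(delays_ms) if d >= src_duration_ms),
               hyp_len)
    return sum(delays_ms[i] - i * step for i in range(tau)) / tau

# An over-generating system: 4 tokens against a 2-token reference.
delays = [500, 1000, 1500, 3500]
al = lagging(delays, 3000, ref_len=2)                          # negative: over-generation rewarded
laal = lagging(delays, 3000, ref_len=2, length_adaptive=True)  # positive: lag measured fairly
```

With the plain AL oracle (one reference token every 1500 ms), the extra hypothesis tokens fall "ahead" of the oracle and the average lag goes negative; LAAL's adjusted rate (one token every 750 ms) removes that artifact.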
arXiv Detail & Related papers (2022-06-12T18:00:08Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - SimulEval: An Evaluation Toolkit for Simultaneous Translation [59.02724214432792]
Simultaneous translation, for both text and speech, targets real-time, low-latency scenarios.
SimulEval is an easy-to-use and general evaluation toolkit for both simultaneous text and speech translation.
arXiv Detail & Related papers (2020-07-31T17:44:41Z)