Towards the evaluation of simultaneous speech translation from a
communicative perspective
- URL: http://arxiv.org/abs/2103.08364v1
- Date: Mon, 15 Mar 2021 13:09:00 GMT
- Title: Towards the evaluation of simultaneous speech translation from a
communicative perspective
- Authors: Claudio Fantinuoli, Bianca Prandi
- Abstract summary: We present the results of an experiment aimed at evaluating the quality of a simultaneous speech translation engine.
We found better performance for the human interpreters in terms of intelligibility, while the machine performed slightly better in terms of informativeness.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, machine speech-to-speech and speech-to-text
translation has gained momentum thanks to advances in artificial intelligence,
especially in the domains of speech recognition and machine translation. The
quality of such applications is commonly tested with automatic metrics, such as
BLEU, primarily with the goal of assessing improvements between releases or in
the context of evaluation campaigns. However, little is known about how such
systems compare to human performance on similar communicative tasks, or how
their output is perceived by end users.
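Automatic metrics such as BLEU score a system translation against one or more reference translations via n-gram overlap. Below is a minimal, illustrative sketch of sentence-level BLEU in pure Python (single reference, crude smoothing) — a simplified teaching version, not the official sacreBLEU implementation:

```python
from collections import Counter
import math

def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        # Clipped overlap: each hypothesis n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        # Tiny floor avoids log(0) when no higher-order n-grams match.
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(round(sentence_bleu("the cat is on the mat", "the cat is on the mat"), 2))  # → 1.0
```

A perfect match scores 1.0; a hypothesis sharing no n-grams with the reference scores near 0. Production evaluations use corpus-level BLEU with standardized tokenization and smoothing, which is why scores from different toolkits are not directly comparable.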
In this paper, we present the results of an experiment aimed at evaluating
the quality of a simultaneous speech translation engine by comparing it to the
performance of professional interpreters. To do so, we select a framework
developed for the assessment of human interpreters and use it to perform a
manual evaluation of both human and machine performance. In our sample, the
human interpreters performed better in terms of intelligibility, while the
machine performed slightly better in terms of informativeness. The limitations
of the study and possible enhancements of
the chosen framework are discussed. Despite its intrinsic limitations, the use
of this framework represents a first step towards a user-centric and
communication-oriented methodology for evaluating simultaneous speech
translation.
Related papers
- Exploring the Correlation between Human and Machine Evaluation of Simultaneous Speech Translation [0.9576327614980397]
This study aims to assess the reliability of automatic metrics in evaluating simultaneous interpretations by analyzing their correlation with human evaluations.
As a benchmark we use human assessments performed by language experts, and evaluate how well sentence embeddings and Large Language Models correlate with them.
The results suggest that GPT models, particularly GPT-3.5 with direct prompting, demonstrate the strongest correlation with human judgment in terms of semantic similarity between source and target texts.
arXiv Detail & Related papers (2024-06-14T14:47:19Z)
- Is Context Helpful for Chat Translation Evaluation? [23.440392979857247]
We conduct a meta-evaluation of existing sentence-level automatic metrics to assess the quality of machine-translated chats.
We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings.
We propose a new evaluation metric, Context-MQM, that utilizes bilingual context with a large language model.
arXiv Detail & Related papers (2024-03-13T07:49:50Z)
- BiVert: Bidirectional Vocabulary Evaluation using Relations for Machine Translation [4.651581292181871]
We propose a bidirectional semantic-based evaluation method designed to assess the sense distance of the translation from the source text.
This approach employs the comprehensive multilingual encyclopedic dictionary BabelNet.
Factual analysis shows a strong correlation between the average evaluation scores generated by our method and the human assessments across various machine translation systems for the English-German language pair.
arXiv Detail & Related papers (2024-03-06T08:02:21Z)
- End-to-End Evaluation for Low-Latency Simultaneous Speech Translation [55.525125193856084]
We propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions.
This includes the segmentation of the audio as well as the run-time of the different components.
We also compare different approaches to low-latency speech translation using this framework.
arXiv Detail & Related papers (2023-08-07T09:06:20Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations, on advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers [0.05857406612420462]
Large-scale, pre-trained language models achieve human-level and superhuman accuracy on existing language understanding tasks.
We propose evaluating systems through a novel measure of prediction coherence.
arXiv Detail & Related papers (2021-09-10T15:04:23Z)
- Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks.
Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension.
This paper further utilizes unlabeled data to improve performance.
arXiv Detail & Related papers (2021-05-08T08:04:30Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
- A Set of Recommendations for Assessing Human-Machine Parity in Language Translation [87.72302201375847]
We reassess Hassan et al.'s investigation into Chinese-to-English news translation.
We show that the professional human translations contained significantly fewer errors.
arXiv Detail & Related papers (2020-04-03T17:49:56Z)
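Several of the papers above assess automatic metrics by how well their scores correlate with human judgments. A minimal sketch of such a check, computing a Pearson coefficient over entirely hypothetical per-segment scores:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-segment scores: an automatic metric (0-1) vs.
# human adequacy ratings (1-5 scale). Not real experimental data.
metric_scores = [0.62, 0.48, 0.81, 0.35, 0.70]
human_scores  = [3.8, 3.1, 4.5, 2.6, 4.0]
print(round(pearson(metric_scores, human_scores), 3))  # high positive correlation
```

Published meta-evaluations typically report Pearson, Spearman, or Kendall correlations at both segment and system level, since a metric can rank whole systems well while correlating poorly on individual segments.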
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.