Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality
- URL: http://arxiv.org/abs/2509.14023v1
- Date: Wed, 17 Sep 2025 14:27:17 GMT
- Title: Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality
- Authors: Sami Ul Haq, Sheila Castilho, Yvette Graham,
- Abstract summary: Machine Translation (MT) has achieved remarkable performance, with growing interest in speech translation and multimodal approaches.<n>This study compares text-only and audio-based evaluations of 10 MT systems from the WMT General MT Shared Task.<n>Crowd-sourced assessments based on audio yield rankings largely consistent with text only evaluations but, in some cases, identify significant differences between translation systems.
- Score: 3.7842130830138943
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine Translation (MT) has achieved remarkable performance, with growing interest in speech translation and multimodal approaches. However, despite these advancements, MT quality assessment remains largely text centric, typically relying on human experts who read and compare texts. Since many real-world MT applications (e.g Google Translate Voice Mode, iFLYTEK Translator) involve translation being spoken rather printed or read, a more natural way to assess translation quality would be through speech as opposed text-only evaluations. This study compares text-only and audio-based evaluations of 10 MT systems from the WMT General MT Shared Task, using crowd-sourced judgments collected via Amazon Mechanical Turk. We additionally, performed statistical significance testing and self-replication experiments to test reliability and consistency of audio-based approach. Crowd-sourced assessments based on audio yield rankings largely consistent with text only evaluations but, in some cases, identify significant differences between translation systems. We attribute this to speech richer, more natural modality and propose incorporating speech-based assessments into future MT evaluation frameworks.
Related papers
- Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics [69.2321983942375]
This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings.<n>We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (textitmatra) variations across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi.<n>While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.
arXiv Detail & Related papers (2026-02-19T14:56:42Z) - Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation [57.11989521509119]
We propose a novel agentic translation evaluation framework, centered by a reflective Core Agent that invokes specialized sub-agents.<n> Experimental results indicate the efficacy of RATE, achieving an improvement of at least 3.2 meta score compared with current metrics.
arXiv Detail & Related papers (2026-01-12T09:03:42Z) - How to Evaluate Speech Translation with Source-Aware Neural MT Metrics [32.41110835446445]
In machine translation, neural metrics incorporating the source text achieve stronger correlation with human judgments.<n>In this work, we conduct the first systematic study of source-aware metrics for speech-to-text translation.<n>We introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations.
arXiv Detail & Related papers (2025-11-05T08:49:22Z) - Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark [11.068031181100276]
We study data in four languages (Asante Twi, Japanese, Jinghpaw, and South Azerbaijani)<n>We uncover critical shortcomings in the benchmark's suitability for truly multilingual evaluation.<n>We advocate for multilingual MT benchmarks that use domain-general and culturally neutral source texts.
arXiv Detail & Related papers (2025-08-28T07:52:42Z) - Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering [68.3400058037817]
We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality.<n>We show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations.
arXiv Detail & Related papers (2025-04-10T09:24:54Z) - BiVert: Bidirectional Vocabulary Evaluation using Relations for Machine
Translation [4.651581292181871]
We propose a bidirectional semantic-based evaluation method designed to assess the sense distance of the translation from the source text.
This approach employs the comprehensive multilingual encyclopedic dictionary BabelNet.
Factual analysis shows a strong correlation between the average evaluation scores generated by our method and the human assessments across various machine translation systems for English-German language pair.
arXiv Detail & Related papers (2024-03-06T08:02:21Z) - End-to-End Evaluation for Low-Latency Simultaneous Speech Translation [55.525125193856084]
We propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions.<n>This includes the segmentation of the audio as well as the run-time of the different components.<n>We also compare different approaches to low-latency speech translation using this framework.
arXiv Detail & Related papers (2023-08-07T09:06:20Z) - Discourse Centric Evaluation of Machine Translation with a Densely
Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z) - HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using
Professional Post-Editing Towards More Effective MT Evaluation [0.0]
In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output.
It contains only a limited number of commonly occurring error types, and use a scoring model with geometric progression of error penalty points (EPPs) reflecting error severity level to each translation unit.
The approach has several key advantages, such as ability to measure and compare less than perfect MT output from different systems, ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, low-cost and faster application, as well as higher IRR.
arXiv Detail & Related papers (2021-12-27T18:47:43Z) - Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work carries out motivated research to correctly estimate the confidence intervals citeBrown_etal2001Interval depending on the sample size of the translated text.
The methodology we applied for this work is from Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA)
arXiv Detail & Related papers (2021-11-15T12:09:08Z) - Can Your Context-Aware MT System Pass the DiP Benchmark Tests? :
Evaluation Benchmarks for Discourse Phenomena in Machine Translation [7.993547048820065]
We introduce the first of their kind MT benchmark datasets that aim to track and hail improvements across four main discourse phenomena.
Surprisingly, we find that existing context-aware models do not improve discourse-related translations consistently across languages and phenomena.
arXiv Detail & Related papers (2020-04-30T07:15:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.