Translation Canvas: An Explainable Interface to Pinpoint and Analyze Translation Systems
- URL: http://arxiv.org/abs/2410.10861v3
- Date: Sun, 20 Oct 2024 22:10:57 GMT
- Title: Translation Canvas: An Explainable Interface to Pinpoint and Analyze Translation Systems
- Authors: Chinmay Dandekar, Wenda Xu, Xi Xu, Siqi Ouyang, Lei Li
- Abstract summary: We introduce Translation Canvas, an explainable interface designed to pinpoint and analyze translation systems' performance.
It supports fine-grained analysis by highlighting error spans with explanations and selectively displaying systems' predictions.
According to human evaluation, Translation Canvas outperforms the COMET and SacreBLEU packages on enjoyability and understandability criteria.
- Score: 16.102196839755823
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid advancement of machine translation research, evaluation toolkits have become essential for benchmarking system progress. Tools like COMET and SacreBLEU offer single quality score assessments that are effective for pairwise system comparisons. However, these tools provide limited insights for fine-grained system-level comparisons and the analysis of instance-level defects. To address these limitations, we introduce Translation Canvas, an explainable interface designed to pinpoint and analyze translation systems' performance: 1) Translation Canvas assists machine translation researchers in comprehending system-level model performance by identifying common errors (their frequency and severity) and analyzing relationships between different systems based on various evaluation metrics. 2) It supports fine-grained analysis by highlighting error spans with explanations and selectively displaying systems' predictions. According to human evaluation, Translation Canvas demonstrates superior performance over COMET and SacreBLEU packages under enjoyability and understandability criteria.
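For illustration, the single-score workflow that the abstract contrasts with can be reproduced with the sacrebleu package alone; the hypotheses and references below are made-up examples, not data from the paper.

```python
# Minimal sketch of the single-score comparison that SacreBLEU supports;
# the outputs and references here are invented examples.
import sacrebleu

refs = [["The cat sat on the mat.", "He went to the market yesterday."]]
sys_a = ["The cat sat on the mat.", "He goes to the market yesterday."]
sys_b = ["A cat is sitting on a mat.", "Yesterday he went to the market."]

for name, hyps in [("system_a", sys_a), ("system_b", sys_b)]:
    bleu = sacrebleu.corpus_bleu(hyps, refs)   # corpus-level BLEU
    chrf = sacrebleu.corpus_chrf(hyps, refs)   # corpus-level chrF
    print(f"{name}: BLEU={bleu.score:.1f} chrF={chrf.score:.1f}")
```

A single corpus-level number like this supports pairwise ranking, but says nothing about which spans went wrong or why, which is the gap Translation Canvas targets.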
Related papers
- DeformAr: Rethinking NER Evaluation through Component Analysis and Visual Analytics [0.0]
We introduce DeformAr, a framework to investigate and explain the performance discrepancy between Arabic and English NER systems. DeformAr is the first Arabic-specific, component-based interpretability tool, offering a crucial resource for advancing model analysis in under-resourced languages.
arXiv Detail & Related papers (2025-11-30T15:39:28Z)
- Estimating Machine Translation Difficulty [48.659971048116766]
We formalize the task of translation difficulty estimation, defining a text's difficulty based on the expected quality of its translations. We demonstrate the practical utility of difficulty estimators by using them to construct more challenging benchmarks for machine translation. We release two improved models for difficulty estimation, Sentinel-src-24 and Sentinel-src-25.
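Read literally, that definition invites a one-line estimator: average the quality of a text's translations across systems and invert it. A toy sketch under that reading (the per-system scores are invented, and the actual Sentinel models are learned regressors rather than this average):

```python
# Toy sketch: difficulty as inverted expected translation quality,
# following the paper's definition. The per-system scores are invented.
def difficulty(quality_scores):
    """Higher when systems translate the text poorly on average."""
    return 1.0 - sum(quality_scores) / len(quality_scores)

# Hypothetical segment-level quality scores (0-1) from three MT systems.
print(difficulty([0.92, 0.88, 0.95]))  # easy sentence -> low difficulty
print(difficulty([0.41, 0.35, 0.52]))  # hard sentence -> high difficulty
```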
arXiv Detail & Related papers (2025-08-13T20:22:58Z)
- MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation [1.7775825387442485]
MT-LENS is a framework designed to evaluate Machine Translation (MT) systems across a variety of tasks.
It offers a user-friendly platform to compare systems and analyze translations with interactive visualizations.
arXiv Detail & Related papers (2024-12-16T09:57:28Z)
- A Context-aware Framework for Translation-mediated Conversations [29.169155271343083]
We present a framework to improve large language model-based translation systems by incorporating contextual information in bilingual conversational settings during training and inference. We validate our proposed framework on two task-oriented domains: customer chat and user-assistant interaction. Across both settings, the system produced by our framework, TowerChat, consistently yields better translations than state-of-the-art systems such as GPT-4o and TowerInstruct.
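The summary does not specify the input format, so the following is only a hypothetical sketch of how conversational context might be folded into a translation request; the prompt template and speaker labels are invented, not the paper's actual format.

```python
# Hypothetical sketch of context-augmented translation input; the prompt
# template is invented, not the paper's actual format.
def build_contextual_input(history, source, src_lang="en", tgt_lang="de"):
    """Prepend prior conversation turns so the model can resolve
    pronouns, register, and other context-dependent choices."""
    context = "\n".join(f"{speaker}: {turn}" for speaker, turn in history)
    return (f"Conversation so far:\n{context}\n\n"
            f"Translate the next {src_lang} turn into {tgt_lang}: {source}")

history = [("customer", "My order arrived damaged."),
           ("agent", "I'm sorry to hear that. Could you send a photo?")]
print(build_contextual_input(history, "I just uploaded it."))
```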
arXiv Detail & Related papers (2024-12-05T14:41:05Z)
- xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection [21.116517555282314]
xCOMET is an open-source learned metric designed to bridge the gap between machine translation evaluation approaches.
It integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation.
We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.
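The exact output schema of xCOMET's span predictions is not given here, so the span dictionaries below are an assumption used purely to illustrate span-level highlighting of the kind xCOMET and Translation Canvas expose.

```python
# Illustrative only: marks error spans in a hypothesis. The span dicts
# mimic (but are not guaranteed to match) xCOMET-style output.
def highlight_spans(text, spans):
    """Wrap each flagged span in [SEVERITY ...] markers, right to left
    so earlier character offsets stay valid."""
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        start, end, sev = span["start"], span["end"], span["severity"]
        text = f"{text[:start]}[{sev.upper()} {text[start:end]}]{text[end:]}"
    return text

hyp = "The contract was signed on March 15th by both parts."
spans = [{"start": 46, "end": 51, "severity": "major"}]  # "parts" mistranslation
print(highlight_spans(hyp, spans))
```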
arXiv Detail & Related papers (2023-10-16T15:03:14Z)
- End-to-End Evaluation for Low-Latency Simultaneous Speech Translation [55.525125193856084]
We propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions.
This includes the segmentation of the audio as well as the run-time of the different components.
We also compare different approaches to low-latency speech translation using this framework.
arXiv Detail & Related papers (2023-08-07T09:06:20Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
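One way to probe for this defect is to score a single fixed candidate against many unrelated references: a sound metric should score it low somewhere. The sketch below is a hypothetical probe, and `toy_metric` is a deliberately crude stand-in for BLEURT or BARTScore, which are not loaded here.

```python
# Probe sketch: a "universal translation" is one candidate that a learned
# metric scores highly against unrelated references. The real probe would
# call BLEURT/BARTScore; toy_metric is an invented stand-in.
def looks_universal(candidate, references, metric_score, threshold=0.5):
    scores = [metric_score(candidate, ref) for ref in references]
    return min(scores) >= threshold, scores

def toy_metric(cand, ref):
    """Crude unigram-overlap ratio, standing in for a learned metric."""
    c, r = set(cand.lower().split()), set(ref.lower().split())
    return len(c & r) / max(len(r), 1)

refs = ["The weather is nice today.", "Please close the door."]
print(looks_universal("the the the", refs, toy_metric))
```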
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Quality Estimation of Machine Translated Texts based on Direct Evidence from Training Data [0.0]
We show that the parallel corpus used as training data for training the MT system holds direct clues for estimating the quality of translations produced by the MT system.
Our experiments show that this simple and direct method holds promise for quality estimation of translations produced by any purely data driven machine translation system.
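The summary leaves the mechanism open; one simple reading, retrieving training pairs whose source side resembles the input and checking the MT output against their target sides, can be sketched as follows (the overlap measure and the tiny corpus are invented, not the paper's method):

```python
# Hypothetical sketch of QE from direct training-data evidence; the paper's
# actual method may differ. Word overlap is a stand-in similarity measure.
def overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def estimate_quality(src, mt_out, parallel_corpus, k=3):
    """Score mt_out by its agreement with the target sides of the k
    training pairs whose source side is most similar to src."""
    neighbors = sorted(parallel_corpus, key=lambda p: overlap(src, p[0]),
                       reverse=True)[:k]
    return sum(overlap(mt_out, tgt) for _, tgt in neighbors) / k

corpus = [("good morning", "guten Morgen"), ("good night", "gute Nacht"),
          ("see you tomorrow", "bis morgen")]
print(estimate_quality("good morning everyone", "guten Morgen zusammen", corpus))
```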
arXiv Detail & Related papers (2023-06-27T11:52:28Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
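The core of that experiment reduces to correlating per-segment metric scores with per-segment downstream outcomes; a minimal sketch with scipy, using invented numbers:

```python
# Minimal sketch of the segment-level analysis: correlate per-segment
# metric scores with downstream task outcomes. All numbers are invented.
from scipy.stats import pearsonr

metric_scores = [0.81, 0.42, 0.77, 0.35, 0.90, 0.55]  # e.g., COMET per segment
task_success = [1, 0, 0, 1, 1, 0]                     # downstream outcome per segment

r, p_value = pearsonr(metric_scores, task_success)
print(f"Pearson r={r:.2f} (p={p_value:.2f})")
```

A near-zero correlation in this setup is the pattern the paper reports for the metrics it tests.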
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain the de facto standard for evaluating tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
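A standard instance of such thresholding is paired bootstrap resampling over segment scores; the sketch below shows that general technique, not the paper's exact procedure, and all scores are invented.

```python
# Paired bootstrap sketch for deciding whether system B's metric gain over
# system A survives resampling noise; not the paper's exact method.
import random

def bootstrap_win_rate(scores_a, scores_b, n_resamples=1000, seed=0):
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample segments
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return wins / n_resamples

a = [0.61, 0.72, 0.55, 0.68, 0.70, 0.59]  # invented segment scores, system A
b = [0.64, 0.71, 0.60, 0.69, 0.74, 0.62]  # invented segment scores, system B
print(bootstrap_win_rate(a, b))  # near 1.0 -> improvement survives resampling
```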
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where source texts are compared directly to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
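Reference-free metrics of this kind reduce to comparing an embedding of the source with an embedding of the hypothesis. To stay self-contained, the sketch below substitutes a toy character-trigram embedding for M-BERT or LASER; a real cross-lingual encoder is what would map the two languages into a shared space.

```python
# Reference-free scoring sketch: cosine similarity between source and
# hypothesis embeddings. The character-trigram embedding is a toy
# stand-in for a cross-lingual encoder such as M-BERT or LASER.
import math
from collections import Counter

def embed(text):
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

src = "Das Wetter ist heute schön."
hyp = "The weather is nice today."
print(cosine(embed(src), embed(hyp)))  # low here; a real encoder aligns languages
```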
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
- Can Your Context-Aware MT System Pass the DiP Benchmark Tests?: Evaluation Benchmarks for Discourse Phenomena in Machine Translation [7.993547048820065]
We introduce the first-of-their-kind MT benchmark datasets that aim to track and hail improvements across four main discourse phenomena.
Surprisingly, we find that existing context-aware models do not improve discourse-related translations consistently across languages and phenomena.
arXiv Detail & Related papers (2020-04-30T07:15:36Z)