xCOMET: Transparent Machine Translation Evaluation through Fine-grained
Error Detection
- URL: http://arxiv.org/abs/2310.10482v1
- Date: Mon, 16 Oct 2023 15:03:14 GMT
- Title: xCOMET: Transparent Machine Translation Evaluation through Fine-grained
Error Detection
- Authors: Nuno M. Guerreiro, Ricardo Rei, Daan van Stigt, Luisa Coheur, Pierre
Colombo, André F.T. Martins
- Abstract summary: xCOMET is an open-source learned metric designed to bridge the gap between sentence-level machine translation evaluation and fine-grained error detection.
It integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation.
We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.
- Score: 21.116517555282314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Widely used learned metrics for machine translation evaluation, such as COMET
and BLEURT, estimate the quality of a translation hypothesis by providing a
single sentence-level score. As such, they offer little insight into
translation errors (e.g., what are the errors and what is their severity). On
the other hand, generative large language models (LLMs) are amplifying the
adoption of more granular strategies to evaluation, attempting to detail and
categorize translation errors. In this work, we introduce xCOMET, an
open-source learned metric designed to bridge the gap between these approaches.
xCOMET integrates both sentence-level evaluation and error span detection
capabilities, exhibiting state-of-the-art performance across all types of
evaluation (sentence-level, system-level, and error span detection). Moreover,
it does so while highlighting and categorizing error spans, thus enriching the
quality assessment. We also provide a robustness analysis with stress tests,
and show that xCOMET is largely capable of identifying localized critical
errors and hallucinations.
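Since xCOMET is released as an open-source metric, the sketch below shows how one might obtain both sentence-level scores and the predicted error spans. It is a minimal example assuming the unbabel-comet Python package and the publicly released Unbabel/XCOMET-XL checkpoint; the exact attribute holding the error spans is an assumption and may differ across package versions.
```python
# Minimal sketch: scoring translations with xCOMET via the open-source
# unbabel-comet package. The checkpoint name and the location of the
# predicted error spans in the output are assumptions based on the
# public release and may vary between versions.
from comet import download_model, load_from_checkpoint

# Download and load the released checkpoint (assumed name: Unbabel/XCOMET-XL).
model_path = download_model("Unbabel/XCOMET-XL")
model = load_from_checkpoint(model_path)

# Each sample needs a source, a translation hypothesis, and (optionally) a reference.
data = [
    {
        "src": "Elle a décliné l'invitation.",
        "mt": "She accepted the invitation.",
        "ref": "She declined the invitation.",
    }
]

output = model.predict(data, batch_size=8, gpus=1)  # set gpus=0 to run on CPU

print(output.scores)        # per-segment quality scores
print(output.system_score)  # corpus-level score
# Error spans with predicted severity (assumed attribute for xCOMET checkpoints):
print(output.metadata.error_spans)
```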
Related papers
- MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators [53.91199933655421]
Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment.
We introduce a universal and training-free framework, MQM-APE, to enhance the quality of error annotations predicted by LLM evaluators.
arXiv Detail & Related papers (2024-09-22T06:43:40Z)
- xTower: A Multilingual LLM for Explaining and Correcting Translation Errors [22.376508000237042]
xTower is an open large language model (LLM) built on top of TowerBase to provide free-text explanations for translation errors.
We test xTower across various experimental setups in generating translation corrections, demonstrating significant improvements in translation quality.
arXiv Detail & Related papers (2024-06-27T18:51:46Z)
- Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors [11.07539342949602]
We propose an end-to-end framework for detecting factual errors in text summarization.
Our framework uses a diverse set of LLM prompts to identify factual inconsistencies.
We calibrate the ensembled models to produce empirically accurate probabilities that a text is factually consistent or free of hallucination.
arXiv Detail & Related papers (2024-06-18T18:59:37Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- SUT: Active Defects Probing for Transcompiler Models [24.01532199512389]
We introduce new metrics for programming language translation that address basic syntax errors.
Experiments have shown that even powerful models like ChatGPT still make mistakes on these basic unit tests.
arXiv Detail & Related papers (2023-10-22T07:16:02Z)
- Towards Fine-Grained Information: Identifying the Type and Location of Translation Errors [80.22825549235556]
Existing approaches cannot simultaneously consider error position and type.
We build an FG-TED model to predict addition and omission errors.
Experiments show that our model can identify both error type and position concurrently, and gives state-of-the-art results.
arXiv Detail & Related papers (2023-02-17T16:20:33Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z)
- Detecting over/under-translation errors for determining adequacy in human translations [0.0]
We present a novel approach to detecting over and under translations (OT/UT) as part of adequacy error checks in translation evaluation.
We do not restrict ourselves to machine translation (MT) outputs and specifically target applications with a human-generated translation pipeline.
The goal of our system is to identify OT/UT errors from human translated video subtitles with high error recall.
arXiv Detail & Related papers (2021-04-01T06:06:36Z)