Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies
- URL: http://arxiv.org/abs/2401.06760v2
- Date: Mon, 10 Jun 2024 12:57:48 GMT
- Title: Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies
- Authors: Tom Kocmi, Vilém Zouhar, Christian Federmann, Matt Post
- Abstract summary: Ten years ago a single metric, BLEU, governed progress in machine translation research.
This paper investigates the "dynamic range" of modern metrics.
- Score: 24.26653413077486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ten years ago a single metric, BLEU, governed progress in machine translation research. For better or worse, there is no such consensus today, and consequently it is difficult for researchers to develop and retain the kinds of heuristic intuitions about metric deltas that drove earlier research and deployment decisions. This paper investigates the "dynamic range" of a number of modern metrics in an effort to provide a collective understanding of the meaning of differences in scores both within and among metrics; in other words, we ask: what point difference X in metric Y is required between two systems for humans to notice? We conduct our evaluation on a new large dataset, ToShip23, using it to discover deltas at which metrics achieve system-level differences that are meaningful to humans, which we measure by pairwise system accuracy. We additionally show that this method of establishing delta-accuracy is more stable than the standard use of statistical p-values with regard to test-set size. Where data size permits, we also explore the effect of metric deltas and accuracy across finer-grained features such as translation direction, domain, and system closeness.
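As a concrete illustration of the pairwise system accuracy referenced in the abstract, the sketch below shows one way such a delta-accuracy curve could be computed. It is a minimal, hypothetical implementation, not the authors' released code; the `SystemPair` container, the field names, and the `pairwise_accuracy` helper are assumptions made for this example.

```python
# Minimal sketch of delta-based pairwise system accuracy:
# among system pairs whose metric delta reaches a threshold,
# how often does the metric's ranking agree with the human ranking?
from dataclasses import dataclass
from typing import List


@dataclass
class SystemPair:
    metric_delta: float  # metric(sys_a) - metric(sys_b)
    human_delta: float   # human score(sys_a) - human score(sys_b)


def pairwise_accuracy(pairs: List[SystemPair], threshold: float = 0.0) -> float:
    """Fraction of pairs whose metric and human deltas agree in sign,
    restricted to pairs with |metric delta| >= threshold."""
    kept = [p for p in pairs if abs(p.metric_delta) >= threshold]
    if not kept:
        return float("nan")
    agree = sum(1 for p in kept if (p.metric_delta > 0) == (p.human_delta > 0))
    return agree / len(kept)


if __name__ == "__main__":
    # Toy deltas for illustration only.
    toy_pairs = [
        SystemPair(0.3, 1.2),
        SystemPair(1.5, 2.0),
        SystemPair(0.1, -0.4),   # small metric delta, humans disagree
        SystemPair(-2.0, -1.1),
    ]
    for t in (0.0, 0.5, 1.0):
        print(f"delta >= {t}: accuracy = {pairwise_accuracy(toy_pairs, t):.2f}")
```

Sweeping the threshold over a grid of metric deltas yields the kind of accuracy-versus-delta relationship the paper uses to say when a score difference becomes meaningful to humans.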
Related papers
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
arXiv Detail & Related papers (2024-07-03T17:04:17Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 types of translation accuracy errors.
We conduct a large-scale study, benchmarking 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks against ACES.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions.
Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments.
This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
arXiv Detail & Related papers (2023-09-19T01:59:54Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- Suitability of Different Metric Choices for Concept Drift Detection [9.76294323004155]
Many unsupervised approaches for drift detection rely on measuring the discrepancy between samples drawn from two time windows.
Most drift detection methods can be distinguished by which metric they use, how this metric is estimated, and how the decision threshold is found.
We compare different types of estimators and metrics theoretically and empirically and investigate the relevance of the single metric components.
arXiv Detail & Related papers (2022-02-19T01:11:32Z)
- To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation [5.972205906525993]
We investigate which metrics have the highest accuracy to make system-level quality rankings for pairs of systems.
We show that the sole use of BLEU negatively affected the past development of improved models.
arXiv Detail & Related papers (2021-07-22T17:22:22Z)
- Learning to Evaluate Perception Models Using Planner-Centric Metrics [104.33349410009161]
We propose a principled metric for 3D object detection specifically for the task of self-driving.
We find that our metric penalizes many of the mistakes that other metrics penalize by design.
For human evaluation, we generate scenes in which standard metrics and our metric disagree and find that humans side with our metric 79% of the time.
arXiv Detail & Related papers (2020-04-19T02:14:00Z)
- Reliable Fidelity and Diversity Metrics for Generative Models [30.941563781926202]
The most widely used metric for measuring the similarity between real and generated images has been the Fréchet Inception Distance (FID) score.
We show that even the latest version of the precision and recall metrics is not reliable yet.
We propose density and coverage metrics that solve the above issues.
arXiv Detail & Related papers (2020-02-23T00:50:01Z)