Investigating Data Variance in Evaluations of Automatic Machine
Translation Metrics
- URL: http://arxiv.org/abs/2203.15858v1
- Date: Tue, 29 Mar 2022 18:58:28 GMT
- Title: Investigating Data Variance in Evaluations of Automatic Machine
Translation Metrics
- Authors: Jiannan Xiang, Huayang Li, Yahui Liu, Lemao Liu, Guoping Huang, Defu
Lian, Shuming Shi
- Abstract summary: In this paper, we show that the performance of metrics is sensitive to the data.
The ranking of metrics varies when the evaluation is conducted on different datasets.
- Score: 58.50754318846996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current practices in metric evaluation focus on a single dataset, e.g.,
the Newstest dataset in each year's WMT Metrics Shared Task. However, in this
paper, we qualitatively and quantitatively show that the performance of
metrics is sensitive to the data: the ranking of metrics varies when the
evaluation is conducted on different datasets. The paper then investigates
two potential hypotheses, i.e., insignificant data points and deviation from
the independent and identically distributed (i.i.d.) assumption, which may
account for the issue of data variance. In conclusion, our findings suggest
that when evaluating automatic translation metrics, researchers should take
data variance into account and be cautious about claims based on a single
dataset, because the results may be inconsistent with those on most other
datasets.
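The central claim can be made concrete with a small sketch (not the authors' code; the metric names, scores, and human judgments below are all synthetic): rank each metric by its correlation with human judgments on each dataset, then compare the resulting rankings with Kendall's tau.

```python
# A minimal sketch of how metric rankings can be compared across datasets.
# All data is synthetic and purely illustrative.
import numpy as np
from scipy.stats import pearsonr, kendalltau

rng = np.random.default_rng(0)
n_segments = 500

def fake_dataset():
    """Simulate human scores and three hypothetical metrics with different noise levels."""
    human = rng.normal(size=n_segments)
    return {
        "human": human,
        "metricA": human + rng.normal(scale=0.8, size=n_segments),
        "metricB": human + rng.normal(scale=1.0, size=n_segments),
        "metricC": human + rng.normal(scale=1.2, size=n_segments),
    }

datasets = {"newstest_X": fake_dataset(), "newstest_Y": fake_dataset()}
metrics = ["metricA", "metricB", "metricC"]

# Rank metrics on each dataset by their Pearson correlation with human scores.
rankings = {}
for name, data in datasets.items():
    corrs = {m: pearsonr(data[m], data["human"])[0] for m in metrics}
    rankings[name] = sorted(metrics, key=corrs.get, reverse=True)
    print(name, rankings[name])

# Kendall's tau between the two rankings; low values indicate ranking disagreement.
tau, _ = kendalltau([rankings["newstest_X"].index(m) for m in metrics],
                    [rankings["newstest_Y"].index(m) for m in metrics])
print("ranking agreement (Kendall tau):", tau)
```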
Related papers
- Data Distribution Valuation [56.71023681599737]
Existing data valuation methods define a value for a discrete dataset.
In many use cases, users are interested not only in the value of the dataset, but also in that of the distribution from which it was sampled.
We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
arXiv Detail & Related papers (2024-10-06T07:56:53Z)
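As a hedged illustration of the quantity a distribution-level valuation like this can build on, the sketch below estimates the squared maximum mean discrepancy (MMD) between two samples with an RBF kernel; it is not the paper's valuation method, and the kernel bandwidth is an arbitrary assumption.

```python
# A minimal sketch of the unbiased estimator of MMD^2 between two samples
# with an RBF kernel. Illustrative only; not the paper's method.
import numpy as np

def rbf_kernel(X, Y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2), computed for all pairs."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimator of MMD^2 between samples X and Y."""
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, gamma)
    Kyy = rbf_kernel(Y, Y, gamma)
    Kxy = rbf_kernel(X, Y, gamma)
    # Exclude diagonal terms for the unbiased within-sample averages.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(200, 5)), rng.normal(size=(200, 5)))
shifted = mmd2_unbiased(rng.normal(size=(200, 5)), rng.normal(1.0, size=(200, 5)))
print(f"same distribution: {same:.4f}, shifted distribution: {shifted:.4f}")
```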
- Uncertainty Quantification of Data Shapley via Statistical Inference [20.35973700939768]
The emergence of data markets underscores the growing importance of data valuation.
Within the machine learning landscape, Data Shapley stands out as a widely embraced method for data valuation.
This paper establishes the relationship between Data Shapley and infinite-order U-statistics.
arXiv Detail & Related papers (2024-07-28T02:54:27Z)
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
arXiv Detail & Related papers (2024-07-03T17:04:17Z)
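The evaluation idea above can be sketched as a pairwise preference test: if systems improve over time, a good metric should usually score the more recent translation of the same source higher. The scores below are simulated, not drawn from the paper's dataset.

```python
# A minimal sketch of scoring a metric by its preference for more recent
# translations of the same segments. Synthetic data only.
import numpy as np

def recency_preference(scores_by_week):
    """Fraction of (earlier, later) week pairs in which the metric scores
    the later translation of the same segment higher."""
    scores = np.asarray(scores_by_week)      # shape: (n_weeks, n_segments)
    wins, total = 0, 0
    n_weeks = scores.shape[0]
    for i in range(n_weeks):
        for j in range(i + 1, n_weeks):
            wins += (scores[j] > scores[i]).sum()
            total += scores.shape[1]
    return wins / total

rng = np.random.default_rng(0)
# Simulate a metric that tracks gradual system improvement plus noise.
weeks, segments = 10, 200
quality = np.linspace(0.0, 1.0, weeks)[:, None]
metric_scores = quality + rng.normal(scale=0.5, size=(weeks, segments))
print("preference for more recent translations:", recency_preference(metric_scores))
```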
- Truthful Dataset Valuation by Pointwise Mutual Information [28.63827288801458]
We propose a new data valuation method that provably guarantees the following: data providers always maximize their expected score by truthfully reporting their observed data.
Our method, following the paradigm of proper scoring rules, measures the pointwise mutual information (PMI) of the test dataset and the evaluated dataset.
arXiv Detail & Related papers (2024-05-28T15:04:17Z)
- Benchmark Transparency: Measuring the Impact of Data on Evaluation [6.307485015636125]
We propose an automated framework that measures the data point distribution across 6 different dimensions.
We use disproportional stratified sampling to measure how much the data distribution affects absolute (Acc/F1) and relative (Rank) model performance.
We find that the impact of the data is statistically significant and is often larger than the impact of changing the metric.
arXiv Detail & Related papers (2024-03-31T17:33:43Z)
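A minimal sketch of the sampling idea, assuming synthetic per-example correctness flags and strata rather than the paper's framework: resampling the test set with different stratum proportions changes both absolute accuracy and which model ranks first.

```python
# Disproportional stratified sampling as a probe of how the test-data
# distribution affects absolute and relative model performance. Synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
strata = rng.integers(0, 2, size=n)              # e.g., "easy" vs. "hard" examples
# Model A is stronger on stratum 0, model B on stratum 1.
acc_a = np.where(strata == 0, 0.9, 0.6)
acc_b = np.where(strata == 0, 0.7, 0.8)
correct_a = rng.random(n) < acc_a
correct_b = rng.random(n) < acc_b

def evaluate(weights):
    """Resample the test set with the given stratum proportions and report
    both accuracies plus which model ranks first."""
    idx = np.concatenate([
        rng.choice(np.where(strata == s)[0], size=int(w * n), replace=True)
        for s, w in enumerate(weights)
    ])
    a, b = correct_a[idx].mean(), correct_b[idx].mean()
    return round(a, 3), round(b, 3), "A" if a > b else "B"

for weights in [(0.5, 0.5), (0.8, 0.2), (0.2, 0.8)]:
    print(weights, evaluate(weights))
```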
- Metric Learning Improves the Ability of Combinatorial Coverage Metrics to Anticipate Classification Error [0.0]
Many machine learning methods are sensitive to test or operational data that is dissimilar to training data.
Metric learning is a technique for learning latent spaces in which data from different classes lie further apart.
In a study of 6 open-source datasets, we find that metric learning increased the difference between set-difference coverage metrics calculated on correctly and incorrectly classified data.
arXiv Detail & Related papers (2023-02-28T14:55:57Z)
- Utility Assessment of Synthetic Data Generation Methods [0.0]
We investigate whether different methods of generating fully synthetic data vary in their utility a priori.
We find some methods to perform better than others across the board.
We do, however, obtain promising results for classification tasks when using synthetic data to train machine learning models.
arXiv Detail & Related papers (2022-11-23T11:09:52Z)
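A minimal sketch of one such utility assessment, assuming a deliberately naive per-class Gaussian generator rather than the generation methods studied in the paper: train a classifier on synthetic data and compare it with one trained on real data, both evaluated on a held-out real test set.

```python
# Train-on-synthetic vs. train-on-real comparison on a real test set.
# The Gaussian generator below is an illustrative assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Naive synthetic generator: sample each class from a fitted multivariate Gaussian.
X_syn, y_syn = [], []
for label in np.unique(y_tr):
    Xc = X_tr[y_tr == label]
    samples = rng.multivariate_normal(Xc.mean(axis=0), np.cov(Xc, rowvar=False),
                                      size=len(Xc))
    X_syn.append(samples)
    y_syn.append(np.full(len(Xc), label))
X_syn, y_syn = np.vstack(X_syn), np.concatenate(y_syn)

real_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
syn_acc = LogisticRegression(max_iter=1000).fit(X_syn, y_syn).score(X_te, y_te)
print(f"trained on real: {real_acc:.3f}, trained on synthetic: {syn_acc:.3f}")
```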
- Doing Great at Estimating CATE? On the Neglected Assumptions in Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z)
- Multi-Source Causal Inference Using Control Variates [81.57072928775509]
We propose a general algorithm to estimate causal effects from multiple data sources.
We show theoretically that this reduces the variance of the ATE estimate.
We apply this framework to inference from observational data under an outcome selection bias.
arXiv Detail & Related papers (2021-03-30T21:20:51Z)
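The variance-reduction claim rests on the control-variate idea, sketched below in a generic Monte Carlo setting (not the paper's multi-source ATE estimator): adjust an estimate using a correlated auxiliary quantity whose expectation is known.

```python
# Control-variate variance reduction in a toy Monte Carlo setting.
import numpy as np

rng = np.random.default_rng(0)

def estimate(n):
    w = rng.normal(size=n)                  # control variate with known mean 0
    y = 2.0 + 1.5 * w + rng.normal(size=n)  # target outcome, correlated with w
    naive = y.mean()
    c = np.cov(y, w)[0, 1] / np.var(w)      # (near-)optimal coefficient
    adjusted = naive - c * (w.mean() - 0.0)
    return naive, adjusted

results = np.array([estimate(200) for _ in range(2000)])
print("variance (naive):   ", results[:, 0].var())
print("variance (adjusted):", results[:, 1].var())
```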
- Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations [67.4375210552593]
We design experiments to understand an important but often ignored problem in visually grounded language generation.
Given that humans have different utilities and visual attention, how will the sample variance in multi-reference datasets affect the models' performance?
We show that it is of paramount importance to report variance in experiments, and that human-generated references can vary drastically across datasets/tasks, revealing the nature of each task.
arXiv Detail & Related papers (2020-10-07T20:45:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.