Investigating Data Variance in Evaluations of Automatic Machine
Translation Metrics
- URL: http://arxiv.org/abs/2203.15858v1
- Date: Tue, 29 Mar 2022 18:58:28 GMT
- Title: Investigating Data Variance in Evaluations of Automatic Machine
Translation Metrics
- Authors: Jiannan Xiang, Huayang Li, Yahui Liu, Lemao Liu, Guoping Huang, Defu
Lian, Shuming Shi
- Abstract summary: In this paper, we show that the performance of metrics is sensitive to data.
The ranking of metrics varies when the evaluation is conducted on different datasets.
- Score: 58.50754318846996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current practices in metric evaluation focus on one single dataset, e.g.,
Newstest dataset in each year's WMT Metrics Shared Task. However, in this
paper, we qualitatively and quantitatively show that the performance of
metrics is sensitive to data. The ranking of metrics varies when the
evaluation is conducted on different datasets. This paper then further
investigates two potential hypotheses, i.e., insignificant data points and
deviation from the Independent and Identically Distributed (i.i.d.) assumption,
which may be responsible for the issue of data variance. In conclusion, our
findings suggest that when evaluating automatic translation metrics,
researchers should take data variance into account and be cautious about
claiming results on a single dataset, because such results may be inconsistent
with those on most other datasets.
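The core observation can be illustrated with a small sketch (the metric names and scores below are hypothetical, not the paper's data): the same set of metrics, scored against human judgments on two datasets, can produce different rankings.

```python
# Hypothetical illustration of data variance in metric evaluation:
# the same metrics, scored on different datasets, may rank differently.

def rank(scores):
    """Return metric names ordered from best to worst score."""
    return [name for name, _ in sorted(scores.items(), key=lambda kv: -kv[1])]

# Hypothetical correlation-with-human scores for three metrics on two datasets.
newstest = {"BLEU": 0.42, "BLEURT": 0.61, "COMET": 0.58}
other    = {"BLEU": 0.40, "BLEURT": 0.55, "COMET": 0.59}

print(rank(newstest))  # ['BLEURT', 'COMET', 'BLEU']
print(rank(other))     # ['COMET', 'BLEURT', 'BLEU']
```

A conclusion drawn from the first dataset alone would not transfer to the second, which is exactly the variance the paper warns about.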
Related papers
- Truthful Dataset Valuation by Pointwise Mutual Information [28.63827288801458]
We propose a new data valuation method that provably guarantees the following: data providers always maximize their expected score by truthfully reporting their observed data.
Our method, following the paradigm of proper scoring rules, measures the pointwise mutual information (PMI) of the test dataset and the evaluated dataset.
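As a generic illustration of the pointwise mutual information the method builds on (a textbook PMI computation, not the paper's dataset-level estimator), PMI compares a joint probability against the product of the marginals:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log p(x, y) / (p(x) p(y))."""
    return math.log(p_xy / (p_x * p_y))

# Independent events -> PMI is 0; co-occurring more than chance -> positive.
print(pmi(0.25, 0.5, 0.5))           # 0.0
print(round(pmi(0.4, 0.5, 0.5), 3))  # 0.47
```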
arXiv Detail & Related papers (2024-05-28T15:04:17Z)
- Benchmark Transparency: Measuring the Impact of Data on Evaluation [6.307485015636125]
We propose an automated framework that measures the data point distribution across 6 different dimensions.
We use disproportional stratified sampling to measure how much the data distribution affects absolute (Acc/F1) and relative (Rank) model performance.
We find that the impact of the data is statistically significant and is often larger than the impact of changing the metric.
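A minimal sketch of disproportional stratified sampling (hypothetical strata and rates, not the paper's framework): each stratum is drawn at its own rate, which deliberately shifts the evaluation distribution.

```python
import random

def stratified_sample(strata, rates, seed=0):
    """Sample each stratum at its own rate; rates need not be equal."""
    rng = random.Random(seed)
    sample = []
    for name, items in strata.items():
        k = round(len(items) * rates[name])  # stratum-specific sample size
        sample.extend(rng.sample(items, k))
    return sample

# Two hypothetical strata of 100 items each, drawn at unequal rates.
strata = {"easy": list(range(100)), "hard": list(range(100, 200))}
sample = stratified_sample(strata, {"easy": 0.8, "hard": 0.2})
print(len(sample))  # 100 items: 80 easy, 20 hard
```

Evaluating a model on such resampled test sets, and comparing scores and rankings across them, is one way to measure how much the data distribution drives the result.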
arXiv Detail & Related papers (2024-03-31T17:33:43Z)
- Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and
Beyond [93.96982273042296]
Vision-language (VL) understanding tasks evaluate models' comprehension of complex visual scenes through multiple-choice questions.
We have identified two dataset biases that models can exploit as shortcuts to resolve various VL tasks correctly without proper understanding.
We propose Adversarial Data Synthesis (ADS) to generate synthetic training and debiased evaluation data.
We then introduce Intra-sample Counterfactual Training (ICT) to assist models in utilizing the synthesized training data, particularly the counterfactual data, via focusing on intra-sample differentiation.
arXiv Detail & Related papers (2023-10-23T08:09:42Z)
- Metric Learning Improves the Ability of Combinatorial Coverage Metrics
to Anticipate Classification Error [0.0]
Many machine learning methods are sensitive to test or operational data that is dissimilar to training data.
Metric learning is a technique for learning latent spaces in which data from different classes lie further apart.
In a study of 6 open-source datasets, we find that metric learning increased the difference between set-difference coverage metrics calculated on correctly and incorrectly classified data.
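Set-difference coverage can be illustrated with a toy example (hypothetical data, pairwise value combinations only): the feature-value combinations that appear in test data but not in training data are the uncovered ones such metrics count.

```python
from itertools import combinations

def pair_coverage(rows):
    """Set of (feature-index pair, value pair) combinations present in the data."""
    cov = set()
    for row in rows:
        for (i, a), (j, b) in combinations(enumerate(row), 2):
            cov.add(((i, j), (a, b)))
    return cov

# Two hypothetical datasets with two binary features each.
train = [(0, 1), (1, 0)]
test  = [(0, 1), (1, 1)]

# Set-difference coverage: combinations in test data not covered by training data.
print(pair_coverage(test) - pair_coverage(train))  # {((0, 1), (1, 1))}
```

The abstract's finding is that, after metric learning, this kind of coverage gap separates correctly and incorrectly classified data more sharply.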
arXiv Detail & Related papers (2023-02-28T14:55:57Z)
- Utility Assessment of Synthetic Data Generation Methods [0.0]
We investigate whether different methods of generating fully synthetic data vary in their utility a priori.
We find some methods to perform better than others across the board.
We also find promising results for classification tasks when synthetic data are used to train machine learning models.
arXiv Detail & Related papers (2022-11-23T11:09:52Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous
examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Statistical Learning to Operationalize a Domain Agnostic Data Quality
Scoring [8.864453148536061]
The study provides an automated platform that takes an incoming dataset and its metadata and produces a data quality (DQ) score, report, and label.
The results would be useful to data scientists, as the quality label instills confidence in the data before it is deployed in a practical application.
arXiv Detail & Related papers (2021-08-16T12:20:57Z)
- Doing Great at Estimating CATE? On the Neglected Assumptions in
Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z)
- Multi-Source Causal Inference Using Control Variates [81.57072928775509]
We propose a general algorithm to estimate causal effects from multiple data sources.
We show theoretically that this reduces the variance of the ATE estimate.
We apply this framework to inference from observational data under an outcome selection bias.
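The variance-reduction idea behind control variates can be sketched generically (synthetic data, not the paper's ATE estimator): subtracting a correlated quantity with known mean leaves the estimate unbiased but lowers its variance.

```python
import random
import statistics

rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(5000)]      # control variate with known mean 0
x = [1.0 + zi + rng.gauss(0, 0.1) for zi in z]  # noisy estimates of a target mean 1.0
adjusted = [xi - zi for xi, zi in zip(x, z)]    # subtract the control variate

# The adjusted estimator has the same mean but much lower variance.
print(statistics.variance(x) > statistics.variance(adjusted))  # True
```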
arXiv Detail & Related papers (2021-03-30T21:20:51Z)
- Towards Understanding Sample Variance in Visually Grounded Language
Generation: Evaluations and Observations [67.4375210552593]
We design experiments to understand an important but often ignored problem in visually grounded language generation.
Given that humans have different utilities and visual attention, how will the sample variance in multi-reference datasets affect the models' performance?
We show that it is of paramount importance to report variance in experiments; that human-generated references could vary drastically in different datasets/tasks, revealing the nature of each task.
arXiv Detail & Related papers (2020-10-07T20:45:14Z)
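The recommendation to report variance can be sketched as follows (hypothetical per-reference scores): report a mean together with a standard deviation rather than a single point estimate.

```python
import statistics

# Hypothetical metric scores of one model against five different references.
scores_per_reference = [0.31, 0.44, 0.28, 0.39, 0.35]

mean = statistics.mean(scores_per_reference)
std = statistics.stdev(scores_per_reference)
print(f"{mean:.3f} +/- {std:.3f}")  # 0.354 +/- 0.063
```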
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.