Investigating Data Variance in Evaluations of Automatic Machine
Translation Metrics
- URL: http://arxiv.org/abs/2203.15858v1
- Date: Tue, 29 Mar 2022 18:58:28 GMT
- Title: Investigating Data Variance in Evaluations of Automatic Machine
Translation Metrics
- Authors: Jiannan Xiang, Huayang Li, Yahui Liu, Lemao Liu, Guoping Huang, Defu
Lian, Shuming Shi
- Abstract summary: In this paper, we show that the performance of metrics is sensitive to data.
The ranking of metrics varies when the evaluation is conducted on different datasets.
- Score: 58.50754318846996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current practices in metric evaluation focus on one single dataset, e.g.,
Newstest dataset in each year's WMT Metrics Shared Task. However, in this
paper, we qualitatively and quantitatively show that the performance of
metrics is sensitive to data. The ranking of metrics varies when the
evaluation is conducted on different datasets. This paper then further
investigates two potential hypotheses, i.e., insignificant data points and
deviation from the Independent and Identically Distributed (i.i.d.) assumption,
which may be responsible for the issue of data variance. In conclusion, our
findings suggest that when evaluating automatic translation metrics,
researchers should take data variance into account and be cautious about
claiming results on a single dataset, because such results may be inconsistent
with those on most other datasets.
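The core observation can be illustrated with a small sketch (the metric names and scores below are hypothetical, not the paper's data): the same set of metrics, scored against human judgments on two datasets, can produce different rankings.

```python
# Hypothetical illustration of data variance in metric evaluation:
# the same metrics, scored on different datasets, may rank differently.

def rank(scores):
    """Return metric names ordered from best to worst score."""
    return [name for name, _ in sorted(scores.items(), key=lambda kv: -kv[1])]

# Hypothetical correlation-with-human scores for three metrics on two datasets.
newstest = {"BLEU": 0.42, "BLEURT": 0.61, "COMET": 0.58}
other    = {"BLEU": 0.40, "BLEURT": 0.55, "COMET": 0.59}

print(rank(newstest))  # ['BLEURT', 'COMET', 'BLEU']
print(rank(other))     # ['COMET', 'BLEURT', 'BLEU']
```

A conclusion drawn from the first dataset alone would not transfer to the second, which is exactly the variance the paper warns about.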
Related papers
- Truthful Dataset Valuation by Pointwise Mutual Information [28.63827288801458]
We propose a new data valuation method that provably guarantees the following: data providers always maximize their expected score by truthfully reporting their observed data.
Our method, following the paradigm of proper scoring rules, measures the pointwise mutual information (PMI) of the test dataset and the evaluated dataset.
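As a generic illustration of the pointwise mutual information the method builds on (a textbook PMI computation, not the paper's dataset-level estimator), PMI compares a joint probability against the product of the marginals:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log p(x, y) / (p(x) p(y))."""
    return math.log(p_xy / (p_x * p_y))

# Independent events -> PMI is 0; co-occurring more than chance -> positive.
print(pmi(0.25, 0.5, 0.5))           # 0.0
print(round(pmi(0.4, 0.5, 0.5), 3))  # 0.47
```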
arXiv Detail & Related papers (2024-05-28T15:04:17Z)
- Benchmark Transparency: Measuring the Impact of Data on Evaluation [6.307485015636125]
We propose an automated framework that measures the data point distribution across 6 different dimensions.
We use disproportional stratified sampling to measure how much the data distribution affects absolute (Acc/F1) and relative (Rank) model performance.
We find that the impact of the data is statistically significant and is often larger than the impact of changing the metric.
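A minimal sketch of disproportional stratified sampling (hypothetical strata and rates, not the paper's framework): each stratum is drawn at its own rate, which deliberately shifts the evaluation distribution.

```python
import random

def stratified_sample(strata, rates, seed=0):
    """Sample each stratum at its own rate; rates need not be equal."""
    rng = random.Random(seed)
    sample = []
    for name, items in strata.items():
        k = round(len(items) * rates[name])  # stratum-specific sample size
        sample.extend(rng.sample(items, k))
    return sample

# Two hypothetical strata of 100 items each, drawn at unequal rates.
strata = {"easy": list(range(100)), "hard": list(range(100, 200))}
sample = stratified_sample(strata, {"easy": 0.8, "hard": 0.2})
print(len(sample))  # 100 items: 80 easy, 20 hard
```

Evaluating a model on such resampled test sets, and comparing scores and rankings across them, is one way to measure how much the data distribution drives the result.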
arXiv Detail & Related papers (2024-03-31T17:33:43Z)
- Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and
Beyond [93.96982273042296]
Vision-language (VL) understanding tasks evaluate models' comprehension of complex visual scenes through multiple-choice questions.
We have identified two dataset biases that models can exploit as shortcuts to resolve various VL tasks correctly without proper understanding.
We propose Adversarial Data Synthesis (ADS) to generate synthetic training and debiased evaluation data.
We then introduce Intra-sample Counterfactual Training (ICT) to assist models in utilizing the synthesized training data, particularly the counterfactual data, via focusing on intra-sample differentiation.
arXiv Detail & Related papers (2023-10-23T08:09:42Z)
- Metric Learning Improves the Ability of Combinatorial Coverage Metrics
to Anticipate Classification Error [0.0]
Many machine learning methods are sensitive to test or operational data that is dissimilar to training data.
Metric learning is a technique for learning latent spaces in which data from different classes lie further apart.
In a study of 6 open-source datasets, we find that metric learning increased the difference between set-difference coverage metrics calculated on correctly and incorrectly classified data.
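Set-difference coverage can be illustrated with a toy example (hypothetical data, pairwise value combinations only): the feature-value combinations that appear in test data but not in training data are the uncovered ones such metrics count.

```python
from itertools import combinations

def pair_coverage(rows):
    """Set of (feature-index pair, value pair) combinations present in the data."""
    cov = set()
    for row in rows:
        for (i, a), (j, b) in combinations(enumerate(row), 2):
            cov.add(((i, j), (a, b)))
    return cov

# Two hypothetical datasets with two binary features each.
train = [(0, 1), (1, 0)]
test  = [(0, 1), (1, 1)]

# Set-difference coverage: combinations in test data not covered by training data.
print(pair_coverage(test) - pair_coverage(train))  # {((0, 1), (1, 1))}
```

The abstract's finding is that, after metric learning, this kind of coverage gap separates correctly and incorrectly classified data more sharply.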
arXiv Detail & Related papers (2023-02-28T14:55:57Z)
- Utility Assessment of Synthetic Data Generation Methods [0.0]
We investigate whether different methods of generating fully synthetic data vary in their utility a priori.
We find some methods to perform better than others across the board.
We also find promising results for classification tasks when synthetic data are used to train machine learning models.
arXiv Detail & Related papers (2022-11-23T11:09:52Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous
examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Statistical Learning to Operationalize a Domain Agnostic Data Quality
Scoring [8.864453148536061]
The study provides an automated platform that takes an incoming dataset and its metadata and produces a data quality (DQ) score, report, and label.
The results would be useful to data scientists, as the quality label instills confidence in the data before it is deployed in a practical application.
arXiv Detail & Related papers (2021-08-16T12:20:57Z)
- Doing Great at Estimating CATE? On the Neglected Assumptions in
Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z)
- Multi-Source Causal Inference Using Control Variates [81.57072928775509]
We propose a general algorithm to estimate causal effects from multiple data sources.
We show theoretically that this reduces the variance of the ATE estimate.
We apply this framework to inference from observational data under an outcome selection bias.
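The variance-reduction idea behind control variates can be sketched generically (synthetic data, not the paper's ATE estimator): subtracting a correlated quantity with known mean leaves the estimate unbiased but lowers its variance.

```python
import random
import statistics

rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(5000)]      # control variate with known mean 0
x = [1.0 + zi + rng.gauss(0, 0.1) for zi in z]  # noisy estimates of a target mean 1.0
adjusted = [xi - zi for xi, zi in zip(x, z)]    # subtract the control variate

# The adjusted estimator has the same mean but much lower variance.
print(statistics.variance(x) > statistics.variance(adjusted))  # True
```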
arXiv Detail & Related papers (2021-03-30T21:20:51Z)
- Towards Understanding Sample Variance in Visually Grounded Language
Generation: Evaluations and Observations [67.4375210552593]
We design experiments to understand an important but often ignored problem in visually grounded language generation.
Given that humans have different utilities and visual attention, how will the sample variance in multi-reference datasets affect the models' performance?
We show that it is of paramount importance to report variance in experiments; that human-generated references could vary drastically in different datasets/tasks, revealing the nature of each task.
arXiv Detail & Related papers (2020-10-07T20:45:14Z)
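The recommendation to report variance can be sketched as follows (hypothetical per-reference scores): report a mean together with a standard deviation rather than a single point estimate.

```python
import statistics

# Hypothetical metric scores of one model against five different references.
scores_per_reference = [0.31, 0.44, 0.28, 0.39, 0.35]

mean = statistics.mean(scores_per_reference)
std = statistics.stdev(scores_per_reference)
print(f"{mean:.3f} +/- {std:.3f}")  # 0.354 +/- 0.063
```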
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.