Showing Your Work Doesn't Always Work
- URL: http://arxiv.org/abs/2004.13705v1
- Date: Tue, 28 Apr 2020 17:59:01 GMT
- Title: Showing Your Work Doesn't Always Work
- Authors: Raphael Tang, Jaejun Lee, Ji Xin, Xinyu Liu, Yaoliang Yu, Jimmy Lin
- Abstract summary: "Show Your Work: Improved Reporting of Experimental Results" advocates for reporting the expected validation effectiveness of the best-tuned model.
We analytically show that their estimator is biased and uses error-prone assumptions.
We derive an unbiased alternative and bolster our claims with empirical evidence from statistical simulation.
- Score: 73.63200097493576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In natural language processing, a recently popular line of work explores how
to best report the experimental results of neural networks. One exemplar
publication, titled "Show Your Work: Improved Reporting of Experimental
Results," advocates for reporting the expected validation effectiveness of the
best-tuned model, with respect to the computational budget. In the present
work, we critically examine this paper. As far as statistical generalizability
is concerned, we find unspoken pitfalls and caveats with this approach. We
analytically show that their estimator is biased and uses error-prone
assumptions. We find that the estimator favors negative errors and yields poor
bootstrapped confidence intervals. We derive an unbiased alternative and
bolster our claims with empirical evidence from statistical simulation. Our
codebase is at http://github.com/castorini/meanmax.
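To make the abstract's central claim concrete, here is a minimal simulation sketch (not the authors' meanmax code; the normal distribution, pool size n, budget B, and trial counts are illustrative assumptions). It compares the plug-in estimator of expected maximum validation performance under the empirical CDF, of the kind used in budget-aware reporting, against a standard unbiased order-statistic estimator of the expected maximum of B i.i.d. draws. Averaged over many simulated hyperparameter pools, the plug-in estimate lands below the true expected maximum, while the unbiased estimate centers on it.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

def plugin_expected_max(values, budget):
    """Plug-in estimate of E[max of `budget` draws], using the empirical CDF
    of the observed trial results (a plug-in estimator of the kind critiqued above)."""
    v = np.sort(values)
    n = len(v)
    i = np.arange(1, n + 1)
    weights = (i / n) ** budget - ((i - 1) / n) ** budget
    return float(weights @ v)

def unbiased_expected_max(values, budget):
    """Unbiased estimate of E[max of `budget` draws]: the average of the maximum
    over all size-`budget` subsets of the observed results, written in closed
    form with order statistics."""
    v = np.sort(values)
    n = len(v)
    weights = np.array([comb(i - 1, budget - 1) for i in range(1, n + 1)]) / comb(n, budget)
    return float(weights @ v)

n, budget, trials = 50, 10, 5_000

# Ground truth E[max of `budget` i.i.d. N(0,1) draws], by direct Monte Carlo.
truth = rng.normal(size=(100_000, budget)).max(axis=1).mean()

plugin_mean = np.mean([plugin_expected_max(rng.normal(size=n), budget) for _ in range(trials)])
unbiased_mean = np.mean([unbiased_expected_max(rng.normal(size=n), budget) for _ in range(trials)])

print(f"true E[max of {budget}]       ~ {truth:.3f}")
print(f"plug-in estimator (mean)  ~ {plugin_mean:.3f}   # systematically low")
print(f"unbiased estimator (mean) ~ {unbiased_mean:.3f}")
```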
Related papers
- Generalized Regression with Conditional GANs (2024-04-21) [2.4171019220503402]
We propose to learn a prediction function whose outputs, when paired with the corresponding inputs, are indistinguishable from feature-label pairs in the training dataset.
We show that this approach to regression makes fewer assumptions about the distribution of the data being fit and therefore has greater representational capacity.
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance (2022-01-11) [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence on labeled source data and predicts target accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold (see the sketch after this list).
- Expected Validation Performance and Estimation of a Random Variable's Maximum (2021-10-01) [48.83713377993604]
We analyze three statistical estimators for expected validation performance.
We find the unbiased estimator has the highest variance, and the estimator with the smallest variance has the largest bias.
We find that the two biased estimators lead to the fewest incorrect conclusions.
- Bayesian analysis of the prevalence bias: learning and predicting from imbalanced data (2021-07-31) [10.659348599372944]
This paper lays the theoretical and computational framework for training models, and for prediction, in the presence of prevalence bias.
It offers a principled alternative to standard training losses and complements test-time procedures based on selecting an operating point from summary curves.
It integrates seamlessly in the current paradigm of (deep) learning using backpropagation and naturally with Bayesian models.
- Hidden Biases in Unreliable News Detection Datasets (2021-04-20) [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observe a significant drop (>10%) in accuracy for all tested models on a clean split with no train/test source overlap.
We suggest that future dataset creation include a simple model as a difficulty/bias probe, and that future model development use a clean, non-overlapping site and date split.
- Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation (2021-04-12) [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
- SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression (2021-03-23) [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets.
Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
- Estimating Treatment Effects with Observed Confounders and Mediators (2020-03-26) [25.338901482522648]
Given a causal graph, the do-calculus can express treatment effects as functionals of the observational joint distribution that can be estimated empirically.
Sometimes the do-calculus identifies multiple valid formulae, prompting us to compare the statistical properties of the corresponding estimators.
In this paper, we investigate the over-identified scenario where both confounders and mediators are observed, rendering both estimators valid; a toy comparison of the two appears in the sketch after this list.
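For the Leveraging Unlabeled Data entry above, a minimal sketch of the ATC idea, assuming max-softmax probability as the confidence score; the function name and interface are illustrative, not the authors' reference implementation.

```python
import numpy as np

def atc_predict_accuracy(source_probs, source_labels, target_probs):
    """Predict target-domain accuracy from labeled source data and
    unlabeled target data via Average Thresholded Confidence (ATC)."""
    src_conf = source_probs.max(axis=1)                       # confidence per source example
    src_acc = (source_probs.argmax(axis=1) == source_labels).mean()
    # Pick the threshold so the fraction of source examples above it
    # matches the source accuracy.
    threshold = np.quantile(src_conf, 1.0 - src_acc)
    # Predicted target accuracy: fraction of unlabeled target examples
    # whose confidence exceeds that threshold.
    return float((target_probs.max(axis=1) > threshold).mean())
```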
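For the Estimating Treatment Effects entry, a toy comparison of the two estimators the do-calculus identifies when both a confounder Z and a mediator M are observed; the binary data-generating process and variable names are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative graph: Z -> T, Z -> Y (confounding), T -> M -> Y (mediation).
Z = rng.binomial(1, 0.5, n)
T = rng.binomial(1, 0.2 + 0.5 * Z)             # treatment, confounded by Z
M = rng.binomial(1, 0.1 + 0.7 * T)             # mediator, depends only on T
Y = rng.binomial(1, 0.1 + 0.5 * M + 0.3 * Z)   # outcome, depends on M and Z

def prob(mask):
    return mask.mean()

def mean_y(mask):
    return Y[mask].mean()

def backdoor(t):
    # E[Y | do(T=t)] = sum_z P(z) * E[Y | T=t, Z=z]
    return sum(prob(Z == z) * mean_y((T == t) & (Z == z)) for z in (0, 1))

def frontdoor(t):
    # E[Y | do(T=t)] = sum_m P(m | T=t) * sum_t' P(t') * E[Y | M=m, T=t']
    return sum(
        prob((M == m) & (T == t)) / prob(T == t)
        * sum(prob(T == tp) * mean_y((M == m) & (T == tp)) for tp in (0, 1))
        for m in (0, 1)
    )

for t in (0, 1):
    print(f"do(T={t}): backdoor={backdoor(t):.3f}  frontdoor={frontdoor(t):.3f}")
# Both estimators target the same interventional quantity (here, 0.30 and 0.65);
# their finite-sample statistical properties can differ, which is the paper's focus.
```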
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.