We Need to Talk About Random Splits
- URL: http://arxiv.org/abs/2005.00636v3
- Date: Mon, 26 Apr 2021 12:05:35 GMT
- Title: We Need to Talk About Random Splits
- Authors: Anders Søgaard and Sebastian Ebert and Jasmijn Bastings and Katja Filippova
- Abstract summary: Gorman and Bedrick argued for using random splits rather than standard splits in NLP experiments.
We argue that random splits, like standard splits, lead to overly optimistic performance estimates.
- Score: 3.236124102160291
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gorman and Bedrick (2019) argued for using random splits rather than standard
splits in NLP experiments. We argue that random splits, like standard splits,
lead to overly optimistic performance estimates. We can also split data in
biased or adversarial ways, e.g., training on short sentences and evaluating on
long ones. Biased sampling has been used in domain adaptation to simulate
real-world drift under the covariate shift assumption. In NLP,
however, even worst-case splits, maximizing bias, often under-estimate the
error observed on new samples of in-domain data, i.e., the data that models
should minimally generalize to at test time. This invalidates the covariate
shift assumption. Rather than relying on multiple random splits, future
benchmarks should ideally include multiple, independent test sets; if that is
infeasible, we argue that multiple biased splits lead to more realistic
performance estimates than multiple random splits.
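As a concrete illustration of the kind of biased split the abstract describes (training on short sentences and evaluating on long ones), here is a minimal sketch; the data format, split fraction, and length heuristic are assumptions for illustration, not the authors' actual procedure.

```python
import random

def random_split(examples, test_fraction=0.2, seed=0):
    """Standard random split: train and test are drawn i.i.d. from the same sample."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def length_biased_split(examples, test_fraction=0.2):
    """Biased split: train on the shortest sentences, evaluate on the longest.

    This simulates covariate shift along one observable dimension (length);
    the paper's worst-case splits push the train/test mismatch further, but
    the mechanics are the same.
    """
    ordered = sorted(examples, key=lambda ex: len(ex["sentence"].split()))
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]

if __name__ == "__main__":
    data = [
        {"sentence": "Good film .", "label": 1},
        {"sentence": "Great acting .", "label": 1},
        {"sentence": "A tedious , overlong drama that never earns its running time .", "label": 0},
        {"sentence": "The plot meanders through subplots that go nowhere in particular .", "label": 0},
    ]
    train, test = length_biased_split(data, test_fraction=0.5)
    print([ex["sentence"] for ex in test])  # the two longest sentences end up in the test set
```

Comparing a model's score under the two splits gives a first, crude sense of how much a random split can overestimate performance under drift.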
Related papers
- Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z)
- It's about Time: Rethinking Evaluation on Rumor Detection Benchmarks using Chronological Splits [27.061515030101972]
We provide a re-evaluation of classification models on four popular rumor detection benchmarks considering chronological instead of random splits.
Our experimental results show that the use of random splits can significantly overestimate predictive performance across all datasets and models.
We suggest that rumor detection models should always be evaluated using chronological splits to minimize topical overlap.
arXiv Detail & Related papers (2023-02-06T22:53:13Z)
- Benchmarking Long-tail Generalization with Likelihood Splits [20.47194488430863]
We propose a method to create challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets.
We create 'Likelihood Splits' where examples that are assigned lower likelihood by a pre-trained language model are placed in the test set, and more likely examples are placed in the training set (a minimal sketch of this procedure appears after this list).
arXiv Detail & Related papers (2022-10-13T07:27:14Z)
- Bias Mimicking: A Simple Sampling Approach for Bias Mitigation [57.17709477668213]
We introduce a new class-conditioned sampling method: Bias Mimicking.
Bias Mimicking improves the accuracy of sampling methods on underrepresented groups by 3% across four benchmarks.
arXiv Detail & Related papers (2022-09-30T17:33:00Z)
- Learning to Split for Automatic Bias Detection [39.353850990332525]
Learning to Split (ls) is an algorithm for automatic bias detection.
We evaluate our approach on Beer Review, CelebA and MNLI.
arXiv Detail & Related papers (2022-04-28T19:41:08Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with distributionally robust optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics.
We prove that even when there is only bias in the input distribution, models can still pick up spurious features from their training data.
Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
arXiv Detail & Related papers (2021-06-14T05:39:09Z)
- Optimization Variance: Exploring Generalization Properties of DNNs [83.78477167211315]
The test error of a deep neural network (DNN) often demonstrates double descent.
We propose a novel metric, optimization variance (OV), to measure the diversity of model updates.
arXiv Detail & Related papers (2021-06-03T09:34:17Z)
- Significance tests of feature relevance for a blackbox learner [6.72450543613463]
We derive two consistent tests for the feature relevance of a blackbox learner.
The first evaluates a loss difference with perturbation on an inference sample.
The second splits the inference sample into two but does not require data perturbation.
arXiv Detail & Related papers (2021-03-02T00:59:19Z)
- Individual Calibration with Randomized Forecasting [116.2086707626651]
We show that calibration for individual samples is possible in the regression setup if the predictions are randomized.
We design a training objective to enforce individual calibration and use it to train randomized regression functions.
arXiv Detail & Related papers (2020-06-18T05:53:10Z)
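As a concrete illustration of the Likelihood Splits idea referenced above, the following sketch ranks examples by their log-likelihood under a pre-trained language model and sends the least likely ones to the test set. The choice of GPT-2, per-token normalization, and split fraction are assumptions for illustration, not the paper's exact setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumption: GPT-2 as the scoring model; the paper may use a different LM.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def per_token_log_likelihood(text: str) -> float:
    """Average log-likelihood per token under the language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = model(ids, labels=ids)  # out.loss is the mean negative log-likelihood
    return -out.loss.item()

def likelihood_split(examples, test_fraction=0.2):
    """Place the lowest-likelihood (tail) examples in the test set."""
    scored = sorted(examples, key=lambda ex: per_token_log_likelihood(ex["text"]))
    n_test = int(len(scored) * test_fraction)
    return scored[n_test:], scored[:n_test]  # (train, test)
```

Normalizing by length keeps the ranking from simply reproducing a length-biased split; whether to normalize is itself a design choice for the benchmark creator.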