Information-Theoretic Measures of Dataset Difficulty
- URL: http://arxiv.org/abs/2110.08420v1
- Date: Sat, 16 Oct 2021 00:21:42 GMT
- Title: Information-Theoretic Measures of Dataset Difficulty
- Authors: Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta
- Abstract summary: Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans.
We propose an information-theoretic perspective, framing dataset difficulty as the absence of usable information.
- Score: 54.538766940287864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Estimating the difficulty of a dataset typically involves comparing
state-of-the-art models to humans; the bigger the performance gap, the harder
the dataset is said to be. Not only is this framework informal, but it also
provides little understanding of how difficult each instance is, or what
attributes make it difficult for a given model. To address these problems, we
propose an information-theoretic perspective, framing dataset difficulty as the
absence of $\textit{usable information}$. Measuring usable information is as
easy as measuring performance, but has certain theoretical advantages. While
the latter only allows us to compare different models w.r.t. the same dataset,
the former also allows us to compare different datasets w.r.t. the same model.
We then introduce $\textit{pointwise}$ $\mathcal{V}$-$\textit{information}$
(PVI) for measuring the difficulty of individual instances, where instances
with higher PVI are easier for model $\mathcal{V}$. By manipulating the input
before measuring usable information, we can understand $\textit{why}$ a dataset
is easy or difficult for a given model, which we use to discover annotation
artefacts in widely-used benchmarks.
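As a rough guide to the quantities named above, the following is a hedged sketch of how $\mathcal{V}$-usable information and PVI are typically defined in the predictive $\mathcal{V}$-information framework the paper builds on (Xu et al., 2020); consult the paper for the exact formulation. For a model family $\mathcal{V}$, input $X$, label $Y$, and a null input $\varnothing$:
$$H_{\mathcal{V}}(Y) = \inf_{f \in \mathcal{V}} \mathbb{E}\big[-\log_2 f[\varnothing](Y)\big]$$
$$H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}\big[-\log_2 f[X](Y)\big]$$
$$I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X)$$
$$\mathrm{PVI}(x \to y) = -\log_2 g'[\varnothing](y) + \log_2 g[x](y)$$
Here $g'$ and $g$ denote members of $\mathcal{V}$ fit on null inputs and on the actual inputs, respectively, and $f[x](y)$ is the probability the model assigns to label $y$ given input $x$. Under this sketch, the dataset-level $\mathcal{V}$-usable information is the mean PVI over instances, so a low mean PVI marks a dataset that is hard for $\mathcal{V}$.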
Related papers
- Making Multi-Axis Gaussian Graphical Models Scalable to Millions of Samples and Features [0.30723404270319693]
We introduce a method that has $O(n^2)$ runtime and $O(n)$ space complexity, without assuming independence.
We demonstrate that our approach can be used on unprecedentedly large datasets, such as a real-world 1,000,000-cell scRNA-seq dataset.
arXiv Detail & Related papers (2024-07-29T11:15:25Z) - $\textit{GeoHard}$: Towards Measuring Class-wise Hardness through Modelling Class Semantics [90.9047957137981]
This work formally initiates the concept of $\textit{class-wise hardness}$.
Experiments across eight natural language understanding (NLU) datasets demonstrate a consistent hardness distribution across learning paradigms, models, and human judgment.
$\textit{GeoHard}$ surpasses instance-level metrics by over 59 percent in $\textit{Pearson}$'s correlation for measuring class-wise hardness.
arXiv Detail & Related papers (2024-07-17T11:53:39Z) - The Unreasonable Effectiveness of Easy Training Data for Hard Tasks [84.30018805150607]
We present the surprising conclusion that current pretrained language models often generalize relatively well from easy to hard data.
We demonstrate this kind of easy-to-hard generalization using simple methods such as in-context learning, linear heads, and QLoRA.
We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied.
arXiv Detail & Related papers (2024-01-12T18:36:29Z) - Data Factors for Better Compositional Generalization [60.698130703909804]
We conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors.
We show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges.
We explore how training examples of different difficulty levels influence generalization differently.
arXiv Detail & Related papers (2023-11-08T01:27:34Z) - Simplicity Bias Leads to Amplified Performance Disparities [8.60453031364566]
We show that SGD-trained models have a bias towards simplicity, leading them to prioritize learning a majority class.
A model may prioritize any class or group of the dataset that it finds simple, at the expense of what it finds complex.
arXiv Detail & Related papers (2022-12-13T15:24:41Z) - PyHard: a novel tool for generating hardness embeddings to support data-centric analysis [0.38233569758620045]
PyHard produces a hardness embedding of a dataset that relates instance hardness to the predictive performance of multiple ML models.
The user can interact with this embedding in multiple ways to obtain useful insights about data and algorithmic performance.
We show in a COVID prognosis dataset how this analysis supported the identification of pockets of hard observations that challenge ML models.
arXiv Detail & Related papers (2021-09-29T14:08:26Z) - Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z) - Evaluating Models' Local Decision Boundaries via Contrast Sets [119.38387782979474]
We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data.
We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets.
Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets.
arXiv Detail & Related papers (2020-04-06T14:47:18Z)