Dataset Cartography: Mapping and Diagnosing Datasets with Training
Dynamics
- URL: http://arxiv.org/abs/2009.10795v2
- Date: Thu, 15 Oct 2020 05:53:46 GMT
- Title: Dataset Cartography: Mapping and Diagnosing Datasets with Training
Dynamics
- Authors: Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang,
Hannaneh Hajishirzi, Noah A. Smith, Yejin Choi
- Abstract summary: We introduce Data Maps, a model-based tool to characterize and diagnose datasets.
We leverage a largely ignored source of information: the behavior of the model on individual instances during training.
Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
- Score: 118.75207687144817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large datasets have become commonplace in NLP research. However, the
increased emphasis on data quantity has made it challenging to assess the
quality of data. We introduce Data Maps---a model-based tool to characterize
and diagnose datasets. We leverage a largely ignored source of information: the
behavior of the model on individual instances during training (training
dynamics) for building data maps. This yields two intuitive measures for each
example---the model's confidence in the true class, and the variability of this
confidence across epochs---obtained in a single run of training. Experiments
across four datasets show that these model-dependent measures reveal three
distinct regions in the data map, each with pronounced characteristics. First,
our data maps show the presence of "ambiguous" regions with respect to the
model, which contribute the most towards out-of-distribution generalization.
Second, the most populous regions in the data are "easy to learn" for the
model, and play an important role in model optimization. Finally, data maps
uncover a region with instances that the model finds "hard to learn"; these
often correspond to labeling errors. Our results indicate that a shift in focus
from quantity to quality of data could lead to robust models and improved
out-of-distribution generalization.
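The two measures from the abstract can be computed directly from per-epoch predictions: confidence is the mean probability the model assigns to an example's gold label across training epochs, and variability is the standard deviation of that probability. A minimal sketch (function name and toy numbers are illustrative, not from the paper):

```python
import numpy as np

def data_map_coordinates(true_class_probs):
    """Compute the two Data Map measures for each training example.

    true_class_probs: array of shape (num_epochs, num_examples), where
    entry [e, i] is the model's probability for example i's gold label
    at the end of epoch e.
    Returns (confidence, variability), each of shape (num_examples,).
    """
    probs = np.asarray(true_class_probs, dtype=float)
    confidence = probs.mean(axis=0)   # mean gold-label probability across epochs
    variability = probs.std(axis=0)   # spread of that probability across epochs
    return confidence, variability

# Toy run: 4 epochs, 3 examples.
probs = np.array([
    [0.90, 0.20, 0.50],
    [0.95, 0.25, 0.80],
    [0.97, 0.20, 0.30],
    [0.98, 0.15, 0.70],
])
conf, var = data_map_coordinates(probs)
# Example 0: high confidence, low variability  -> "easy to learn"
# Example 1: low confidence, low variability   -> "hard to learn" (possible label error)
# Example 2: mid confidence, high variability  -> "ambiguous"
```

Plotting variability on the x-axis against confidence on the y-axis gives the data map itself, with the three regions described in the abstract occupying distinct areas of the plot.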
Related papers
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- Scaling Laws for the Value of Individual Data Points in Machine Learning [55.596413470429475]

We introduce a new perspective by investigating scaling behavior for the value of individual data points.
We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes.
Our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
arXiv Detail & Related papers (2024-05-30T20:10:24Z)
- Automated Text Identification Using CNN and Training Dynamics [0.0]
We characterized the samples across three dimensions: confidence, variability, and correctness.
These dimensions reveal three regions: easy-to-learn, ambiguous, and hard-to-learn examples.
We found that training the model only on a subset of ambiguous examples improves the model's out-of-distribution generalization.
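The subset-selection step this summary describes amounts to ranking examples by variability and keeping the most ambiguous fraction. A hypothetical sketch (the function name and the cutoff fraction are illustrative choices, not values from the paper):

```python
import numpy as np

def select_ambiguous(variability, fraction=0.5):
    """Return indices of the most ambiguous training examples: those with
    the highest variability of the gold-label probability across epochs.
    `fraction` controls how much of the dataset is kept (an assumption
    here, not a value reported in the paper).
    """
    variability = np.asarray(variability)
    k = max(1, int(len(variability) * fraction))
    # argsort is ascending, so the last k indices are the most variable
    return np.argsort(variability)[-k:]

idx = select_ambiguous([0.03, 0.04, 0.19, 0.02, 0.21, 0.05], fraction=0.5)
# keeps the three highest-variability examples: indices 5, 2, 4
```

Training would then proceed on only the selected subset, which is the setup the summary reports as improving out-of-distribution generalization.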
arXiv Detail & Related papers (2024-05-18T07:37:17Z)
- Model Selection with Model Zoo via Graph Learning [45.30615308692713]
We introduce TransferGraph, a novel framework that reformulates model selection as a graph learning problem.
We demonstrate TransferGraph's effectiveness in capturing essential model-dataset relationships, yielding up to a 32% improvement in correlation between predicted performance and the actual fine-tuning results compared to the state-of-the-art methods.
arXiv Detail & Related papers (2024-04-05T09:50:00Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- CHALLENGER: Training with Attribution Maps [63.736435657236505]
We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance.
In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.
arXiv Detail & Related papers (2022-05-30T13:34:46Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- On the Composition and Limitations of Publicly Available COVID-19 X-Ray Imaging Datasets [0.0]
Data scarcity, mismatch between training and target population, group imbalance, and lack of documentation are important sources of bias.
This paper presents an overview of the currently publicly available COVID-19 chest X-ray datasets.
arXiv Detail & Related papers (2020-08-26T14:16:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.