Characterizing Datapoints via Second-Split Forgetting
- URL: http://arxiv.org/abs/2210.15031v1
- Date: Wed, 26 Oct 2022 21:03:46 GMT
- Title: Characterizing Datapoints via Second-Split Forgetting
- Authors: Pratyush Maini, Saurabh Garg, Zachary C. Lipton, J. Zico Kolter
- Abstract summary: We propose second-split forgetting time (SSFT), a complementary metric that tracks the epoch (if any) after which an original training example is forgotten.
We demonstrate that mislabeled examples are forgotten quickly, and seemingly rare examples are forgotten comparatively slowly.
SSFT can (i) help to identify mislabeled samples, the removal of which improves generalization; and (ii) provide insights about failure modes.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Researchers investigating example hardness have increasingly focused on the
dynamics by which neural networks learn and forget examples throughout
training. Popular metrics derived from these dynamics include (i) the epoch at
which examples are first correctly classified; (ii) the number of times their
predictions flip during training; and (iii) whether their prediction flips if
they are held out. However, these metrics do not distinguish among examples
that are hard for distinct reasons, such as membership in a rare subpopulation,
being mislabeled, or belonging to a complex subpopulation. In this paper, we
propose second-split forgetting time (SSFT), a complementary metric
that tracks the epoch (if any) after which an original training example is
forgotten as the network is fine-tuned on a randomly held out partition of the
data. Across multiple benchmark datasets and modalities, we demonstrate that
mislabeled examples are forgotten quickly, and seemingly rare examples are
forgotten comparatively slowly. By contrast, metrics only considering the first
split learning dynamics struggle to differentiate the two. At large learning
rates, SSFT tends to be robust across architectures, optimizers, and random
seeds. From a practical standpoint, the SSFT can (i) help to identify
mislabeled samples, the removal of which improves generalization; and (ii)
provide insights about failure modes. Through theoretical analysis addressing
overparameterized linear models, we provide insights into how the observed
phenomena may arise. Code for reproducing our experiments can be found here:
https://github.com/pratyushmaini/ssft
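
For readers who want a concrete picture of the metric, below is a minimal Python/PyTorch sketch of the second-split procedure as described in the abstract: starting from a model already fitted to the first split, fine-tune on the held-out second split and record, for each first-split example, the first epoch at which it becomes misclassified. All names (`first_forgetting_epochs`, `first_split`, `second_split`, etc.) are illustrative placeholders rather than the authors' API, and the exact definition of SSFT (for instance, whether an example must remain misclassified thereafter) should be taken from the paper and the repository linked above.

```python
import torch
from torch.utils.data import DataLoader


def first_forgetting_epochs(model, first_split, second_split, optimizer, loss_fn,
                            n_epochs=50, device="cpu"):
    """Sketch of a second-split forgetting measurement.

    Assumes `model` has already been trained to fit `first_split`.
    While fine-tuning on `second_split`, record the first epoch at which each
    first-split example is misclassified ("forgotten"). Examples that are
    never forgotten keep the sentinel value -1. This is a simplified variant;
    the paper's exact definition of SSFT may differ.
    """
    forgetting_epoch = {i: -1 for i in range(len(first_split))}
    second_loader = DataLoader(second_split, batch_size=128, shuffle=True)
    eval_loader = DataLoader(first_split, batch_size=256, shuffle=False)

    for epoch in range(1, n_epochs + 1):
        # One epoch of fine-tuning on the second split.
        model.train()
        for x, y in second_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

        # Re-evaluate every first-split example after this epoch.
        model.eval()
        idx = 0
        with torch.no_grad():
            for x, y in eval_loader:
                preds = model(x.to(device)).argmax(dim=1).cpu()
                for correct in (preds == y).tolist():
                    if not correct and forgetting_epoch[idx] == -1:
                        forgetting_epoch[idx] = epoch
                    idx += 1
    return forgetting_epoch
```

Under the paper's findings, examples assigned small values by such a procedure are candidates for being mislabeled, and removing them before retraining corresponds to the filtering use case mentioned in the abstract.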
Related papers
- Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning (arXiv, 2024-01-12)
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate the problem, a line of methods proposes to replay data from previously experienced tasks when learning new ones.
However, storing such data is often impractical because of memory constraints or data-privacy concerns.
As a replacement, data-free replay methods have been proposed that invert samples from the classification model.
- Late Stopping: Avoiding Confidently Learning from Mislabeled Examples (arXiv, 2023-08-26)
We propose a new framework, Late Stopping, which leverages the intrinsic robust learning ability of DNNs through a prolonged training process.
We empirically observe that mislabeled and clean examples differ in the number of epochs required for them to be consistently and correctly classified.
Experimental results on benchmark-simulated and real-world noisy datasets demonstrate that the proposed method outperforms state-of-the-art counterparts.
- Toward Understanding Generative Data Augmentation (arXiv, 2023-05-27)
We show that generative data augmentation can enjoy a faster learning rate when the order of the divergence term is $o\left(\max\left(\log(m)\,\beta_m,\ 1/\sqrt{m}\right)\right)$.
We prove that in both cases, although generative data augmentation does not enjoy a faster learning rate, it can improve the learning guarantees at a constant level when the training set is small.
- Revisiting Discriminative vs. Generative Classifiers: Theory and Implications (arXiv, 2023-02-05)
This paper is inspired by the statistical efficiency of naive Bayes.
We present a multiclass $\mathcal{H}$-consistency bound framework and an explicit bound for the logistic loss.
Experiments on various pre-trained deep vision models show that naive Bayes consistently converges faster as the amount of data increases.
- Dash: Semi-Supervised Learning with Dynamic Thresholding (arXiv, 2021-09-01)
We propose a semi-supervised learning (SSL) approach that uses unlabeled examples to train models.
Our proposed approach, Dash, is adaptive in how it selects unlabeled data.
- When does loss-based prioritization fail? (arXiv, 2021-07-16)
We show that loss-based acceleration methods degrade in scenarios with noisy and corrupted data.
Measures of example difficulty need to correctly separate noise from other types of challenging examples.
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision (arXiv, 2020-04-20)
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task.
Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
- Instance Credibility Inference for Few-Shot Learning (arXiv, 2020-03-26)
Few-shot learning aims to recognize new objects with extremely limited training data for each category.
This paper presents a simple statistical approach, dubbed Instance Credibility Inference (ICI), to exploit the distribution support of unlabeled instances for few-shot learning.
Our simple approach establishes new state-of-the-art results on four widely used few-shot learning benchmark datasets.
- Robust and On-the-fly Dataset Denoising for Image Classification (arXiv, 2020-03-24)
On-the-fly Data Denoising (ODD) is robust to mislabeled examples while introducing almost zero computational overhead compared to standard training.
ODD achieves state-of-the-art results on a wide range of datasets, including real-world ones such as WebVision and Clothing1M.
This list is automatically generated from the titles and abstracts of the papers on this site.