What Neural Networks Memorize and Why: Discovering the Long Tail via
Influence Estimation
- URL: http://arxiv.org/abs/2008.03703v1
- Date: Sun, 9 Aug 2020 10:12:28 GMT
- Title: What Neural Networks Memorize and Why: Discovering the Long Tail via
Influence Estimation
- Authors: Vitaly Feldman and Chiyuan Zhang
- Abstract summary: Deep learning algorithms are well-known to have a propensity for fitting the training data very well.
Such fitting requires memorization of training data labels.
A recent work of Feldman (2019) proposes a theoretical explanation for this phenomenon based on a combination of two insights.
- Score: 37.5845376458136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning algorithms are well-known to have a propensity for fitting the
training data very well and often fit even outliers and mislabeled data points.
Such fitting requires memorization of training data labels, a phenomenon that
has attracted significant research interest but has not been given a compelling
explanation so far. A recent work of Feldman (2019) proposes a theoretical
explanation for this phenomenon based on a combination of two insights. First,
natural image and data distributions are (informally) known to be long-tailed,
that is, they have a significant fraction of rare and atypical examples. Second, in a
simple theoretical model such memorization is necessary for achieving
close-to-optimal generalization error when the data distribution is
long-tailed. However, no direct empirical evidence for this explanation or even
an approach for obtaining such evidence was given.
In this work we design experiments to test the key ideas in this theory. The
experiments require estimation of the influence of each training example on the
accuracy at each test example as well as memorization values of training
examples. Estimating these quantities directly is computationally prohibitive
but we show that closely-related subsampled influence and memorization values
can be estimated much more efficiently. Our experiments demonstrate the
significant benefits of memorization for generalization on several standard
benchmarks. They also provide quantitative and visually compelling evidence for
the theory put forth in (Feldman, 2019).
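The abstract's key quantities, the memorization value of a training example and its influence on the accuracy at each test example, admit a simple Monte Carlo estimator: train many models on random subsets of the training set, then compare how often a model is correct on an example when a given training point was included versus excluded. The sketch below illustrates this subsampled estimation idea; the `train_model` callable, the `model.predict` interface, and the default constants are assumptions made for illustration, not the authors' released code.
```python
import numpy as np

def subsampled_influence_and_memorization(
    train_model,        # assumed callable: array of train indices -> fitted model
    train_xy,           # (X_train, y_train)
    test_xy,            # (X_test, y_test)
    num_models=200,     # number of independently trained models (illustrative)
    subset_frac=0.7,    # fraction of the training set sampled per model (illustrative)
    seed=0,
):
    """Monte Carlo sketch of subsampled influence and memorization estimation.

    For training example i and test example j,
        infl(i, j) ~= P[correct on j | i in subset] - P[correct on j | i not in subset],
    and mem(i) is the same difference evaluated on (x_i, y_i) itself.
    """
    rng = np.random.default_rng(seed)
    (X_tr, y_tr), (X_te, y_te) = train_xy, test_xy
    n, m = len(y_tr), len(y_te)
    k = int(subset_frac * n)

    in_subset = np.zeros((num_models, n), dtype=bool)   # membership mask per run
    correct_tr = np.zeros((num_models, n), dtype=bool)  # per-run correctness on the train set
    correct_te = np.zeros((num_models, m), dtype=bool)  # per-run correctness on the test set

    for t in range(num_models):
        idx = rng.choice(n, size=k, replace=False)
        in_subset[t, idx] = True
        model = train_model(idx)                        # user-supplied training routine
        correct_tr[t] = model.predict(X_tr) == y_tr
        correct_te[t] = model.predict(X_te) == y_te

    # Counts of runs that did / did not include each training example.
    # (With enough runs every example lands on both sides; a sketch omits the check.)
    p_in = in_subset.astype(float)                      # (num_models, n)
    p_out = 1.0 - p_in
    n_in, n_out = p_in.sum(0), p_out.sum(0)             # (n,), (n,)

    # Influence of train example i on test example j: difference of conditional accuracies.
    acc_in_te = (p_in.T @ correct_te) / n_in[:, None]   # (n, m)
    acc_out_te = (p_out.T @ correct_te) / n_out[:, None]
    influence = acc_in_te - acc_out_te

    # Memorization of train example i: the same difference on the example's own label.
    acc_in_tr = (p_in * correct_tr).sum(0) / n_in
    acc_out_tr = (p_out * correct_tr).sum(0) / n_out
    memorization = acc_in_tr - acc_out_tr

    return influence, memorization
```
The efficiency gain the abstract refers to comes from the fact that every trained model contributes to the estimates for all training examples at once, so one pool of subsampled models is shared across all influence and memorization estimates rather than retraining once per example as leave-one-out influence would require.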
Related papers
- Why Fine-grained Labels in Pretraining Benefit Generalization? [12.171634061370616]
Recent studies show that pretraining a deep neural network with fine-grained labeled data, followed by fine-tuning on coarse-labeled data, often yields better generalization than pretraining with coarse-labeled data.
This paper addresses the lack of a rigorous explanation for this effect by introducing a "hierarchical multi-view" structure to confine the input data distribution.
Under this framework, we prove that: 1) coarse-grained pretraining only allows a neural network to learn the common features well, while 2) fine-grained pretraining helps the network learn the rare features in addition to the common ones, leading to improved accuracy on hard downstream test samples.
arXiv Detail & Related papers (2024-10-30T15:41:30Z) - Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
arXiv Detail & Related papers (2024-06-06T17:59:09Z) - Empirical Design in Reinforcement Learning [23.873958977534993]
It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience.
The scale of these experiments often conflicts with the need for proper statistical evidence, especially when comparing algorithms.
This manuscript represents both a call to action, and a comprehensive resource for how to do good experiments in reinforcement learning.
arXiv Detail & Related papers (2023-04-03T19:32:24Z) - Characterizing Datapoints via Second-Split Forgetting [93.99363547536392]
We propose the second-split forgetting time (SSFT), a complementary metric that tracks the epoch (if any) after which an original training example is forgotten.
We demonstrate that mislabeled examples are forgotten quickly, while seemingly rare examples are forgotten comparatively slowly.
SSFT can (i) help to identify mislabeled samples, the removal of which improves generalization; and (ii) provide insights about failure modes.
arXiv Detail & Related papers (2022-10-26T21:03:46Z) - An Empirical Study of Memorization in NLP [8.293936347234126]
We use three different NLP tasks to check if the long-tail theory holds.
Experiments demonstrate that top-ranked memorized training instances are likely atypical.
We develop an attribution method to better understand why a training instance is memorized.
arXiv Detail & Related papers (2022-03-23T03:27:56Z) - Impact of Pretraining Term Frequencies on Few-Shot Reasoning [51.990349528930125]
We investigate how well pretrained language models reason with terms that are less frequent in the pretraining data.
We measure the strength of the correlation between a term's frequency in the pretraining data and model performance for a number of GPT-based language models on various numerical deduction tasks.
Although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data.
arXiv Detail & Related papers (2022-02-15T05:43:54Z) - Understanding Memorization from the Perspective of Optimization via
Efficient Influence Estimation [54.899751055620904]
We study the phenomenon of memorization with turn-over dropout, an efficient method to estimate influence and memorization, for data with true labels (real data) and data with random labels (random data).
Our main findings are: (i) for both real data and random data, the network optimizes easy examples (e.g., real data) and difficult examples (e.g., random data) simultaneously, with the easy ones learned faster; (ii) for real data, a correct but difficult example in the training dataset is more informative than an easy one.
arXiv Detail & Related papers (2021-12-16T11:34:23Z) - Deep Learning Through the Lens of Example Difficulty [21.522182447513632]
We introduce a measure of the computational difficulty of making a prediction for a given input: the (effective) prediction depth.
Our investigation reveals surprising yet simple relationships between the prediction depth of a given input and the model's uncertainty, confidence, accuracy and speed of learning for that data point.
arXiv Detail & Related papers (2021-06-17T16:48:12Z) - A Theoretical Analysis of Learning with Noisily Labeled Data [62.946840431501855]
We first show that during the first epoch of training, the examples with clean labels are learned first.
We then show that, after this stage of learning from clean data, continuing to train the model can further reduce the testing error.
arXiv Detail & Related papers (2021-04-08T23:40:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.