When is Memorization of Irrelevant Training Data Necessary for
High-Accuracy Learning?
- URL: http://arxiv.org/abs/2012.06421v1
- Date: Fri, 11 Dec 2020 15:25:14 GMT
- Title: When is Memorization of Irrelevant Training Data Necessary for
High-Accuracy Learning?
- Authors: Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, Kunal Talwar
- Abstract summary: We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples.
Our results do not depend on the training algorithm or the class of models used for learning.
- Score: 53.523017945443115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern machine learning models are complex and frequently encode surprising
amounts of information about individual inputs. In extreme cases, complex
models appear to memorize entire input examples, including seemingly irrelevant
information (social security numbers from text, for example). In this paper, we
aim to understand whether this sort of memorization is necessary for accurate
learning. We describe natural prediction problems in which every sufficiently
accurate training algorithm must encode, in the prediction model, essentially
all the information about a large subset of its training examples. This remains
true even when the examples are high-dimensional and have entropy much higher
than the sample size, and even when most of that information is ultimately
irrelevant to the task at hand. Further, our results do not depend on the
training algorithm or the class of models used for learning.
Our problems are simple and fairly natural variants of the next-symbol
prediction and the cluster labeling tasks. These tasks can be seen as
abstractions of image- and text-related prediction problems. To establish our
results, we reduce from a family of one-way communication problems for which we
prove new information complexity lower bounds.
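As a purely illustrative aside (not the paper's construction or its lower-bound argument), the hypothetical Python sketch below shows one way the cluster-labeling setting can be made concrete: a 1-nearest-neighbor learner, whose "model" is literally the training set, achieves high accuracy while encoding every high-entropy, task-irrelevant bit of its training examples. All names, sizes, and parameters here are assumptions invented for the toy.
```python
# Toy illustration only: a memorizing learner (1-NN) on a cluster-labeling task.
# This is NOT the paper's construction; it just makes the phenomenon concrete.
import numpy as np

rng = np.random.default_rng(0)
n_clusters, n_relevant, n_irrelevant = 5, 10, 190

# Cluster identity is determined by the first n_relevant coordinates; the
# remaining coordinates are i.i.d. random bits whose entropy far exceeds what
# the label requires (they are irrelevant to the task).
centers = rng.normal(size=(n_clusters, n_relevant))

def sample(label):
    relevant = centers[label] + 0.1 * rng.normal(size=n_relevant)
    irrelevant = rng.integers(0, 2, size=n_irrelevant).astype(float)
    return np.concatenate([relevant, irrelevant])

train_y = np.repeat(np.arange(n_clusters), 5)   # 5 training examples per cluster
train_X = np.stack([sample(y) for y in train_y])

# "Training": the model stores the examples verbatim, as 1-NN implementations do.
model_X, model_y = train_X.copy(), train_y.copy()

def predict(x):
    # Distance is computed on the relevant coordinates, but the stored rows
    # keep everything, irrelevant bits included.
    d = ((model_X[:, :n_relevant] - x[:n_relevant]) ** 2).sum(axis=1)
    return model_y[np.argmin(d)]

test_y = rng.integers(0, n_clusters, size=500)
accuracy = np.mean([predict(sample(y)) == y for y in test_y])
print(f"test accuracy: {accuracy:.2f}")

# Every irrelevant bit of every training example can be read back out of the
# trained model exactly.
print("irrelevant training bits recoverable from the model:",
      np.array_equal(model_X[:, n_relevant:], train_X[:, n_relevant:]))
```
The paper's point is far stronger than this toy: it proves that for its tasks no sufficiently accurate training algorithm, whatever model class it uses, can avoid this kind of encoding.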
Related papers
- Robust Machine Learning by Transforming and Augmenting Imperfect Training Data [6.928276018602774]
This thesis explores several data sensitivities of modern machine learning.
We first discuss how to prevent ML from codifying prior human discrimination measured in the training data.
We then discuss the problem of learning from data containing spurious features, which provide predictive fidelity during training but are unreliable upon deployment.
arXiv Detail & Related papers (2023-12-19T20:49:28Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Ticketed Learning-Unlearning Schemes [57.89421552780526]
We propose a new ticketed model for learning-unlearning.
We provide space-efficient ticketed learning-unlearning schemes for a broad family of concept classes.
arXiv Detail & Related papers (2023-06-27T18:54:40Z)
- The No Free Lunch Theorem, Kolmogorov Complexity, and the Role of Inductive Biases in Machine Learning [80.1018596899899]
We argue that neural network models share this same preference for low-complexity solutions, formalized using Kolmogorov complexity.
Our experiments show that pre-trained and even randomly initialized language models prefer to generate low-complexity sequences.
These observations justify the trend in deep learning of unifying seemingly disparate problems with an increasingly small set of machine learning models.
arXiv Detail & Related papers (2023-04-11T17:22:22Z)
- On Inductive Biases for Machine Learning in Data Constrained Settings [0.0]
This thesis explores a different answer to the problem of learning expressive models in data constrained settings.
Instead of relying on big datasets to train neural networks, we replace some of their modules with known functions that reflect the structure of the data.
Our approach falls under the umbrella of "inductive biases", which can be defined as hypotheses about the data at hand that restrict the space of models to explore.
arXiv Detail & Related papers (2023-02-21T14:22:01Z)
- Small Language Models for Tabular Data [0.0]
We show the ability of deep representation learning to address problems of classification and regression from small and poorly formed datasets.
We find that small models have sufficient capacity to approximate various functions and achieve record accuracy on classification benchmarks.
arXiv Detail & Related papers (2022-11-05T16:57:55Z)
- A Survey of Learning on Small Data: Generalization, Optimization, and Challenge [101.27154181792567]
Learning on small data that approximates the generalization ability of learning on big data is one of the ultimate purposes of AI.
This survey follows the active sampling theory under a PAC framework to analyze the generalization error and label complexity of learning on small data.
Multiple data applications that may benefit from efficient small data representation are surveyed.
arXiv Detail & Related papers (2022-07-29T02:34:19Z)
- Learning from Few Examples: A Summary of Approaches to Few-Shot Learning [3.6930948691311016]
Few-Shot Learning refers to the problem of learning the underlying pattern in the data just from a few training samples.
Deep learning solutions suffer from data hunger and require extensive computation time and resources.
Few-shot learning emerges as a low-cost solution that could drastically reduce the turnaround time of building machine learning applications.
arXiv Detail & Related papers (2022-03-07T23:15:21Z)
- Quantifying Memorization Across Neural Language Models [61.58529162310382]
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized data verbatim.
This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others).
We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data.
arXiv Detail & Related papers (2022-02-15T18:48:31Z)
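As a purely illustrative footnote to the last entry above, the hypothetical sketch below walks through the kind of extraction check used to quantify verbatim memorization: prompt with a prefix taken from the training data and test whether the model emits the training continuation exactly. The paper studies large neural language models and how emission scales with model size, duplication, and prompt length; here a greedy character-level n-gram "model" stands in only to make the check itself concrete, and all strings and parameters are invented for the toy.
```python
# Toy sketch of a verbatim-emission check, not the paper's models or data.
from collections import Counter, defaultdict
import random

ORDER = 8  # context length of the toy model (an assumption, not from the paper)

def train(text):
    """Map every length-ORDER context in `text` to a counter of next characters."""
    table = defaultdict(Counter)
    for i in range(len(text) - ORDER):
        table[text[i:i + ORDER]][text[i + ORDER]] += 1
    return table

def generate(table, prompt, length):
    """Greedy decoding: always emit the most frequent next character."""
    out = prompt
    for _ in range(length):
        ctx = out[-ORDER:]
        if ctx not in table:
            break
        out += table[ctx].most_common(1)[0][0]
    return out[len(prompt):]

random.seed(0)
secret = "the access code is 8471-2236"            # hypothetical memorized string
filler = "".join(random.choice("abcdefgh ") for _ in range(20000))
corpus = filler + " " + secret + " " + filler       # training text containing the secret

model = train(corpus)
prompt, target = secret[:12], secret[12:]
emitted = generate(model, prompt, len(target))
print("prompt:  ", prompt)
print("emitted: ", emitted)
print("verbatim:", emitted == target)
```
In the paper this measurement is repeated across model scales, duplication counts, and context lengths to fit the reported log-linear relationships; the toy above only demonstrates the per-example extraction test.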
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.