When is Memorization of Irrelevant Training Data Necessary for
High-Accuracy Learning?
- URL: http://arxiv.org/abs/2012.06421v1
- Date: Fri, 11 Dec 2020 15:25:14 GMT
- Title: When is Memorization of Irrelevant Training Data Necessary for
High-Accuracy Learning?
- Authors: Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, Kunal Talwar
- Abstract summary: We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples.
Our results do not depend on the training algorithm or the class of models used for learning.
- Score: 53.523017945443115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern machine learning models are complex and frequently encode surprising
amounts of information about individual inputs. In extreme cases, complex
models appear to memorize entire input examples, including seemingly irrelevant
information (social security numbers from text, for example). In this paper, we
aim to understand whether this sort of memorization is necessary for accurate
learning. We describe natural prediction problems in which every sufficiently
accurate training algorithm must encode, in the prediction model, essentially
all the information about a large subset of its training examples. This remains
true even when the examples are high-dimensional and have entropy much higher
than the sample size, and even when most of that information is ultimately
irrelevant to the task at hand. Further, our results do not depend on the
training algorithm or the class of models used for learning.
Our problems are simple and fairly natural variants of the next-symbol
prediction and the cluster labeling tasks. These tasks can be seen as
abstractions of image- and text-related prediction problems. To establish our
results, we reduce from a family of one-way communication problems for which we
prove new information complexity lower bounds.
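As a purely illustrative aside (not the paper's construction or its lower-bound argument), the hypothetical Python sketch below shows one way the cluster-labeling setting can be made concrete: a 1-nearest-neighbor learner, whose "model" is literally the training set, achieves high accuracy while encoding every high-entropy, task-irrelevant bit of its training examples. All names, sizes, and parameters here are assumptions invented for the toy.
```python
# Toy illustration only: a memorizing learner (1-NN) on a cluster-labeling task.
# This is NOT the paper's construction; it just makes the phenomenon concrete.
import numpy as np

rng = np.random.default_rng(0)
n_clusters, n_relevant, n_irrelevant = 5, 10, 190

# Cluster identity is determined by the first n_relevant coordinates; the
# remaining coordinates are i.i.d. random bits whose entropy far exceeds what
# the label requires (they are irrelevant to the task).
centers = rng.normal(size=(n_clusters, n_relevant))

def sample(label):
    relevant = centers[label] + 0.1 * rng.normal(size=n_relevant)
    irrelevant = rng.integers(0, 2, size=n_irrelevant).astype(float)
    return np.concatenate([relevant, irrelevant])

train_y = np.repeat(np.arange(n_clusters), 5)   # 5 training examples per cluster
train_X = np.stack([sample(y) for y in train_y])

# "Training": the model stores the examples verbatim, as 1-NN implementations do.
model_X, model_y = train_X.copy(), train_y.copy()

def predict(x):
    # Distance is computed on the relevant coordinates, but the stored rows
    # keep everything, irrelevant bits included.
    d = ((model_X[:, :n_relevant] - x[:n_relevant]) ** 2).sum(axis=1)
    return model_y[np.argmin(d)]

test_y = rng.integers(0, n_clusters, size=500)
accuracy = np.mean([predict(sample(y)) == y for y in test_y])
print(f"test accuracy: {accuracy:.2f}")

# Every irrelevant bit of every training example can be read back out of the
# trained model exactly.
print("irrelevant training bits recoverable from the model:",
      np.array_equal(model_X[:, n_relevant:], train_X[:, n_relevant:]))
```
The paper's point is far stronger than this toy: it proves that for its tasks no sufficiently accurate training algorithm, whatever model class it uses, can avoid this kind of encoding.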
Related papers
- Robust Machine Learning by Transforming and Augmenting Imperfect Training Data [6.928276018602774]
This thesis explores several data sensitivities of modern machine learning.
We first discuss how to prevent ML from codifying prior human discrimination measured in the training data.
We then discuss the problem of learning from data containing spurious features, which provide predictive fidelity during training but are unreliable upon deployment.
arXiv Detail & Related papers (2023-12-19T20:49:28Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Ticketed Learning-Unlearning Schemes [57.89421552780526]
We propose a new ticketed model for learning-unlearning.
We provide space-efficient ticketed learning-unlearning schemes for a broad family of concept classes.
arXiv Detail & Related papers (2023-06-27T18:54:40Z)
- The No Free Lunch Theorem, Kolmogorov Complexity, and the Role of Inductive Biases in Machine Learning [80.1018596899899]
We argue that neural network models share this same preference for low-complexity solutions, formalized using Kolmogorov complexity.
Our experiments show that pre-trained and even randomly initialized language models prefer to generate low-complexity sequences.
These observations justify the trend in deep learning of unifying seemingly disparate problems with an increasingly small set of machine learning models.
arXiv Detail & Related papers (2023-04-11T17:22:22Z)
- On Inductive Biases for Machine Learning in Data Constrained Settings [0.0]
This thesis explores a different answer to the problem of learning expressive models in data constrained settings.
Instead of relying on big datasets to train neural networks, we replace some of their modules with known functions that reflect the structure of the data.
Our approach falls under the umbrella of "inductive biases", which can be defined as hypotheses about the data at hand that restrict the space of models to explore.
arXiv Detail & Related papers (2023-02-21T14:22:01Z)
- Small Language Models for Tabular Data [0.0]
We show the ability of deep representation learning to address problems of classification and regression from small and poorly formed datasets.
We find that small models have sufficient capacity to approximate various functions and achieve record accuracy on classification benchmarks.
arXiv Detail & Related papers (2022-11-05T16:57:55Z)
- A Survey of Learning on Small Data: Generalization, Optimization, and Challenge [101.27154181792567]
Learning on small data that approximates the generalization ability of learning on big data is one of the ultimate purposes of AI.
This survey follows the active sampling theory under a PAC framework to analyze the generalization error and label complexity of learning on small data.
Multiple data applications that may benefit from efficient small data representation are surveyed.
arXiv Detail & Related papers (2022-07-29T02:34:19Z)
- Learning from Few Examples: A Summary of Approaches to Few-Shot Learning [3.6930948691311016]
Few-Shot Learning refers to the problem of learning the underlying pattern in the data just from a few training samples.
Deep learning solutions suffer from data hunger and require extensive computation time and resources.
Few-shot learning emerges as a low-cost solution that could drastically reduce the turnaround time of building machine learning applications.
arXiv Detail & Related papers (2022-03-07T23:15:21Z)
- Quantifying Memorization Across Neural Language Models [61.58529162310382]
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized data verbatim.
This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others).
We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data.
arXiv Detail & Related papers (2022-02-15T18:48:31Z)
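As a purely illustrative footnote to the last entry above, the hypothetical sketch below walks through the kind of extraction check used to quantify verbatim memorization: prompt with a prefix taken from the training data and test whether the model emits the training continuation exactly. The paper studies large neural language models and how emission scales with model size, duplication, and prompt length; here a greedy character-level n-gram "model" stands in only to make the check itself concrete, and all strings and parameters are invented for the toy.
```python
# Toy sketch of a verbatim-emission check, not the paper's models or data.
from collections import Counter, defaultdict
import random

ORDER = 8  # context length of the toy model (an assumption, not from the paper)

def train(text):
    """Map every length-ORDER context in `text` to a counter of next characters."""
    table = defaultdict(Counter)
    for i in range(len(text) - ORDER):
        table[text[i:i + ORDER]][text[i + ORDER]] += 1
    return table

def generate(table, prompt, length):
    """Greedy decoding: always emit the most frequent next character."""
    out = prompt
    for _ in range(length):
        ctx = out[-ORDER:]
        if ctx not in table:
            break
        out += table[ctx].most_common(1)[0][0]
    return out[len(prompt):]

random.seed(0)
secret = "the access code is 8471-2236"            # hypothetical memorized string
filler = "".join(random.choice("abcdefgh ") for _ in range(20000))
corpus = filler + " " + secret + " " + filler       # training text containing the secret

model = train(corpus)
prompt, target = secret[:12], secret[12:]
emitted = generate(model, prompt, len(target))
print("prompt:  ", prompt)
print("emitted: ", emitted)
print("verbatim:", emitted == target)
```
In the paper this measurement is repeated across model scales, duplication counts, and context lengths to fit the reported log-linear relationships; the toy above only demonstrates the per-example extraction test.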
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.