SoK: Memorisation in machine learning
- URL: http://arxiv.org/abs/2311.03075v1
- Date: Mon, 6 Nov 2023 12:59:18 GMT
- Title: SoK: Memorisation in machine learning
- Authors: Dmitrii Usynin, Moritz Knolle, Georgios Kaissis
- Abstract summary: Quantifying the impact of individual data samples on machine learning models is an open research problem.
In this work we unify a broad range of previous definitions and perspectives on memorisation in ML.
We discuss their interplay with model generalisation and the implications of these phenomena for data privacy.
- Score: 5.563171090433323
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantifying the impact of individual data samples on machine learning models
is an open research problem. This is particularly relevant when complex and
high-dimensional relationships have to be learned from a limited sample of the
data generating distribution, such as in deep learning. It was previously shown
that, in these cases, models rely not only on extracting patterns which are
helpful for generalisation, but also seem to be required to incorporate some of
the training data more or less as is, in a process often termed memorisation.
This raises the question: if some memorisation is a requirement for effective
learning, what are its privacy implications? In this work we unify a broad
range of previous definitions and perspectives on memorisation in ML, discuss
their interplay with model generalisation and the implications of these
phenomena for data privacy. Moreover, we systematise methods that allow
practitioners to detect or quantify memorisation, and contextualise our
findings in a broad range of ML settings. Finally,
we discuss memorisation in the context of privacy attacks, differential privacy
(DP) and adversarial actors.
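Per-sample memorisation is commonly formalised counterfactually: how much more likely is a model to fit example i when i is included in the training set than when it is left out. The following is a minimal sketch of such a leave-one-out estimate, given purely as an illustration of the general idea rather than as the paper's own method; the toy classifier, the subsampling scheme, the trial count and the memorisation_score helper are all assumptions made for this example.

    """Hedged sketch: Monte-Carlo leave-one-out memorisation score on a toy classifier."""
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression


    def memorisation_score(X, y, idx, n_trials=20, subsample=0.7, seed=0):
        """Estimate how much including example `idx` in training raises the
        probability that the trained model predicts its label correctly."""
        rng = np.random.default_rng(seed)
        n = len(y)
        hit_with, hit_without = [], []
        for _ in range(n_trials):
            # Random subset of the *other* examples (subsample rate is an assumption).
            others = rng.choice(np.delete(np.arange(n), idx),
                                size=int(subsample * n), replace=False)
            # Model trained WITHOUT example idx.
            m0 = LogisticRegression(max_iter=1000).fit(X[others], y[others])
            hit_without.append(m0.predict(X[idx:idx + 1])[0] == y[idx])
            # Model trained WITH example idx added to the same subset.
            with_idx = np.append(others, idx)
            m1 = LogisticRegression(max_iter=1000).fit(X[with_idx], y[with_idx])
            hit_with.append(m1.predict(X[idx:idx + 1])[0] == y[idx])
        return np.mean(hit_with) - np.mean(hit_without)


    if __name__ == "__main__":
        X, y = make_classification(n_samples=200, n_features=10, random_state=0)
        print(f"memorisation score of example 0: {memorisation_score(X, y, 0):.2f}")

Scores near one mark examples the model only fits when they are present in training, i.e. examples that are memorised in the sense discussed above; scores near zero mark examples covered by generalisable patterns.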
Related papers
- A Geometric Framework for Understanding Memorization in Generative Models [11.263296715798374]
Recent work has shown that deep generative models can memorize and reproduce training datapoints when deployed.
These findings call into question the usability of generative models, especially in light of the legal and privacy risks brought about by memorization.
We propose the manifold memorization hypothesis (MMH), a geometric framework that builds on the manifold hypothesis to provide a clear language for reasoning about memorization.
arXiv Detail & Related papers (2024-10-31T18:09:01Z) - Extracting Training Data from Document-Based VQA Models [67.1470112451617]
Vision-Language Models (VLMs) have made remarkable progress in document-based Visual Question Answering (i.e., responding to queries about the contents of an input document provided as an image).
We show these models can memorise responses for training samples and regurgitate them even when the relevant visual information has been removed.
This includes Personally Identifiable Information (PII) repeated once in the training set, indicating that these models could divulge sensitive information and therefore pose a privacy risk.
arXiv Detail & Related papers (2024-07-11T17:44:41Z) - Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
arXiv Detail & Related papers (2024-06-06T17:59:09Z) - Assessing Privacy Risks in Language Models: A Case Study on Summarization Tasks [65.21536453075275]
We focus on the summarization task and investigate the membership inference (MI) attack.
We exploit text similarity and the model's resistance to document modifications as potential MI signals.
We discuss several safeguards for training summarization models against MI attacks and examine the inherent trade-off between privacy and utility; a minimal loss-threshold MI sketch follows the related-papers list below.
arXiv Detail & Related papers (2023-10-20T05:44:39Z) - Exploring Memorization in Fine-tuned Language Models [53.52403444655213]
We conduct the first comprehensive analysis to explore language models' memorization during fine-tuning across tasks.
Our studies with open-sourced and our own fine-tuned LMs across various tasks indicate that the degree of memorization varies strongly across fine-tuning tasks.
We provide an intuitive explanation of this task disparity via sparse coding theory and unveil a strong correlation between memorization and attention score distribution.
arXiv Detail & Related papers (2023-10-10T15:41:26Z) - On Inductive Biases for Machine Learning in Data Constrained Settings [0.0]
This thesis explores a different answer to the problem of learning expressive models in data-constrained settings.
Instead of relying on large datasets to train neural networks, we replace some modules with known functions that reflect the structure of the data.
Our approach falls under the umbrella of "inductive biases", which can be defined as hypotheses about the data at hand that restrict the space of models to explore.
arXiv Detail & Related papers (2023-02-21T14:22:01Z) - On the Privacy Effect of Data Enhancement via the Lens of Memorization [20.63044895680223]
We propose to investigate privacy from a new perspective called memorization.
Through the lens of memorization, we find that previously deployed membership inference attacks (MIAs) produce misleading results, as they are less likely to identify samples with higher privacy risks.
We demonstrate that the generalization gap and privacy leakage are less correlated than previous results suggest.
arXiv Detail & Related papers (2022-08-17T13:02:17Z) - Towards Differential Relational Privacy and its use in Question Answering [109.4452196071872]
Memorization of relations between entities in a dataset can lead to privacy issues when using a trained question answering model.
We quantify this phenomenon and provide a possible definition of Differential Privacy (DPRP).
We illustrate these concepts in experiments with large-scale models for Question Answering.
arXiv Detail & Related papers (2022-03-30T22:59:24Z) - Quantifying and Mitigating Privacy Risks of Contrastive Learning [4.909548818641602]
We perform the first privacy analysis of contrastive learning through the lens of membership inference and attribute inference.
Our results show that contrastive models are less vulnerable to membership inference attacks but more vulnerable to attribute inference attacks compared to supervised models.
To remedy this situation, we propose the first privacy-preserving contrastive learning mechanism, namely Talos.
arXiv Detail & Related papers (2021-02-08T11:38:11Z) - When is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning? [53.523017945443115]
We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples.
Our results do not depend on the training algorithm or the class of models used for learning.
arXiv Detail & Related papers (2020-12-11T15:25:14Z)
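Several of the entries above measure memorisation-driven leakage with membership inference. As a minimal illustration of that idea, and not a reproduction of any specific paper's attack, the sketch below scores membership by thresholding per-example loss, since training members typically incur lower loss than held-out points; the model, the synthetic data and the per_example_loss helper are assumptions made for this example.

    """Hedged sketch: a loss-threshold membership-inference signal on a toy model."""
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split


    def per_example_loss(model, X, y):
        """Negative log-likelihood of the true label under the model."""
        probs = model.predict_proba(X)
        return -np.log(np.clip(probs[np.arange(len(y)), y], 1e-12, 1.0))


    if __name__ == "__main__":
        X, y = make_classification(n_samples=400, n_features=20, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

        # Membership score: lower loss -> more likely to be a training member.
        members = -per_example_loss(model, X_tr, y_tr)
        nonmembers = -per_example_loss(model, X_te, y_te)
        scores = np.concatenate([members, nonmembers])
        labels = np.concatenate([np.ones_like(members), np.zeros_like(nonmembers)])
        print(f"membership-inference AUC: {roc_auc_score(labels, scores):.2f}")

An AUC near 0.5 means the loss signal barely separates members from non-members; values well above 0.5 indicate the kind of leakage the attacks above exploit.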