What can we learn from Data Leakage and Unlearning for Law?
- URL: http://arxiv.org/abs/2307.10476v1
- Date: Wed, 19 Jul 2023 22:14:58 GMT
- Title: What can we learn from Data Leakage and Unlearning for Law?
- Authors: Jaydeep Borkar
- Abstract summary: Large Language Models (LLMs) pose a privacy concern because they memorize training data (including personally identifiable information (PII) like emails and phone numbers) and leak it during inference.
In order to comply with privacy laws such as the "right to be forgotten", the data points of users that are most vulnerable to extraction could be deleted.
We show that not only do fine-tuned models leak their training data but they also leak the pre-training data (and PII) memorized during the pre-training phase.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) pose a privacy concern because they memorize
training data (including personally identifiable information (PII) like emails
and phone numbers) and leak it during inference. A company can train an LLM on
its domain-customized data, which can potentially also include its users' PII.
In order to comply with privacy laws such as the "right to be forgotten", the
data points of users that are most vulnerable to extraction could be deleted.
We find that once the most vulnerable points are deleted, a new set of points
becomes vulnerable to extraction. So far, little attention has been given to
understanding memorization in fine-tuned models. In this work, we also show
that not only do fine-tuned models leak their training data, but they also leak
the pre-training data (and PII) memorized during the pre-training phase. The
property of new data points becoming vulnerable to extraction after unlearning,
and the leakage of pre-training data through fine-tuned models, can pose
significant privacy and legal concerns for companies that use LLMs to offer
services. We hope this work will start an interdisciplinary discussion within
the AI and law communities regarding the need for policies to tackle these issues.
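The cycle described in the abstract (delete the currently most-extractable points, then find that a new set of points becomes extractable) can be made concrete with a short sketch. The snippet below is a minimal illustration, not the paper's actual pipeline: `vulnerability_score` is a stand-in for a real extraction or memorization signal (e.g. per-example loss or exposure measured against the fine-tuned model), and the dataset and the choice of `k` are hypothetical.

```python
# Minimal sketch of the "new points become vulnerable after deletion" cycle.
# Assumption: vulnerability_score stands in for a real extraction/memorization
# signal queried from the fine-tuned model; here it is a deterministic toy score.

import random


def vulnerability_score(example: str) -> float:
    # Toy stand-in: a fixed pseudo-random score per example. A real score would
    # come from an extraction attack or a per-example memorization metric.
    rng = random.Random(example)
    return rng.random()


def most_vulnerable(dataset, k):
    # Rank the remaining data by score and return the current top-k.
    return sorted(dataset, key=vulnerability_score, reverse=True)[:k]


dataset = [f"user{i}@example.com" for i in range(100)]  # hypothetical PII records

for round_idx in range(3):
    top_k = most_vulnerable(dataset, k=10)
    print(f"round {round_idx}: deleting/unlearning {len(top_k)} most-vulnerable points")
    # Comply with a deletion request: drop (or unlearn) the current top-k.
    dataset = [x for x in dataset if x not in set(top_k)]
    # Re-scoring what is left yields a *new* top-k: vulnerability is relative to
    # the rest of the data, so the attack surface does not simply shrink to zero.
```

The point of the loop is that vulnerability is relative: removing today's top-k promotes the next-most-memorized points into the attack surface, which is the behaviour the abstract highlights.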
Related papers
- FT-PrivacyScore: Personalized Privacy Scoring Service for Machine Learning Participation [4.772368796656325]
In practice, controlled data access remains a mainstream method for protecting data privacy in many industrial and research environments.
We developed the demo prototype FT-PrivacyScore to show that it's possible to efficiently and quantitatively estimate the privacy risk of participating in a model fine-tuning task.
arXiv Detail & Related papers (2024-10-30T02:41:26Z)
- Federated Learning Privacy: Attacks, Defenses, Applications, and Policy Landscape - A Survey [27.859861825159342]
Deep learning has shown incredible potential across a vast array of tasks.
Recent concerns on privacy have further highlighted challenges for accessing such data.
Federated learning has emerged as an important privacy-preserving technology.
arXiv Detail & Related papers (2024-05-06T16:55:20Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- Second-Order Information Matters: Revisiting Machine Unlearning for Large Language Models [1.443696537295348]
Privacy leakage and copyright violation in LLMs remain underexplored.
Our unlearning algorithms are not only data-agnostic and model-agnostic but are also proven to be robust in terms of utility preservation or privacy guarantees.
arXiv Detail & Related papers (2024-03-13T18:57:30Z)
- TOFU: A Task of Fictitious Unlearning for LLMs [99.92305790945507]
Large language models trained on massive corpora of data from the web can reproduce sensitive or private data, raising both legal and ethical concerns.
Unlearning, or tuning models to forget information present in their training data, provides us with a way to protect private data after training.
We present TOFU, a benchmark aimed at helping deepen our understanding of unlearning.
arXiv Detail & Related papers (2024-01-11T18:57:12Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- Knowledge Unlearning for Mitigating Privacy Risks in Language Models [31.322818016245087]
We propose knowledge unlearning as an alternative method to reduce privacy risks for language models.
We show that simply applying the unlikelihood training objective to target token sequences is effective at forgetting them (a minimal sketch of this idea appears after the related-papers list below).
We show that unlearning can give a stronger empirical privacy guarantee in scenarios where the data vulnerable to extraction attacks are known a priori.
arXiv Detail & Related papers (2022-10-04T10:18:11Z)
- Certified Data Removal in Sum-Product Networks [78.27542864367821]
Deleting the collected data is often insufficient to guarantee data privacy.
UnlearnSPN is an algorithm that removes the influence of single data points from a trained sum-product network.
arXiv Detail & Related papers (2022-10-04T08:22:37Z)
- Do Gradient Inversion Attacks Make Federated Learning Unsafe? [70.0231254112197]
Federated learning (FL) allows the collaborative training of AI models without needing to share raw data.
Recent works on the inversion of deep neural networks from model gradients raised concerns about the security of FL in preventing the leakage of training data.
In this work, we show that these attacks presented in the literature are impractical in real FL use-cases and provide a new baseline attack.
arXiv Detail & Related papers (2022-02-14T18:33:12Z)
- Survey: Leakage and Privacy at Inference Time [59.957056214792665]
Leakage of data from publicly available Machine Learning (ML) models is an area of growing significance.
We focus on inference-time leakage, as the most likely scenario for publicly available models.
We propose a taxonomy covering involuntary and malevolent leakage and the available defences, followed by the currently available assessment metrics and applications.
arXiv Detail & Related papers (2021-07-04T12:59:16Z)
- Security and Privacy Preserving Deep Learning [2.322461721824713]
Massive data collection required for deep learning presents obvious privacy issues.
Users' personal, highly sensitive data such as photos and voice recordings are kept indefinitely by the companies that collect them.
Deep neural networks are susceptible to various inference attacks as they remember information about their training data.
arXiv Detail & Related papers (2020-06-23T01:53:46Z)
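As referenced in the Knowledge Unlearning entry above, here is a minimal sketch of unlearning by negating the language-modeling objective on the sequences to be forgotten. It assumes Hugging Face `transformers` and `torch` are available; the model name ("gpt2"), the forget set, the learning rate, and the number of steps are placeholders rather than the settings used in any of the papers listed here.

```python
# Minimal sketch: "forget" target sequences by running gradient ascent on the
# standard causal-LM loss (i.e. maximize their negative log-likelihood).
# Assumptions: transformers + torch installed; "gpt2" and all hyperparameters
# are placeholders, not the settings from the cited paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

forget_sequences = ["Jane Doe's phone number is 555-0100."]  # hypothetical PII

model.train()
for step in range(10):  # in practice, a stopping criterion (e.g. extraction risk) is needed
    for text in forget_sequences:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = -outputs.loss  # negate the LM loss -> gradient ascent on the forget set
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

After the ascent steps, one would re-run an extraction or membership-inference evaluation on both the forget set and held-out data to check that forgetting succeeded without destroying model utility.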