Machine Learners Should Acknowledge the Legal Implications of Large Language Models as Personal Data
- URL: http://arxiv.org/abs/2503.01630v1
- Date: Mon, 03 Mar 2025 15:05:48 GMT
- Title: Machine Learners Should Acknowledge the Legal Implications of Large Language Models as Personal Data
- Authors: Henrik Nolte, Michèle Finck, Kristof Meding
- Abstract summary: We show that some machine learning models can thus be considered personal data on their own. This triggers a cascade of data protection implications such as rights to access, rectification, or erasure. This paper argues that machine learning researchers must acknowledge the legal implications of LLMs as personal data throughout the full ML development lifecycle.
- Score: 2.4578723416255754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Does GPT know you? The answer depends on your level of public recognition; however, if your information was available on a website, the answer is probably yes. All Large Language Models (LLMs) memorize training data to some extent. If an LLM training corpus includes personal data, it also memorizes personal data. Developing an LLM typically involves processing personal data, which falls directly within the scope of data protection laws. If a person is identified or identifiable, the implications are far-reaching: the AI system is subject to EU General Data Protection Regulation requirements even after the training phase is concluded. To back our arguments: (1.) We reiterate that LLMs output training data at inference time, be it verbatim or in generalized form. (2.) We show that some LLMs can thus be considered personal data on their own. This triggers a cascade of data protection implications such as data subject rights, including rights to access, rectification, or erasure. These rights extend to the information embedded within the AI model. (3.) This paper argues that machine learning researchers must acknowledge the legal implications of LLMs as personal data throughout the full ML development lifecycle, from data collection and curation to model provision on, e.g., GitHub or Hugging Face. (4.) We propose different ways for the ML research community to deal with these legal implications. Our paper serves as a starting point for improving the alignment between data protection law and the technical capabilities of LLMs. Our findings underscore the need for more interaction between the legal domain and the ML community.
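Claim (1.), that LLMs output training data at inference time, can be checked empirically with a simple prefix-completion probe. The snippet below is a minimal sketch under stated assumptions: the model name ("gpt2") is only a stand-in for whichever LLM is under scrutiny, and the candidate string is a hypothetical personal-data record that would normally come from a known training document; neither is taken from the paper.

```python
# Minimal verbatim-memorization probe (hedged sketch, not the paper's method).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: any causal LM checkpoint under audit
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical personal-data string assumed to appear in the training corpus.
candidate = "Jane Doe, born 1980, lives at 12 Example Street, Tuebingen."
prefix, suffix = candidate[: len(candidate) // 2], candidate[len(candidate) // 2 :]

inputs = tokenizer(prefix, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,                      # greedy decoding keeps the probe deterministic
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens, not the prompt.
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# A verbatim match suggests memorization; leakage "in generalized form"
# would require fuzzier matching than this exact substring test.
print("verbatim match:", suffix.strip() in completion)
```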
Related papers
- Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models [52.439289085318634]
We show how to identify training data known to proprietary large language models (LLMs) by using information-guided probes.
Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes.
arXiv Detail & Related papers (2025-03-15T10:19:15Z)
- Scaling Laws for Differentially Private Language Models [53.14592585413073]
Scaling laws have emerged as important components of large language model (LLM) training as they can predict performance gains through scale. LLMs rely on large, high-quality training datasets, like those sourced from (sometimes sensitive) user data. Training models on this sensitive user data requires careful privacy protections like differential privacy (DP).
arXiv Detail & Related papers (2025-01-31T06:32:46Z)
- LLM-PBE: Assessing Data Privacy in Large Language Models [111.58198436835036]
Large Language Models (LLMs) have become integral to numerous domains, significantly advancing applications in data management, mining, and analysis.
Despite the critical nature of this issue, no existing literature offers a comprehensive assessment of data privacy risks in LLMs.
Our paper introduces LLM-PBE, a toolkit crafted specifically for the systematic evaluation of data privacy risks in LLMs.
arXiv Detail & Related papers (2024-08-23T01:37:29Z)
- Rethinking LLM Memorization through the Lens of Adversarial Compression [93.13830893086681]
Large language models (LLMs) trained on web-scale datasets raise substantial concerns regarding permissible data usage.
One major question is whether these models "memorize" all their training data or whether they integrate many data sources in a way more akin to how a human learns and synthesizes information.
We propose the Adversarial Compression Ratio (ACR) as a metric for assessing memorization in LLMs.
arXiv Detail & Related papers (2024-04-23T15:49:37Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- On Protecting the Data Privacy of Large Language Models (LLMs): A Survey [35.48984524483533]
Large language models (LLMs) are complex artificial intelligence systems capable of understanding, generating and translating human language.
LLMs process and generate large amounts of data, which may threaten data privacy.
arXiv Detail & Related papers (2024-03-08T08:47:48Z)
- Beyond Memorization: Violating Privacy Via Inference with Large Language Models [2.9373912230684565]
We present the first comprehensive study on the capabilities of pretrained language models to infer personal attributes from text.
Our findings highlight that current LLMs can infer personal data at a previously unattainable scale.
arXiv Detail & Related papers (2023-10-11T08:32:46Z)
- SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore [159.21914121143885]
We present SILO, a new language model that manages this risk-performance tradeoff during inference.
SILO is built by (1) training a parametric LM on Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text, and (2) augmenting it with a nonparametric datastore that is only queried during inference.
Access to the datastore greatly improves out-of-domain performance, closing 90% of the performance gap with an LM trained on the Pile.
arXiv Detail & Related papers (2023-08-08T17:58:15Z)
- What can we learn from Data Leakage and Unlearning for Law? [0.0]
Large Language Models (LLMs) pose a privacy concern because they memorize training data, including personally identifiable information (PII) such as emails and phone numbers, and leak it during inference.
In order to comply with privacy laws such as the "right to be forgotten", the data points of users that are most vulnerable to extraction could be deleted.
We show that not only do fine-tuned models leak their training data, but they also leak the pre-training data (and PII) memorized during the pre-training phase.
arXiv Detail & Related papers (2023-07-19T22:14:58Z)
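As a rough illustration of the PII-leakage concern raised in the last entry, the sketch below samples completions from a causal LM and flags email- or phone-shaped strings. This is an assumption-laden toy, not the probing methodology of any cited paper: the model name is a placeholder for a fine-tuned LLM, the prompts are generic, and the regular expressions are deliberately simplistic.

```python
# Naive PII-leakage scan (hedged sketch): sample continuations and flag
# strings that look like emails or phone numbers for manual review.
import re
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder for a fine-tuned LLM

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

prompts = ["Please contact me at", "You can reach our office by calling"]
for prompt in prompts:
    samples = generator(prompt, max_new_tokens=40, do_sample=True, num_return_sequences=3)
    for sample in samples:
        text = sample["generated_text"]
        hits = EMAIL.findall(text) + PHONE.findall(text)
        if hits:  # any hit is only a candidate leak; it still needs human verification
            print(f"possible PII in completion of {prompt!r}: {hits}")
```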