User Privacy and Large Language Models: An Analysis of Frontier Developers' Privacy Policies
- URL: http://arxiv.org/abs/2509.05382v1
- Date: Fri, 05 Sep 2025 01:01:21 GMT
- Title: User Privacy and Large Language Models: An Analysis of Frontier Developers' Privacy Policies
- Authors: Jennifer King, Kevin Klyman, Emily Capstick, Tiffany Saade, Victoria Hsieh,
- Abstract summary: This paper analyzes the privacy policies of six U.S. frontier AI developers to understand how they use their users' chats to train models. We find that all six developers appear to employ their users' chat data to train and improve their models by default, and that some retain this data indefinitely. Developers' privacy policies often lack essential information about their practices, highlighting the need for greater transparency and accountability.
- Score: 1.59424536577914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hundreds of millions of people now regularly interact with large language models via chatbots. Model developers are eager to acquire new sources of high-quality training data as they race to improve model capabilities and win market share. This paper analyzes the privacy policies of six U.S. frontier AI developers to understand how they use their users' chats to train models. Drawing primarily on the California Consumer Privacy Act, we develop a novel qualitative coding schema that we apply to each developer's relevant privacy policies to compare data collection and use practices across the six companies. We find that all six developers appear to employ their users' chat data to train and improve their models by default, and that some retain this data indefinitely. Developers may collect and train on personal information disclosed in chats, including sensitive information such as biometric and health data, as well as files uploaded by users. Four of the six companies we examined appear to include children's chat data for model training, as well as customer data from other products. On the whole, developers' privacy policies often lack essential information about their practices, highlighting the need for greater transparency and accountability. We address the implications of users' lack of consent for the use of their chat data for model training, data security issues arising from indefinite chat data retention, and training on children's chat data. We conclude by providing recommendations to policymakers and developers to address the data privacy challenges posed by LLM-powered chatbots.
Related papers
- Guarding Your Conversations: Privacy Gatekeepers for Secure Interactions with Cloud-Based AI Models [0.34998703934432673]
We propose the concept of an "LLM gatekeeper", a lightweight, locally run model that filters out sensitive information from user queries before they are sent to the potentially untrustworthy, though highly capable, cloud-based LLM. Through experiments with human subjects, we demonstrate that this dual-model approach introduces minimal overhead while significantly enhancing user privacy, without compromising the quality of LLM responses.
arXiv Detail & Related papers (2025-08-22T19:49:03Z) - Understanding Privacy Norms Around LLM-Based Chatbots: A Contextual Integrity Perspective [14.179623604712065]
We conduct a survey experiment with 300 US ChatGPT users to understand emerging privacy norms for sharing ChatGPT data. Our findings reveal a stark disconnect between user concerns and behavior. Participants uniformly rejected sharing personal data for improved services, even in exchange for premium features worth $200.
arXiv Detail & Related papers (2025-08-09T00:22:46Z) - Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences [80.63946798650653]
We explore how users can stay in control of their data by using privacy profiles. We build a framework where a local model uses these instructions to rewrite queries. To support this research, we introduce a multilingual dataset of real user queries to mark private content.
arXiv Detail & Related papers (2025-07-07T18:22:55Z) - Are LLM-based methods good enough for detecting unfair terms of service? [67.49487557224415]
Large language models (LLMs) are good at parsing long text-based documents.
We build a dataset consisting of 12 questions applied individually to a set of privacy policies.
Some open-source models are able to provide a higher accuracy compared to some commercial models.
arXiv Detail & Related papers (2024-08-24T09:26:59Z) - NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human [56.46355425175232]
We suggest sanitizing sensitive text using two common strategies used by humans. We curate the first corpus, coined NAP2, through both crowdsourcing and the use of large language models. Compared to the prior works on anonymization, the human-inspired approaches result in more natural rewrites.
arXiv Detail & Related papers (2024-06-06T05:07:44Z) - PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z) - Protecting User Privacy in Online Settings via Supervised Learning [69.38374877559423]
We design an intelligent approach to online privacy protection that leverages supervised learning.
By detecting and blocking data collection that might infringe on a user's privacy, we can restore a degree of digital privacy to the user.
arXiv Detail & Related papers (2023-04-06T05:20:16Z) - FedBot: Enhancing Privacy in Chatbots with Federated Learning [0.0]
Federated Learning (FL) aims to protect data privacy through distributed learning methods that keep the data in its location.
The POC combines Deep Bidirectional Transformer models and federated learning algorithms to protect customer data privacy during collaborative model training.
The system is specifically designed to improve its performance and accuracy over time by leveraging its ability to learn from previous interactions.
arXiv Detail & Related papers (2023-04-04T23:13:52Z) - Certified Data Removal in Sum-Product Networks [78.27542864367821]
Deleting the collected data is often insufficient to guarantee data privacy.
UnlearnSPN is an algorithm that removes the influence of single data points from a trained sum-product network.
arXiv Detail & Related papers (2022-10-04T08:22:37Z) - You Are What You Write: Preserving Privacy in the Era of Large Language Models [2.3431670397288005]
We present an empirical investigation into the extent of the personal information encoded into pre-trained representations by a range of popular models.
We show a positive correlation between the complexity of a model, the amount of data used in pre-training, and data leakage.
arXiv Detail & Related papers (2022-04-20T11:12:53Z) - Security and Privacy Preserving Deep Learning [2.322461721824713]
Massive data collection required for deep learning presents obvious privacy issues.
Users' personal, highly sensitive data, such as photos and voice recordings, are kept indefinitely by the companies that collect them.
Deep neural networks are susceptible to various inference attacks as they remember information about their training data.
arXiv Detail & Related papers (2020-06-23T01:53:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.