Retracing the Past: LLMs Emit Training Data When They Get Lost
- URL: http://arxiv.org/abs/2511.05518v1
- Date: Mon, 27 Oct 2025 03:48:24 GMT
- Title: Retracing the Past: LLMs Emit Training Data When They Get Lost
- Authors: Myeongseob Ko, Nikhil Reddy Billa, Adam Nguyen, Charles Fleming, Ming Jin, Ruoxi Jia
- Abstract summary: The memorization of training data in large language models poses significant privacy and copyright concerns. This paper introduces Confusion-Inducing Attacks (CIA), a principled framework for extracting memorized data.
- Score: 18.852558767604823
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The memorization of training data in large language models (LLMs) poses significant privacy and copyright concerns. Existing data extraction methods, particularly heuristic-based divergence attacks, often exhibit limited success and offer little insight into the fundamental drivers of memorization leakage. This paper introduces Confusion-Inducing Attacks (CIA), a principled framework for extracting memorized data by systematically maximizing model uncertainty. We empirically demonstrate that the emission of memorized text during divergence is preceded by a sustained spike in token-level prediction entropy. CIA leverages this insight by optimizing input snippets to deliberately induce this consecutive high-entropy state. For aligned LLMs, we further propose Mismatched Supervised Fine-tuning (SFT) to simultaneously weaken their alignment and induce targeted confusion, thereby increasing susceptibility to our attacks. Experiments on various unaligned and aligned LLMs demonstrate that our proposed attacks outperform existing baselines in extracting verbatim and near-verbatim training data without requiring prior knowledge of the training data. Our findings highlight persistent memorization risks across various LLMs and offer a more systematic method for assessing these vulnerabilities.
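The attack's core signal, a sustained run of high token-level prediction entropy preceding the emission of memorized text, can be sketched in code. The following is an illustrative reconstruction, not the authors' implementation: the entropy threshold, the minimum run length, and the synthetic logits are all assumptions made for demonstration.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of each next-token distribution.
    logits: (seq_len, vocab_size) array of raw model scores."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def sustained_high_entropy(entropies, threshold, min_run):
    """Return (start, end) indices of the first run of at least
    min_run consecutive tokens whose entropy exceeds threshold,
    or None if no such run exists."""
    run_start = None
    for i, h in enumerate(entropies):
        if h > threshold:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_run:
                return (run_start, i)
        else:
            run_start = None
    return None

# Synthetic demo: 5 confident (peaked) steps followed by 5 uncertain
# (near-uniform) steps over a toy vocabulary of 50 tokens.
rng = np.random.default_rng(0)
peaked = np.tile(np.eye(1, 50, 0) * 10.0, (5, 1))   # low entropy
flat = rng.normal(0.0, 0.01, size=(5, 50))          # entropy near ln(50)
H = token_entropy(np.vstack([peaked, flat]))
print(sustained_high_entropy(H, threshold=2.0, min_run=3))  # → (5, 7)
```

Under this sketch, the attack's optimization objective would be to search for input snippets that drive the model into the near-uniform regime for several consecutive steps, since the paper reports that divergence into memorized text follows such sustained high-entropy states.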
Related papers
- Closing the Distribution Gap in Adversarial Training for LLMs [50.33186122381395]
Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. We argue that current adversarial training algorithms minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. We propose Distributional Adversarial Training (DAT) to approximate the true joint distribution of prompts and responses.
arXiv Detail & Related papers (2026-02-16T22:34:52Z) - Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms [3.648393062009244]
Large language models (LLMs) are increasingly trained on massive, heterogeneous text corpora. This raises serious concerns about the unauthorised use of proprietary or personal data during model training. We propose Disclaimer Injection, a novel data-level defence that renders text unlearnable to LLMs.
arXiv Detail & Related papers (2026-01-06T20:34:15Z) - SPEAR++: Scaling Gradient Inversion via Sparsely-Used Dictionary Learning [48.41770886055744]
Federated Learning has seen an increased deployment in real-world scenarios recently. The introduction of the so-called gradient inversion attacks has challenged its privacy-preserving properties. We introduce SPEAR, which is based on a theoretical analysis of the gradients of linear layers with ReLU activations. Our new attack, SPEAR++, retains all desirable properties of SPEAR, such as robustness to DP noise and FedAvg aggregation.
arXiv Detail & Related papers (2025-10-28T09:06:19Z) - LLM Unlearning on Noisy Forget Sets: A Study of Incomplete, Rewritten, and Watermarked Data [69.5099112089508]
Large language models (LLMs) exhibit remarkable generative capabilities but raise ethical and security concerns by memorizing sensitive data. This work presents the first study of unlearning under perturbed or low-fidelity forget data, referred to as noisy forget sets. We find that unlearning remains surprisingly robust to perturbations, provided that core semantic signals are preserved.
arXiv Detail & Related papers (2025-10-10T05:10:49Z) - Unlocking Memorization in Large Language Models with Dynamic Soft Prompting [66.54460367290146]
Large language models (LLMs) have revolutionized natural language processing (NLP) tasks such as summarization, question answering, and translation.
LLMs pose significant security risks due to their tendency to memorize training data, leading to potential privacy breaches and copyright infringement.
We propose a novel method for estimating LLM memorization using dynamic, prefix-dependent soft prompts.
arXiv Detail & Related papers (2024-09-20T18:56:32Z) - Detecting, Explaining, and Mitigating Memorization in Diffusion Models [49.438362005962375]
We introduce a straightforward yet effective method for detecting memorized prompts by inspecting the magnitude of text-conditional predictions.
Our proposed method seamlessly integrates without disrupting sampling algorithms, and delivers high accuracy even at the first generation step.
Building on our detection strategy, we unveil an explainable approach that shows the contribution of individual words or tokens to memorization.
arXiv Detail & Related papers (2024-07-31T16:13:29Z) - Protecting Privacy Through Approximating Optimal Parameters for Sequence Unlearning in Language Models [37.172662930947446]
Language models (LMs) are potentially vulnerable to extraction attacks, which represent a significant privacy risk.
We propose Privacy Protection via Optimal Parameters (POP), a novel unlearning method that effectively forgets the target token sequences from the pretrained LM.
POP exhibits remarkable retention performance post-unlearning across 9 classification and 4 dialogue benchmarks, outperforming the state-of-the-art by a large margin.
arXiv Detail & Related papers (2024-06-20T08:12:49Z) - Extracting Training Data from Unconditional Diffusion Models [76.85077961718875]
Diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI).
We aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization.
Based on the theoretical analysis, we propose a novel data extraction method called Surrogate condItional Data Extraction (SIDE) that leverages a classifier trained on generated data as a surrogate condition to extract training data directly from unconditional diffusion models.
arXiv Detail & Related papers (2024-06-18T16:20:12Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Learning to Poison Large Language Models for Downstream Manipulation [12.521338629194503]
This work identifies additional security risks in Large Language Models (LLMs) by designing a new data poisoning attack tailored to exploit the supervised fine-tuning process. We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently. We propose two defense strategies against data poisoning attacks, including in-context learning (ICL) and continuous learning (CL).
arXiv Detail & Related papers (2024-02-21T01:30:03Z) - Amplifying Training Data Exposure through Fine-Tuning with Pseudo-Labeled Memberships [3.544065185401289]
Neural language models (LMs) are vulnerable to training data extraction attacks due to data memorization.
This paper introduces a novel attack scenario wherein an attacker fine-tunes pre-trained LMs to amplify the exposure of the original training data.
Our empirical findings indicate a remarkable outcome: LMs with over 1B parameters exhibit a four to eight-fold increase in training data exposure.
arXiv Detail & Related papers (2024-02-19T14:52:50Z) - Mitigating Approximate Memorization in Language Models via Dissimilarity Learned Policy [0.0]
Large language models (LLMs) are trained on large amounts of data.
LLMs have been shown to memorize parts of the training data and emit those data verbatim when an adversary prompts appropriately.
arXiv Detail & Related papers (2023-05-02T15:53:28Z)