Related papers: Learning to Detect Language Model Training Data via Active Reconstruction

Learning to Detect Language Model Training Data via Active Reconstruction

URL: http://arxiv.org/abs/2602.19020v1
Date: Sun, 22 Feb 2026 03:20:06 GMT
Title: Learning to Detect Language Model Training Data via Active Reconstruction
Authors: Junjie Oscar Yin, John X. Morris, Vitaly Shmatikov, Sewon Min, Hannaneh Hajishirzi,
Abstract summary: We introduce textbfActive Data Reconstruction Attack (ADRA)<n>ADRA induces a model to reconstruct a given text through training.<n>Our algorithms consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data.
Score: 65.4791582049743
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce \textbf{Active Data Reconstruction Attack} (ADRA), a family of MIA that actively induces a model to reconstruct a given text through training. We hypothesize that training data are \textit{more reconstructible} than non-members, and the difference in their reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To effectively use RL for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, \textsc{ADRA} and its adaptive variant \textsc{ADRA+}, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7\% over the previous runner-up. In particular, \MethodPlus~improves over Min-K\%++ by 18.8\% on BookMIA for pre-training detection and by 7.6\% on AIME for post-training detection.

Related papers

What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy. By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing a small "forget set" training data on a pre-divertrained machine learning model -- has recently attracted interest. Recent research shows that machine unlearning techniques do not hold up in such a challenging setting.
arXiv Detail & Related papers (2024-10-30T17:20:10Z)
Detecting Training Data of Large Language Models via Expectation Maximization [62.28028046993391]
We introduce EM-MIA, a novel membership inference method that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm.<n> EM-MIA achieves state-of-the-art results on WikiMIA.
arXiv Detail & Related papers (2024-10-10T03:31:16Z)
Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs [61.04246774006429]
We introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent.<n>We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements.<n>Our findings show that instruction-tuned models can expose pre-training data as much as their base-models, if not more so, and using instructions proposed by other LLMs can open a new avenue of automated attacks.
arXiv Detail & Related papers (2024-03-05T19:32:01Z)
Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models [4.081098869497239]
We develop state-of-the-art privacy attacks against Large Language Models (LLMs) New membership inference attacks (MIAs) against pretrained LLMs perform hundreds of times better than baseline attacks. In fine-tuning, we find that a simple attack based on the ratio of the loss between the base and fine-tuned models is able to achieve near-perfect MIA performance.
arXiv Detail & Related papers (2024-02-26T20:41:50Z)
Reconstructing Training Data with Informed Adversaries [30.138217209991826]
Given access to a machine learning model, can an adversary reconstruct the model's training data? This work studies this question from the lens of a powerful informed adversary who knows all the training data points except one. We show it is feasible to reconstruct the remaining data point in this stringent threat model.
arXiv Detail & Related papers (2022-01-13T09:19:25Z)
Reconstructing Training Data from Diverse ML Models by Ensemble Inversion [8.414622657659168]
Model Inversion (MI), in which an adversary abuses access to a trained Machine Learning (ML) model, has attracted increasing research attention. We propose an ensemble inversion technique that estimates the distribution of original training data by training a generator constrained by an ensemble of trained models. We achieve high quality results without any dataset and show how utilizing an auxiliary dataset that's similar to the presumed training data improves the results.
arXiv Detail & Related papers (2021-11-05T18:59:01Z)
Off-Policy Meta-Reinforcement Learning Based on Feature Embedding Spaces [14.029933823101084]
We propose a novel off-policy meta-RL method, embedding learning and evaluation of uncertainty (ELUE) ELUE learns a belief model over the embedding space and a belief-conditional policy and Q-function. We demonstrate that ELUE outperforms state-of-the-art meta RL methods through experiments on meta-RL benchmarks.
arXiv Detail & Related papers (2021-01-06T05:51:38Z)
How Does Data Augmentation Affect Privacy in Machine Learning? [94.52721115660626]
We propose new MI attacks to utilize the information of augmented data. We establish the optimal membership inference when the model is trained with augmented data.
arXiv Detail & Related papers (2020-07-21T02:21:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.