Perturb Your Data: Paraphrase-Guided Training Data Watermarking
- URL: http://arxiv.org/abs/2512.17075v1
- Date: Thu, 18 Dec 2025 21:17:16 GMT
- Title: Perturb Your Data: Paraphrase-Guided Training Data Watermarking
- Authors: Pranav Shetty, Mirazul Haque, Petr Babkin, Zhiqiang Ma, Xiaomo Liu, Manuela Veloso
- Abstract summary: SPECTRA is a watermarking approach that makes training data reliably detectable even when it comprises less than 0.001% of the training corpus. We demonstrate that SPECTRA achieves a consistent p-value gap of over nine orders of magnitude when detecting data used for training versus data not used for training.
- Score: 20.738856513256238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training data detection is critical for enforcing copyright and data licensing, as Large Language Models (LLMs) are trained on massive text corpora scraped from the internet. We present SPECTRA, a watermarking approach that makes training data reliably detectable even when it comprises less than 0.001% of the training corpus. SPECTRA works by paraphrasing text with an LLM and assigning each paraphrase a score based on how likely it is according to a separate scoring model. The paraphrase whose score most closely matches that of the original text is chosen, to avoid introducing distribution shifts. To test whether a suspect model has been trained on the watermarked data, we compare its token probabilities against those of the scoring model. We demonstrate that SPECTRA achieves a consistent p-value gap of over nine orders of magnitude between data used for training and data not used for training, a larger gap than that of any baseline tested. SPECTRA equips data owners with a scalable, deploy-before-release watermark that survives even large-scale LLM training.
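As a rough illustration of the pipeline the abstract describes, here is a minimal Python sketch. This is not the authors' implementation: `score_fn` (log-likelihood under the scoring model) and the per-token log-probability lists are assumed interfaces the caller supplies.

```python
def select_watermark(original: str, paraphrases: list[str], score_fn) -> str:
    """Pick the paraphrase whose scoring-model score best matches the
    original's, so the released text avoids an obvious distribution shift."""
    target = score_fn(original)
    return min(paraphrases, key=lambda p: abs(score_fn(p) - target))

def detection_statistic(suspect_logps: list[float], scorer_logps: list[float]) -> float:
    """Mean per-token log-probability gap between the suspect model and the
    scoring model; text the suspect trained on should shift this gap upward."""
    gaps = [s - r for s, r in zip(suspect_logps, scorer_logps)]
    return sum(gaps) / len(gaps)
```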
Related papers
- Extracting alignment data in open models [50.81383232591576]
We show that it is possible to extract significant amounts of alignment training data from a post-trained model. This data is useful to steer the model to improve certain capabilities such as long-context reasoning, safety, instruction following, and maths. We find that models readily regurgitate training data that was used in post-training phases such as SFT or RL.
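One simple way to quantify the regurgitation this summary describes is verbatim n-gram overlap between candidate training text and model generations; the sketch below is a generic heuristic, not the paper's extraction procedure.

```python
def ngram_overlap(candidate: str, generation: str, n: int = 8) -> float:
    """Fraction of the candidate's word n-grams reproduced verbatim in a
    model generation; high overlap is evidence of regurgitation."""
    def grams(text: str) -> set:
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    c = grams(candidate)
    return len(c & grams(generation)) / max(1, len(c))
```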
arXiv Detail & Related papers (2025-10-21T12:06:00Z)
- Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning [11.722958734691021]
We show that indirect data poisoning can effectively protect a dataset and trace its use. We make a model learn arbitrary secret sequences: secret responses to secret prompts that are absent from the training corpus. We validate our approach on language models pre-trained from scratch and show that less than 0.005% of poisoned tokens are sufficient to covertly make an LM learn a secret.
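The tracing step described here can be framed as a rank test: does the suspect model prefer the secret response over matched decoys? A minimal sketch, assuming a hypothetical `logp_fn(prompt, response)` that returns the model's log-likelihood:

```python
def secret_pvalue(logp_fn, secret_prompt: str, secret_resp: str,
                  decoys: list[str]) -> float:
    """Rank the secret response's log-likelihood among length-matched decoys;
    a small p-value means the model 'knows' a pair absent from public data."""
    target = logp_fn(secret_prompt, secret_resp)
    at_least = sum(1 for d in decoys if logp_fn(secret_prompt, d) >= target)
    return (at_least + 1) / (len(decoys) + 1)  # permutation-style p-value
```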
arXiv Detail & Related papers (2025-06-17T18:46:45Z)
- STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings [17.175065729425825]
STAMP is a framework for detecting dataset membership. We show that our framework can successfully detect contamination across four benchmarks which appear only once in the training data.
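Membership tests of this kind typically compare the suspect model's loss on the released rephrasing against held-out private rephrasings of the same documents. A minimal sign-test sketch (not STAMP's exact statistic), taking paired per-document negative log-likelihoods as inputs:

```python
from math import comb

def sign_test_pvalue(released_nll: list[float], private_nll: list[float]) -> float:
    """One-sided sign test over paired documents: if the model trained on the
    released rephrasing, its NLL there should be systematically lower."""
    wins = sum(r < p for r, p in zip(released_nll, private_nll))
    n = len(released_nll)
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
```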
arXiv Detail & Related papers (2025-04-18T02:25:08Z)
- Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method [108.56493934296687]
We introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection. We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text.
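One plausible reading of frequency-based calibration, sketched below (not necessarily the paper's exact statistic): discount each token's log-probability by its log-frequency in a reference corpus, so tokens that are likely everywhere contribute less to the membership score.

```python
import math

def calibrated_score(token_logps: list[float], ref_freqs: list[float]) -> float:
    """Discount each token's log-probability by its log-frequency in a
    reference corpus, then average; high values point toward memorization."""
    return sum(lp - math.log(f) for lp, f in zip(token_logps, ref_freqs)) / len(token_logps)
```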
arXiv Detail & Related papers (2024-09-23T07:55:35Z)
- Improving Pretraining Data Using Perplexity Correlations [56.41097718862742]
We present a framework that selects high-quality pretraining data without any LLM training of our own. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations. Our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM.
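The core statistic can be sketched as a per-domain correlation between models' log-perplexities and their benchmark scores; the NumPy snippet below assumes a matrix of log-perplexities over public models and domains, and is an illustration rather than the paper's estimator.

```python
import numpy as np

def domain_selection_scores(logppl: np.ndarray, bench: np.ndarray) -> np.ndarray:
    """logppl: (n_models, n_domains) log-perplexities of public models;
    bench: (n_models,) benchmark scores. Returns a per-domain score that is
    high where lower perplexity reliably co-occurs with higher benchmarks."""
    lp = (logppl - logppl.mean(axis=0)) / logppl.std(axis=0)
    b = (bench - bench.mean()) / bench.std()
    return -(lp * b[:, None]).mean(axis=0)  # negate: low ppl <-> high score
```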
arXiv Detail & Related papers (2024-09-09T17:23:29Z)
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options. Our method is able to work under gray-box conditions without access to model training data or weights. We evaluate the degree of data leakage of 35 mainstream open-source LLMs on four benchmark datasets.
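A minimal option-content test in this spirit (an illustration, not the paper's exact method): compare the likelihood of the canonical option ordering against permuted orderings; a strong outlier suggests the benchmark text itself was seen in training. `logp_fn` is an assumed gray-box sequence-scoring interface.

```python
import itertools, statistics

def order_outlier_z(logp_fn, stem: str, options: list[str], max_perms: int = 24) -> float:
    """z-score of the canonical option ordering's log-likelihood against
    permuted orderings; a large positive z hints the benchmark text leaked."""
    perms = list(itertools.permutations(options))[:max_perms]  # identity first
    scores = [logp_fn(stem + "\n" + "\n".join(p)) for p in perms]
    canon, rest = scores[0], scores[1:]
    return (canon - statistics.mean(rest)) / (statistics.pstdev(rest) or 1.0)
```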
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- Adaptive Pre-training Data Detection for Large Language Models via Surprising Tokens [1.2549198550400134]
Large language models (LLMs) are extensively used, but there are concerns regarding privacy, security, and copyright due to their opaque training data.
Current solutions to this problem leverage techniques explored in machine learning privacy, such as Membership Inference Attacks (MIAs).
We propose an adaptive pre-training data detection method that alleviates this reliance and effectively amplifies the identification signal.
arXiv Detail & Related papers (2024-07-30T23:43:59Z)
- Safe Training with Sensitive In-domain Data: Leveraging Data Fragmentation To Mitigate Linkage Attacks [2.8186733524862158]
Current text generation models are trained using real data which can potentially contain sensitive information.
We propose a safer alternative that trains on fragmented data: domain-specific short phrases randomly grouped together.
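A minimal sketch of this fragmentation idea, assuming whitespace tokenization and hypothetical phrase and group sizes:

```python
import random

def fragment_and_group(docs: list[str], phrase_len: int = 8,
                       group_size: int = 12, seed: int = 0) -> list[str]:
    """Split documents into short word-level phrases, shuffle them across the
    corpus, and regroup randomly so no training sample preserves one source."""
    rng = random.Random(seed)
    phrases = [" ".join(words[i:i + phrase_len])
               for doc in docs
               for words in [doc.split()]
               for i in range(0, len(words), phrase_len)]
    rng.shuffle(phrases)
    return [" ".join(phrases[i:i + group_size])
            for i in range(0, len(phrases), group_size)]
```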
arXiv Detail & Related papers (2024-04-30T12:09:55Z)
- DE-COP: Detecting Copyrighted Content in Language Models Training Data [24.15936677068714]
We propose DE-COP, a method to determine whether a piece of copyrighted content was included in training.
We construct BookTection, a benchmark with excerpts from 165 books published before and after a model's training cutoff.
Experiments show that DE-COP surpasses the prior best method by 9.6% in detection performance.
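DE-COP's quiz protocol can be sketched as follows, with hypothetical `choose_fn` (the suspect model's pick, as an option index) and `paraphrase_fn` callables supplied by the caller:

```python
import random

def decop_accuracy(choose_fn, excerpts: list[str], paraphrase_fn,
                   n_options: int = 4, seed: int = 0) -> float:
    """One verbatim excerpt hidden among paraphrases per question; picking
    the verbatim option well above 1/n_options suggests training inclusion."""
    rng = random.Random(seed)
    hits = 0
    for ex in excerpts:
        options = [ex] + [paraphrase_fn(ex) for _ in range(n_options - 1)]
        rng.shuffle(options)
        hits += int(choose_fn(options) == options.index(ex))
    return hits / len(excerpts)
```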
arXiv Detail & Related papers (2024-02-15T12:17:15Z)
- Detecting Pretraining Data from Large Language Models [90.12037980837738]
We study the pretraining data detection problem.
Given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text?
We introduce a new detection method Min-K% Prob based on a simple hypothesis.
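Min-K% Prob itself reduces to a few lines: average the log-probabilities of the k% least likely tokens and threshold the result. A minimal sketch:

```python
def min_k_prob(token_logps: list[float], k: float = 0.2) -> float:
    """Average log-probability of the k% least likely tokens; non-member
    text tends to contain more extreme low-probability outlier tokens."""
    n = max(1, int(len(token_logps) * k))
    lowest = sorted(token_logps)[:n]
    return sum(lowest) / n
```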
arXiv Detail & Related papers (2023-10-25T17:21:23Z)
- Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show that our methods, leveraging only 20-30 labeled samples per class per task for training and validation, can perform within 3% of fully supervised pre-trained language models.
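A common way to realize such uncertainty estimates is Monte Carlo dropout: run several stochastic forward passes and keep the pseudo-labeled samples with the lowest predictive entropy. A minimal NumPy sketch (an illustration, not the paper's exact selection rule):

```python
import numpy as np

def select_pseudolabels(mc_probs: np.ndarray, keep_frac: float = 0.5):
    """mc_probs: (n_samples, n_passes, n_classes) softmax outputs from
    dropout-enabled forward passes. Keep the lowest-entropy samples and
    return their indices together with hard pseudo-labels."""
    mean = mc_probs.mean(axis=1)                        # predictive mean
    entropy = -(mean * np.log(mean + 1e-12)).sum(axis=1)
    keep = np.argsort(entropy)[: int(len(entropy) * keep_frac)]
    return keep, mean[keep].argmax(axis=1)
```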
arXiv Detail & Related papers (2020-06-27T08:13:58Z)