Data Provenance Auditing of Fine-Tuned Large Language Models with a Text-Preserving Technique
- URL: http://arxiv.org/abs/2510.09655v1
- Date: Tue, 07 Oct 2025 08:34:08 GMT
- Title: Data Provenance Auditing of Fine-Tuned Large Language Models with a Text-Preserving Technique
- Authors: Yanming Li, Seifeddine Ghozzi, Cédric Eichler, Nicolas Anciaux, Alexandra Bensamoun, Lorena Gonzalez Manzano
- Abstract summary: We introduce a text-preserving watermarking framework that embeds invisible Unicode characters into documents. We experimentally observe a failure rate of less than 0.1% when detecting a reply after fine-tuning with 50 marked documents. No spurious reply was recovered in over 18,000 challenges, corresponding to 100% TPR at 0% FPR.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the problem of auditing whether sensitive or copyrighted texts were used to fine-tune large language models (LLMs) under black-box access. Prior signals, such as verbatim regurgitation and membership inference, are unreliable at the level of individual documents or require altering the visible text. We introduce a text-preserving watermarking framework that embeds sequences of invisible Unicode characters into documents. Each watermark is split into a cue (embedded in odd chunks) and a reply (embedded in even chunks). At audit time, we submit prompts that contain only the cue; the presence of the corresponding reply in the model's output provides evidence of memorization consistent with training on the marked text. To obtain sound decisions, we compare the score of the published watermark against a held-out set of counterfactual watermarks and apply a ranking test with a provable false-positive-rate bound. The design is (i) minimally invasive (no visible text changes), (ii) scalable to many users and documents via a large watermark space and multi-watermark attribution, and (iii) robust to common passive transformations. We evaluate on open-weight LLMs and multiple text domains, analyzing regurgitation dynamics, sensitivity to training set size, and interference under multiple concurrent watermarks. Our results demonstrate reliable post-hoc provenance signals with bounded FPR under black-box access. We experimentally observe a failure rate of less than 0.1% when detecting a reply after fine-tuning with 50 marked documents. Conversely, no spurious reply was recovered in over 18,000 challenges, corresponding to 100% TPR at 0% FPR. Moreover, detection rates remain relatively stable as the dataset size increases, maintaining a per-document detection rate above 45% even when the marked collection accounts for less than 0.33% of the fine-tuning data.
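The cue/reply embedding and the ranking test described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the two-character invisible alphabet, the chunking by word count, and all function names are assumptions made for the example.

```python
# Sketch of the text-preserving watermark: invisible Unicode characters
# encode a cue (appended to odd chunks) and a reply (appended to even
# chunks), leaving the visible text unchanged.

# Hypothetical 1-bit alphabet: zero-width space / zero-width non-joiner.
INVISIBLE = {"0": "\u200b", "1": "\u200c"}

def encode_bits(bits: str) -> str:
    """Map a bit string to a run of invisible characters."""
    return "".join(INVISIBLE[b] for b in bits)

def embed_watermark(text: str, cue_bits: str, reply_bits: str,
                    chunk_words: int = 20) -> str:
    """Append the cue to odd chunks and the reply to even chunks
    (1-indexed); stripping the invisible characters recovers the text."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    marked = []
    for idx, chunk in enumerate(chunks, start=1):
        payload = encode_bits(cue_bits if idx % 2 == 1 else reply_bits)
        marked.append(chunk + payload)
    return " ".join(marked)

def contains_reply(model_output: str, reply_bits: str) -> bool:
    """Audit-time check: prompt with the cue, then look for the
    invisible reply in the model's output."""
    return encode_bits(reply_bits) in model_output

def rank_test_fpr_bound(num_counterfactuals: int) -> float:
    """Under the null (model never saw the marked text), the published
    watermark's score is exchangeable with the K counterfactuals', so
    the chance it ranks first is at most 1 / (K + 1)."""
    return 1.0 / (num_counterfactuals + 1)
```

For example, with K = 99 held-out counterfactual watermarks, a rank-1 decision rule has a false-positive rate of at most 1%.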
Related papers
- MirrorMark: A Distortion-Free Multi-Bit Watermark for Large Language Models [5.735801967350819]
We propose MirrorMark, a distortion-free watermark for large language models (LLMs). MirrorMark embeds multi-bit messages without altering the token probability distribution, preserving text quality by design. Experiments show that MirrorMark matches the text quality of non-watermarked generation while achieving substantially stronger detectability.
arXiv Detail & Related papers (2026-01-29T19:10:48Z) - How Good is Post-Hoc Watermarking With Language Model Rephrasing? [43.5649433230903]
Generation-time text watermarking embeds statistical signals into text for traceability of AI-generated content. We explore post-hoc watermarking, where an LLM rewrites existing text while applying generation-time watermarking. Our strategies achieve strong detectability and semantic fidelity on open-ended text such as books.
arXiv Detail & Related papers (2025-12-18T18:57:33Z) - Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking [51.74368870268278]
We propose TRACE, a framework for fully black-box detection of copyrighted dataset usage in large language models. TRACE rewrites datasets with distortion-free watermarks guided by a private key. Across diverse datasets and model families, TRACE consistently achieves significant detection.
arXiv Detail & Related papers (2025-10-03T12:53:02Z) - Watermarking Language Models for Many Adaptive Users [47.90822587139056]
We study watermarking schemes for language models with provable guarantees.
We introduce multi-user watermarks, which allow tracing model-generated text to individual users.
We prove that the undetectable zero-bit scheme of Christ, Gunn, and Zamir (2024) is adaptively robust.
arXiv Detail & Related papers (2024-05-17T22:15:30Z) - Duwak: Dual Watermarks in Large Language Models [49.00264962860555]
We propose, Duwak, to enhance the efficiency and quality of watermarking by embedding dual secret patterns in both token probability distribution and sampling schemes.
We evaluate Duwak extensively on Llama2, against four state-of-the-art watermarking techniques and combinations of them.
arXiv Detail & Related papers (2024-03-12T16:25:38Z) - Provably Robust Multi-bit Watermarking for AI-generated Text [37.21416140194606]
Large Language Models (LLMs) have demonstrated remarkable capabilities of generating texts resembling human language. They can be misused by criminals to create deceptive content, such as fake news and phishing emails. Watermarking is a key technique to address these concerns, which embeds a message into a text.
arXiv Detail & Related papers (2024-01-30T08:46:48Z) - Watermark Text Pattern Spotting in Document Images [3.6298655794854464]
In the wild, writing can come in various fonts, sizes and forms, making generic recognition a very difficult problem.
We propose a novel benchmark (K-Watermark) containing 65,447 data samples generated using Wrender.
A validity study using human raters yields an authenticity score of 0.51 against pre-generated watermarked documents.
arXiv Detail & Related papers (2024-01-10T14:02:45Z) - On the Reliability of Watermarks for Large Language Models [95.87476978352659]
We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document.
We find that watermarks remain detectable even after human and machine paraphrasing.
We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document.
arXiv Detail & Related papers (2023-06-07T17:58:48Z) - A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters.
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.