Watermark Robustness and Radioactivity May Be at Odds in Federated Learning
- URL: http://arxiv.org/abs/2510.17033v1
- Date: Sun, 19 Oct 2025 22:39:29 GMT
- Title: Watermark Robustness and Radioactivity May Be at Odds in Federated Learning
- Authors: Leixu Huang, Zedian Shao, Teodora Baluta,
- Abstract summary: Federated learning (FL) enables fine-tuning large language models (LLMs) across distributed data sources. We adapt watermarking for data provenance in FL, where a subset of clients compute local updates on watermarked data and the server averages all updates into the global LLM. Our work suggests fundamental trade-offs between radioactivity, robustness, and utility.
- Score: 3.6503955888587245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Federated learning (FL) enables fine-tuning large language models (LLMs) across distributed data sources. As these sources increasingly include LLM-generated text, provenance tracking becomes essential for accountability and transparency. We adapt LLM watermarking for data provenance in FL, where a subset of clients compute local updates on watermarked data and the server averages all updates into the global LLM. In this setup, watermarks are radioactive: the watermark signal remains detectable with high confidence after fine-tuning. The $p$-value can reach $10^{-24}$ even when as little as $6.6\%$ of the data is watermarked. However, the server can act as an active adversary that wants to preserve model utility while evading provenance tracking. Our observation is that updates induced by watermarked synthetic data appear as outliers relative to non-watermarked updates. Our adversary thus applies strong robust aggregation that can filter out these outliers, and with them the watermark signal. None of the evaluated radioactive watermarks is robust against such an actively filtering server. Our work suggests fundamental trade-offs between radioactivity, robustness, and utility.
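To make the two mechanisms in the abstract concrete, the sketch below is a minimal, hypothetical illustration rather than the paper's implementation. It assumes a standard green-list ("KGW-style") text watermark for the radioactivity test, and a simple median-distance filter as one possible instance of the strong robust aggregation an adversarial server could apply; function names such as `radioactivity_p_value` and `filtered_fedavg` are illustrative, not from the paper.

```python
# Minimal sketch (not the paper's implementation). Illustrates:
# (1) turning the fraction of green-list tokens emitted by a suspect model
#     into a z-score and one-sided p-value (radioactivity detection), and
# (2) a toy robust aggregation rule that drops client updates far from the
#     bulk, of the kind a server could use to filter watermark-carrying
#     outlier updates. Names and parameters are illustrative assumptions.

import math

import numpy as np


def radioactivity_p_value(green_hits: int, total_tokens: int, gamma: float = 0.25) -> float:
    """One-sided p-value that green-list tokens occur more often than the
    chance rate `gamma` (the keyed fraction of the vocabulary that is green)."""
    expected = gamma * total_tokens
    std = math.sqrt(gamma * (1.0 - gamma) * total_tokens)
    z = (green_hits - expected) / std
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # small value => strong watermark signal


def filtered_fedavg(updates: list[np.ndarray], keep_fraction: float = 0.8) -> np.ndarray:
    """Toy robust aggregation: rank client updates by distance to the
    coordinate-wise median update and average only the closest ones, so that
    outlier updates (e.g., those induced by watermarked synthetic data) are
    excluded from the global model update."""
    stacked = np.stack(updates)                 # shape: (num_clients, dim)
    median = np.median(stacked, axis=0)
    dists = np.linalg.norm(stacked - median, axis=1)
    k = max(1, int(keep_fraction * len(updates)))
    keep = np.argsort(dists)[:k]                # indices of the inlier updates
    return stacked[keep].mean(axis=0)


if __name__ == "__main__":
    # Detection example: 3,200 green tokens out of 10,000 at gamma = 0.25
    # yields a vanishingly small p-value.
    print(radioactivity_p_value(green_hits=3200, total_tokens=10000))

    # Aggregation example: nine benign updates plus one outlier update;
    # the outlier is dropped before averaging.
    rng = np.random.default_rng(0)
    benign = [rng.normal(0.0, 0.01, size=128) for _ in range(9)]
    outlier = [rng.normal(0.5, 0.01, size=128)]
    print(np.linalg.norm(filtered_fedavg(benign + outlier, keep_fraction=0.8)))
```

The filter here is only one example of robust aggregation; the paper's point is that any rule strong enough to suppress such outliers also suppresses the watermark signal they carry.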
Related papers
- Position: LLM Watermarking Should Align Stakeholders' Incentives for Practical Adoption [94.887133335656]
We revisit three classes of watermarking through this lens. LLM text watermarking offers modest provider benefit when framed solely as an anti-misuse tool. In-context watermarking (ICW) is tailored for trusted parties, such as conference organizers or educators.
arXiv Detail & Related papers (2025-10-21T06:34:51Z) - Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking [51.74368870268278]
We propose TRACE, a framework for fully black-box detection of copyrighted dataset usage in large language models. TRACE rewrites datasets with distortion-free watermarks guided by a private key. Across diverse datasets and model families, TRACE consistently achieves significant detections.
arXiv Detail & Related papers (2025-10-03T12:53:02Z) - Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation? [75.99961894619986]
This paper investigates whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN).
arXiv Detail & Related papers (2025-02-17T09:34:19Z) - Learning to Watermark LLM-generated Text via Reinforcement Learning [16.61005372279407]
We study how to watermark LLM outputs to track misuse.
We design a model-level watermark that embeds signals into the output.
We propose a co-training framework based on reinforcement learning.
arXiv Detail & Related papers (2024-03-13T03:43:39Z) - Watermarking Makes Language Models Radioactive [24.123479478427594]
It is possible to reliably determine if a language model was trained on synthetic data if that data is output by a watermarked LLM.
Our new methods, specialized for radioactivity, detect weak residuals of the watermark signal with provable confidence.
For instance, if the suspect model is open-weight, we demonstrate that training on watermarked instructions can be detected with high confidence.
arXiv Detail & Related papers (2024-02-22T18:55:22Z) - Proving membership in LLM pretraining data via data watermarks [20.57538940552033]
This work proposes using data watermarks to enable principled detection with only black-box model access.
We study two watermarks: one that inserts random sequences, and another that randomly substitutes characters with Unicode lookalikes (a minimal sketch of this substitution appears after this list).
We show that we can robustly detect hashes from BLOOM-176B's training data, as long as they occurred at least 90 times.
arXiv Detail & Related papers (2024-02-16T18:49:27Z) - Turning Your Strength into Watermark: Watermarking Large Language Model via Knowledge Injection [66.26348985345776]
We propose a novel watermarking method for large language models (LLMs) based on knowledge injection.
In the watermark embedding stage, we first embed the watermarks into the selected knowledge to obtain the watermarked knowledge.
In the watermark extraction stage, questions related to the watermarked knowledge are designed to query the suspect LLM.
Experiments show that the watermark extraction success rate is close to 100% and demonstrate the effectiveness, fidelity, stealthiness, and robustness of our proposed method.
arXiv Detail & Related papers (2023-11-16T03:22:53Z) - A Resilient and Accessible Distribution-Preserving Watermark for Large Language Models [65.40460716619772]
Our research focuses on the importance of a Distribution-Preserving (DiP) watermark.
Contrary to the current strategies, our proposed DiPmark simultaneously preserves the original token distribution during watermarking.
It is detectable without access to the language model API and prompts (accessible), and is provably robust to moderate changes of tokens.
arXiv Detail & Related papers (2023-10-11T17:57:35Z) - Unbiased Watermark for Large Language Models [67.43415395591221]
This study examines how significantly watermarks impact the quality of model-generated outputs.
It is possible to integrate watermarks without affecting the output probability distribution.
The presence of watermarks does not compromise the performance of the model in downstream tasks.
arXiv Detail & Related papers (2023-09-22T12:46:38Z) - On the Reliability of Watermarks for Large Language Models [95.87476978352659]
We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document.
We find that watermarks remain detectable even after human and machine paraphrasing.
We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document.
arXiv Detail & Related papers (2023-06-07T17:58:48Z) - Did You Train on My Dataset? Towards Public Dataset Protection with Clean-Label Backdoor Watermarking [54.40184736491652]
We propose a backdoor-based watermarking approach that serves as a general framework for safeguarding publicly available data.
By inserting a small number of watermarking samples into the dataset, our approach enables the learning model to implicitly learn a secret function set by defenders.
This hidden function can then be used as a watermark to track down third-party models that use the dataset illegally.
arXiv Detail & Related papers (2023-03-20T21:54:30Z) - On the Effectiveness of Dataset Watermarking in Adversarial Settings [14.095584034871658]
We investigate a proposed data provenance method, radioactive data, to assess if it can be used to demonstrate ownership of (image) datasets used to train machine learning (ML) models.
We show that radioactive data can effectively survive model extraction attacks, which raises the possibility that it can be used for ML model ownership verification robust against model extraction.
arXiv Detail & Related papers (2022-02-25T05:51:53Z)
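As a concrete illustration of the character-substitution data watermark summarized in "Proving membership in LLM pretraining data via data watermarks" above, the following is a minimal sketch under the assumption of a small Latin-to-Cyrillic lookalike map and a keyed pseudorandom substitution; the mapping, substitution rate, and function names are illustrative and not taken from that paper.

```python
# Minimal sketch of a Unicode-lookalike data watermark: a keyed pseudorandom
# subset of characters is replaced by visually similar Cyrillic codepoints.
# The mapping, rate, and names below are illustrative assumptions.

import random

# Latin characters and visually similar Cyrillic lookalikes.
LOOKALIKES = {"a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e", "p": "\u0440"}


def watermark_text(text: str, secret_key: int, rate: float = 0.05) -> str:
    """Replace a keyed random subset of substitutable characters with their
    Unicode lookalikes; the private key makes the substitutions reproducible."""
    rng = random.Random(secret_key)
    out = []
    for ch in text:
        if ch in LOOKALIKES and rng.random() < rate:
            out.append(LOOKALIKES[ch])
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    marked = watermark_text("a clean sample document about language models", secret_key=42)
    print(marked)
    print(sum(ch in LOOKALIKES.values() for ch in marked), "characters substituted")
```

Because the substituted positions are reproducible from the private key, a later test can compare a suspect model's behavior on the watermarked text against unwatermarked or re-randomized variants; the actual black-box detection procedure in that paper may differ from this sketch.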
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.