Related papers: RLCracker: Exposing the Vulnerability of LLM Watermarks with Adaptive RL Attacks

URL: http://arxiv.org/abs/2509.20924v1
Date: Thu, 25 Sep 2025 09:08:02 GMT
Title: RLCracker: Exposing the Vulnerability of LLM Watermarks with Adaptive RL Attacks
Authors: Hanbo Huang, Yiran Zhang, Hao Zheng, Xuan Gong, Yihan Li, Lin Liu, Shiyu Liang,
Abstract summary: We introduce adaptive robustness radius, a formal metric that quantifies watermark resilience against adaptive adversaries. We propose RLCracker, a reinforcement learning (RL)-based adaptive attack that erases watermarks while preserving semantic fidelity. Our results confirm that adaptive attacks are broadly effective and pose a fundamental threat to current watermarking defenses.
Score: 18.75982610851903
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Model (LLM) watermarking has shown promise in detecting AI-generated content and mitigating misuse, with prior work claiming robustness against paraphrasing and text editing. In this paper, we argue that existing evaluations are not sufficiently adversarial, obscuring critical vulnerabilities and overstating security. To address this, we introduce adaptive robustness radius, a formal metric that quantifies watermark resilience against adaptive adversaries. We theoretically prove that optimizing the attack context and model parameters can substantially reduce this radius, making watermarks highly susceptible to paraphrase attacks. Leveraging this insight, we propose RLCracker, a reinforcement learning (RL)-based adaptive attack that erases watermarks while preserving semantic fidelity. RLCracker requires only limited watermarked examples and zero access to the detector. Despite weak supervision, it empowers a 3B model to achieve 98.5% removal success and an average 0.92 P-SP score on 1,500-token Unigram-marked texts after training on only 100 short samples. This performance far exceeds the 6.75% achieved by GPT-4o and generalizes across five model sizes over ten watermarking schemes. Our results confirm that adaptive attacks are broadly effective and pose a fundamental threat to current watermarking defenses.

Related papers

AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications [71.27518152526686]
Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. LLMs can be manipulated by "adversarial instructions" hidden in input data, such as resumes or code, causing them to deviate from their intended task. This paper introduces a benchmark to assess this vulnerability in resume screening, revealing attack success rates exceeding 80% for certain attack types.
arXiv Detail & Related papers (2025-12-23T08:42:09Z)
LLM Watermark Evasion via Bias Inversion [24.543675977310357]
We propose the Bias-Inversion Rewriting Attack (BIRA), which is theoretically motivated and model-agnostic. BIRA weakens the watermark signal by suppressing the logits of likely watermarked tokens during rewriting, without any knowledge of the underlying watermarking scheme.
arXiv Detail & Related papers (2025-09-27T00:24:57Z)
Character-Level Perturbations Disrupt LLM Watermarks [64.60090923837701]
We formalize the system model for Large Language Model (LLM) watermarking. We characterize two realistic threat models constrained on limited access to the watermark detector. We demonstrate character-level perturbations are significantly more effective for watermark removal under the most restrictive threat model. Experiments confirm the superiority of character-level perturbations and the effectiveness of the Genetic Algorithm (GA) in removing watermarks under realistic constraints.
arXiv Detail & Related papers (2025-09-11T02:50:07Z)
Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text [47.84655968112988]
We introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our attack is both broadly effective and highly transferable across several detection systems.
arXiv Detail & Related papers (2025-06-08T05:15:01Z)
Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks [36.01146548147208]
Text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic efficient paraphrasing attack, which leverages the vulnerability by calculating the self-information of each token.
arXiv Detail & Related papers (2025-05-08T12:39:00Z)
Optimizing Adaptive Attacks against Watermarks for Language Models [5.798432964668272]
Large Language Models (LLMs) can be misused to spread unwanted content at scale. Watermarking deters misuse by hiding messages in content, enabling its detection using a secret watermarking key. We formulate watermark robustness as an objective function and use preference-based optimization to tune adaptive attacks against the specific watermarking method.
arXiv Detail & Related papers (2024-10-03T12:37:39Z)
CleanerCLIP: Fine-grained Counterfactual Semantic Augmentation for Backdoor Defense in Contrastive Learning [53.766434746801366]
We propose a fine-grained Text Alignment Cleaner (TA-Cleaner) to cut off feature connections of backdoor triggers. TA-Cleaner achieves state-of-the-art defensiveness among finetuning-based defense techniques.
arXiv Detail & Related papers (2024-09-26T07:35:23Z)
Large Language Model Watermark Stealing With Mixed Integer Programming [51.336009662771396]
Large Language Model (LLM) watermark shows promise in addressing copyright, monitoring AI-generated text, and preventing its misuse. Recent research indicates that watermarking methods using numerous keys are susceptible to removal attacks. We propose a novel green list stealing attack against the state-of-the-art LLM watermark scheme.
arXiv Detail & Related papers (2024-05-30T04:11:17Z)
Leveraging Optimization for Adaptive Attacks on Image Watermarks [31.70167647613335]
Watermarking deters misuse by marking generated content with a hidden message, enabling its detection using a secret watermarking key. Assessing robustness requires designing an adaptive attack for the specific watermarking algorithm. We show that an attacker can break all five surveyed watermarking methods at no visible degradation in image quality.
arXiv Detail & Related papers (2023-09-29T03:36:42Z)
(De)Randomized Smoothing for Certifiable Defense against Patch Attacks [136.79415677706612]
We introduce a certifiable defense against patch attacks that guarantees for a given image and patch attack size. Our method is related to the broad class of randomized smoothing robustness schemes. Our results effectively establish a new state-of-the-art of certifiable defense against patch attacks on CIFAR-10 and ImageNet.
arXiv Detail & Related papers (2020-02-25T08:39:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.