$B^4$: A Black-Box Scrubbing Attack on LLM Watermarks
- URL: http://arxiv.org/abs/2411.01222v3
- Date: Thu, 07 Nov 2024 01:26:43 GMT
- Title: $B^4$: A Black-Box Scrubbing Attack on LLM Watermarks
- Authors: Baizhou Huang, Xiao Pu, Xiaojun Wan
- Abstract summary: Watermarking has emerged as a prominent technique for content detection by embedding imperceptible patterns.
Previous work typically considers a grey-box attack setting, where the specific type of watermark is already known.
We propose $B^4$, a black-box scrubbing attack on watermarks.
- Score: 42.933100948624315
- Abstract: Watermarking has emerged as a prominent technique for detecting LLM-generated content by embedding imperceptible patterns. Despite strong performance, its robustness against adversarial attacks remains underexplored. Previous work typically considers a grey-box attack setting, where the specific type of watermark is already known. Some even necessitate knowledge of the watermarking method's hyperparameters. Such prerequisites are unattainable in real-world scenarios. Targeting a more realistic black-box threat model with fewer assumptions, we propose $B^4$, a black-box scrubbing attack on watermarks. Specifically, we formulate the watermark scrubbing attack as a constrained optimization problem by capturing its objectives with two distributions, a Watermark Distribution and a Fidelity Distribution. This optimization problem can be approximately solved using two proxy distributions. Experimental results across 12 different settings demonstrate the superior performance of $B^4$ compared with other baselines.
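One plausible reading of this constrained-optimization framing (the notation here is illustrative, not the paper's): let $x$ be the watermarked text and $y$ a candidate rewrite, with a Fidelity Distribution $P_{\mathrm{fid}}$ scoring semantic faithfulness to $x$ and a Watermark Distribution $P_{\mathrm{wm}}$ scoring residual watermark evidence in $y$. The attack then seeks

$$\max_{y} \ \log P_{\mathrm{fid}}(y \mid x) \quad \text{s.t.} \quad \log P_{\mathrm{wm}}(y) \le \epsilon,$$

where $\epsilon$ bounds the remaining watermark signal, and both distributions are replaced by black-box proxy distributions to make the problem tractable.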
Related papers
- Invisible Watermarks: Attacks and Robustness [0.3495246564946556]
We introduce novel improvements to watermarking robustness and minimize image-quality degradation during attacks.
We propose a custom watermark remover network which preserves one of the watermarking modalities while completely removing the other during decoding.
Our evaluation suggests that 1) implementing the watermark remover model to preserve one of the watermark modalities when decoding the other modality slightly improves on the baseline performance, and that 2) LBA degrades the image significantly less compared to uniform blurring of the entire image.
arXiv Detail & Related papers (2024-12-17T03:50:13Z)
- Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models [16.57738116313139]
We show that attackers can leverage unrelated models, even with different latent spaces and architectures, to perform powerful and realistic forgery attacks.
The first imprints a targeted watermark into real images by manipulating the latent representation of an arbitrary image in an unrelated LDM.
The second attack generates new images with the target watermark by inverting a watermarked image and re-generating it with an arbitrary prompt.
arXiv Detail & Related papers (2024-12-04T12:57:17Z)
- The Efficacy of Transfer-based No-box Attacks on Image Watermarking: A Pragmatic Analysis [11.724935807582513]
We investigate the robustness of image watermarking methods in the "no-box" setting, where the attacker is assumed to have no knowledge about the watermarking model.
We show that when the configuration is mostly aligned, a simple non-optimization attack can already exceed the success of optimization-based efforts.
arXiv Detail & Related papers (2024-12-03T17:02:49Z)
- An undetectable watermark for generative image models [65.31658824274894]
We present the first undetectable watermarking scheme for generative image models.
In particular, an undetectable watermark does not degrade image quality under any efficiently computable metric.
Our scheme works by selecting the initial latents of a diffusion model using a pseudorandom error-correcting code.
arXiv Detail & Related papers (2024-10-09T18:33:06Z)
- Can Watermarked LLMs be Identified by Users via Crafted Prompts? [55.460327393792156]
This work is the first to investigate the imperceptibility of watermarked Large Language Models (LLMs).
We design an identification algorithm called Water-Probe that detects watermarks through well-designed prompts.
Experiments show that almost all mainstream watermarking algorithms are easily identified with our well-designed prompts.
arXiv Detail & Related papers (2024-10-04T06:01:27Z)
- Large Language Model Watermark Stealing With Mixed Integer Programming [51.336009662771396]
Large Language Model (LLM) watermarking shows promise for addressing copyright concerns, monitoring AI-generated text, and preventing its misuse.
Recent research indicates that watermarking methods using numerous keys are susceptible to removal attacks.
We propose a novel green list stealing attack against the state-of-the-art LLM watermark scheme.
arXiv Detail & Related papers (2024-05-30T04:11:17Z)
- Black-Box Detection of Language Model Watermarks [1.9374282535132377]
We develop rigorous statistical tests to detect the presence of all three most popular watermarking scheme families using only a limited number of black-box queries.
Our findings indicate that current watermarking schemes are more detectable than previously believed, and that obscuring the fact that a watermark was deployed may not be a viable way for providers to protect against adversaries.
arXiv Detail & Related papers (2024-05-28T08:41:30Z)
- Watermark Stealing in Large Language Models [2.1165011830664673]
We show that querying the API of the watermarked LLM to approximately reverse-engineer a watermark enables practical spoofing attacks.
We are the first to propose an automated WS algorithm and use it in the first comprehensive study of spoofing and scrubbing in realistic settings.
arXiv Detail & Related papers (2024-02-29T17:12:39Z)
- Certified Neural Network Watermarks with Randomized Smoothing [64.86178395240469]
We propose a certifiable watermarking method for deep learning models.
We show that our watermark is guaranteed to be unremovable unless the model parameters are changed by more than a certain $\ell_2$ threshold.
Our watermark is also empirically more robust compared to previous watermarking methods.
arXiv Detail & Related papers (2022-07-16T16:06:59Z)
- Fine-tuning Is Not Enough: A Simple yet Effective Watermark Removal Attack for DNN Models [72.9364216776529]
We propose a novel watermark removal attack from a different perspective.
We design a simple yet powerful transformation algorithm by combining imperceptible pattern embedding and spatial-level transformations.
Our attack can bypass state-of-the-art watermarking solutions with very high success rates.
arXiv Detail & Related papers (2020-09-18T09:14:54Z)
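Several of the detection papers above reduce black-box watermark identification to a frequency statistic over generated tokens. A minimal illustrative sketch, assuming a hypothetical KGW-style green-list watermark (the hash partition, `is_green`, and `gamma` are our illustrative choices, not any listed paper's actual scheme):

```python
import hashlib
import math

def is_green(prev_token: int, token: int, gamma: float = 0.5) -> bool:
    """Pseudorandom green/red vocabulary split keyed on the previous token,
    mimicking a hash-based green-list watermark (illustrative only)."""
    h = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < gamma

def green_z_score(tokens: list[int], gamma: float = 0.5) -> float:
    """One-proportion z-test: deviation of the observed green-token
    fraction from the expectation gamma under the no-watermark null."""
    n = len(tokens) - 1
    greens = sum(is_green(p, t, gamma) for p, t in zip(tokens, tokens[1:]))
    return (greens - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# A sequence constructed to always pick green tokens scores far above chance.
watermarked = [0]
for _ in range(200):
    t = 0
    while not is_green(watermarked[-1], t):
        t += 1
    watermarked.append(t)

print(round(green_z_score(watermarked), 2))  # prints 14.14 (= sqrt(200))
```

A z-score far above roughly 4 rejects the no-watermark null hypothesis; real black-box tests must additionally recover the partition itself from repeated queries rather than assume it, which is what makes the detection problem nontrivial.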
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.