A Watermark for Black-Box Language Models
- URL: http://arxiv.org/abs/2410.02099v2
- Date: Thu, 03 Apr 2025 21:13:21 GMT
- Title: A Watermark for Black-Box Language Models
- Authors: Dara Bahri, John Wieting
- Abstract summary: We propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
- Score: 31.772364827073808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require white-box access to the model's next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e. black-box access), boasts a distortion-free property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
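The exact construction and its guarantees are given in the paper itself; as a rough, hedged illustration of the general sample-then-select idea behind sampling-only (black-box) watermarks, the Python sketch below draws several candidate generations from an API, keeps the one with the highest keyed pseudorandom score, and at detection time flags text whose score is improbably high under a uniform null. The helper names (`sample_from_api`, `keyed_score`), the SHA-256-based stand-in PRF, and the choice of k are assumptions of this sketch, not details from the paper.

```python
import hashlib
from typing import Callable, List

def keyed_score(text: str, key: bytes) -> float:
    """Keyed pseudorandom map from text to [0, 1): a stand-in PRF built on SHA-256."""
    digest = hashlib.sha256(key + text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def watermarked_generate(sample_from_api: Callable[[], str], key: bytes, k: int = 8) -> str:
    """Black-box embedding: sample k candidate sequences, return the highest-scoring one."""
    candidates: List[str] = [sample_from_api() for _ in range(k)]
    return max(candidates, key=lambda text: keyed_score(text, key))

def watermark_p_value(text: str, key: bytes) -> float:
    """Detection: under the no-watermark null the keyed score is ~Uniform(0, 1),
    so 1 - score is a valid p-value; small values suggest the text was chosen
    to maximize the keyed score."""
    return 1.0 - keyed_score(text, key)
```

A single max-of-k score over a whole sequence has limited detection power; practical schemes aggregate evidence across many tokens or chunks, which is one reason the paper's actual construction is more involved than this sketch.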
Related papers
- Learning on LLM Output Signatures for gray-box LLM Behavior Analysis [52.81120759532526]
Large Language Models (LLMs) have achieved widespread adoption, yet our understanding of their behavior remains limited.
We develop a transformer-based approach to process LLM output signatures (LOS) that theoretically guarantees approximation of existing techniques.
Our approach achieves superior performance on hallucination and data contamination detection in gray-box settings.
arXiv Detail & Related papers (2025-03-18T09:04:37Z) - Logits are All We Need to Adapt Closed Models [15.227768874282834]
Many commercial Large Language Models (LLMs) are closed-source, limiting developers to prompt tuning for aligning content generation with specific applications.
We argue that if such access were available, it would enable more powerful adaptation techniques beyond prompt engineering.
We propose a token-level probability reweighting framework that steers black-box LLMs toward application-specific content generation.
arXiv Detail & Related papers (2025-02-03T22:24:22Z) - Predicting the Performance of Black-box LLMs through Self-Queries [60.87193950962585]
As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial.
In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations.
We demonstrate that training a linear model on these low-dimensional representations produces reliable predictors of model performance at the instance level.
arXiv Detail & Related papers (2025-01-02T22:26:54Z) - Glimpse: Enabling White-Box Methods to Use Proprietary Models for Zero-Shot LLM-Generated Text Detection [15.902823469821431]
Glimpse is a probability distribution estimation approach that predicts full distributions from partial observations.
We extend white-box methods like Entropy, Rank, Log-Rank, and Fast-DetectGPT to the latest proprietary models.
Experiments show that Glimpse with Fast-DetectGPT and GPT-3.5 achieves an average AUROC of about 0.95 across five of the latest source models.
arXiv Detail & Related papers (2024-12-16T07:28:36Z) - Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models [2.740881223898167]
We propose Zer0-Jack, a method that bypasses the need for white-box access by leveraging zeroth-order optimization.
Zer0-Jack achieves a high attack success rate across various models, surpassing previous transfer-based methods.
We show that Zer0-Jack can directly attack commercial MLLMs such as GPT-4o.
arXiv Detail & Related papers (2024-11-12T05:24:02Z) - NSmark: Null Space Based Black-box Watermarking Defense Framework for Pre-trained Language Models [24.864736672581937]
We propose a task-agnostic, black-box watermarking scheme capable of resisting LL-LFEA attacks.
NSmark consists of three phases: (i) watermark generation using the digital signature of the owner, enhanced by spread spectrum modulation for increased robustness; (ii) watermark embedding through an output mapping extractor that preserves PLM performance while maximizing watermark capacity; (iii) watermark verification, assessed by extraction rate and null space conformity.
arXiv Detail & Related papers (2024-10-16T14:45:27Z) - UTF:Undertrained Tokens as Fingerprints A Novel Approach to LLM Identification [23.164580168870682]
Fingerprinting large language models (LLMs) is essential for verifying model ownership, ensuring authenticity, and preventing misuse.
In this paper, we introduce a novel and efficient approach to fingerprinting LLMs by leveraging under-trained tokens.
Our method has minimal overhead and impact on the model's performance, and does not require white-box access to the target model for ownership identification.
arXiv Detail & Related papers (2024-10-16T07:36:57Z) - On Unsupervised Prompt Learning for Classification with Black-box Language Models [71.60563181678323]
Large language models (LLMs) have achieved impressive success in text-formatted learning problems.
LLMs can label datasets with even better quality than skilled human annotators.
In this paper, we propose unsupervised prompt learning for classification with black-box LLMs.
arXiv Detail & Related papers (2024-10-04T03:39:28Z) - Large Language Model Watermark Stealing With Mixed Integer Programming [51.336009662771396]
Large Language Model (LLM) watermarking shows promise in addressing copyright concerns, monitoring AI-generated text, and preventing its misuse.
Recent research indicates that watermarking methods using numerous keys are susceptible to removal attacks.
We propose a novel green list stealing attack against the state-of-the-art LLM watermark scheme.
arXiv Detail & Related papers (2024-05-30T04:11:17Z) - Black-Box Detection of Language Model Watermarks [1.9374282535132377]
We develop rigorous statistical tests to detect the presence of all three most popular watermarking scheme families using only a limited number of black-box queries.
Our findings indicate that current watermarking schemes are more detectable than previously believed, and that obscuring the fact that a watermark was deployed may not be a viable way for providers to protect against adversaries.
arXiv Detail & Related papers (2024-05-28T08:41:30Z) - AquaLoRA: Toward White-box Protection for Customized Stable Diffusion Models via Watermark LoRA [67.68750063537482]
Diffusion models have achieved remarkable success in generating high-quality images.
Recent works aim to let Stable Diffusion (SD) models output watermarked content for post-hoc forensics.
We propose AquaLoRA as the first implementation under this scenario.
arXiv Detail & Related papers (2024-05-18T01:25:47Z) - Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models [121.0693322732454]
This paper proposes CraFT, a collaborative fine-tuning approach for adapting black-box vision-language models to downstream tasks.
CraFT comprises two modules, a prompt generation module for learning text prompts and a prediction refinement module for enhancing output predictions in residual style.
Experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT.
arXiv Detail & Related papers (2024-02-06T14:53:19Z) - A Semantic Invariant Robust Watermark for Large Language Models [27.522264953691746]
Prior watermark algorithms face a trade-off between attack robustness and security robustness.
This is because the watermark logits for a token are determined by a certain number of preceding tokens.
We propose a semantic invariant watermarking method for LLMs that provides both attack robustness and security robustness.
arXiv Detail & Related papers (2023-10-10T06:49:43Z) - A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters; a simplified detection sketch in this spirit appears after this list.
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
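As a companion to the last entry above ("A Watermark for Large Language Models"), here is a heavily simplified detection sketch in the spirit of green-list watermarking: each token's membership in a keyed "green" set is rederived from its predecessor, and a one-sided z-test checks whether green tokens are over-represented. The whitespace tokenizer, the SHA-256-based keyed rule, and gamma = 0.5 are assumptions of this sketch, not details from that paper.

```python
import hashlib
import math
from typing import List

def is_green(prev_token: str, token: str, key: bytes, gamma: float = 0.5) -> bool:
    """Keyed pseudorandom rule deciding whether `token` is 'green' given its predecessor."""
    digest = hashlib.sha256(key + prev_token.encode() + b"\x00" + token.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma

def green_z_score(tokens: List[str], key: bytes, gamma: float = 0.5) -> float:
    """One-sided z-statistic comparing the green-token count to its gamma*n expectation."""
    n = len(tokens) - 1
    hits = sum(is_green(tokens[i - 1], tokens[i], key) for i in range(1, len(tokens)))
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# Usage: split on whitespace as a crude tokenizer and flag large z-scores (e.g. z > 4).
z = green_z_score("text to be checked for the watermark signal".split(), key=b"secret")
```

Detection of this form needs only the secret key and a tokenizer, which is what makes such schemes attractive when the model itself is proprietary.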