Towards Copyright Protection for Knowledge Bases of Retrieval-augmented Language Models via Reasoning
- URL: http://arxiv.org/abs/2502.10440v2
- Date: Fri, 23 May 2025 15:35:10 GMT
- Title: Towards Copyright Protection for Knowledge Bases of Retrieval-augmented Language Models via Reasoning
- Authors: Junfeng Guo, Yiming Li, Ruibo Chen, Yihan Wu, Chenxi Liu, Yanshuo Chen, Heng Huang
- Abstract summary: Large language models (LLMs) are increasingly integrated into real-world personalized applications. The valuable and often proprietary nature of the knowledge bases used in RAG introduces the risk of unauthorized usage by adversaries. Existing methods that can be generalized as watermarking techniques to protect these knowledge bases typically involve poisoning or backdoor attacks. We propose \name{} for `harmless' copyright protection of knowledge bases.
- Score: 58.57194301645823
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly integrated into real-world personalized applications through retrieval-augmented generation (RAG) mechanisms to supplement their responses with domain-specific knowledge. However, the valuable and often proprietary nature of the knowledge bases used in RAG introduces the risk of unauthorized usage by adversaries. Existing methods that can be generalized as watermarking techniques to protect these knowledge bases typically involve poisoning or backdoor attacks. However, these methods require altering the LLM's results on verification samples, inevitably making these watermarks susceptible to anomaly detection and even introducing new security risks. To address these challenges, we propose \name{} for `harmless' copyright protection of knowledge bases. Instead of manipulating the LLM's final output, \name{} implants distinct yet benign verification behaviors in the space of chain-of-thought (CoT) reasoning, maintaining the correctness of the final answer. Our method has three main stages: (1) Generating CoTs: For each verification question, we generate two `innocent' CoTs, including a target CoT for building watermark behaviors; (2) Optimizing Watermark Phrases and Target CoTs: Inspired by our theoretical analysis, we optimize them to minimize retrieval errors under the \emph{black-box} and \emph{text-only} setting of the suspicious LLM, ensuring that only watermarked verification queries can retrieve their corresponding target CoTs contained in the knowledge base; (3) Ownership Verification: We exploit a pairwise Wilcoxon test to verify whether a suspicious LLM is augmented with the protected knowledge base by comparing its responses to watermarked and benign verification queries. Our experiments on diverse benchmarks demonstrate that \name{} effectively protects knowledge bases and is resistant to adaptive attacks.
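As an illustration of stage (3), the sketch below shows how such a pairwise comparison could be run in Python with SciPy's Wilcoxon signed-rank test. The `query_llm` interface, the token-overlap similarity scorer, and all function names are assumptions introduced for illustration, not the authors' implementation.

```python
# Hypothetical sketch of ownership verification (stage 3). The similarity scorer and
# the black-box `query_llm` interface are assumptions, not the paper's actual code.
from scipy.stats import wilcoxon


def similarity(response: str, target_cot: str) -> float:
    """Toy similarity: fraction of target-CoT tokens appearing in the response.
    The paper's actual scoring of CoT behaviors may differ."""
    target_tokens = set(target_cot.lower().split())
    response_tokens = set(response.lower().split())
    return len(target_tokens & response_tokens) / max(len(target_tokens), 1)


def verify_ownership(query_llm, questions, watermark_phrase, target_cots, alpha=0.05):
    """Compare a suspicious LLM's responses to watermarked vs. benign verification queries.

    query_llm(prompt) -> str is a black-box, text-only interface to the suspect model.
    Returns True if responses to watermarked queries align with the target CoTs
    significantly more than responses to benign queries (one-sided pairwise test)."""
    wm_scores, benign_scores = [], []
    for question, target_cot in zip(questions, target_cots):
        # Watermarked query: the question prefixed with the optimized watermark phrase.
        wm_scores.append(similarity(query_llm(f"{watermark_phrase} {question}"), target_cot))
        # Benign query: the plain verification question.
        benign_scores.append(similarity(query_llm(question), target_cot))
    statistic, p_value = wilcoxon(wm_scores, benign_scores, alternative="greater")
    return p_value < alpha
```

The one-sided alternative reflects the ownership claim: if the suspicious LLM is augmented with the protected knowledge base, its responses to watermarked queries should match the implanted target CoTs measurably more than its responses to benign queries.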
Related papers
- LeakSealer: A Semisupervised Defense for LLMs Against Prompt Injection and Leakage Attacks [7.115093658017371]
LeakSealer is a model-agnostic framework that combines static analysis for forensic insights with dynamic defenses in a Human-In-The-Loop pipeline. We empirically evaluate LeakSealer under two scenarios: (1) jailbreak attempts, employing a public benchmark dataset, and (2) PII leakage, supported by a curated dataset of labeled LLM interactions.
arXiv Detail & Related papers (2025-08-01T13:04:28Z) - When LLMs Copy to Think: Uncovering Copy-Guided Attacks in Reasoning LLMs [30.532439965854767]
Large Language Models (LLMs) have become integral to automated code analysis, enabling tasks such as vulnerability detection and code comprehension. In this paper, we identify and investigate a new class of prompt-based attacks, termed Copy-Guided Attacks (CGA). We show that CGA reliably induces infinite loops, premature termination, false refusals, and semantic distortions in code analysis tasks.
arXiv Detail & Related papers (2025-07-22T17:21:36Z) - Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks [0.5218155982819203]
Large Language Models (LLMs) are increasingly used as code assistants. This study examines a more direct threat: open-source LLMs generating vulnerable code when prompted.
arXiv Detail & Related papers (2025-07-14T08:36:26Z) - CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems [55.57181090183713]
We introduce CoTGuard, a novel framework for copyright protection that leverages trigger-based detection within Chain-of-Thought reasoning. Specifically, by embedding specific trigger queries into agent prompts, we can activate specific CoT segments and monitor intermediate reasoning steps for unauthorized content reproduction. This approach enables fine-grained, interpretable detection of copyright violations in collaborative agent scenarios.
arXiv Detail & Related papers (2025-05-26T01:42:37Z) - In-Context Watermarks for Large Language Models [71.29952527565749]
In-Context Watermarking (ICW) embeds watermarks into generated text solely through prompt engineering. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach.
arXiv Detail & Related papers (2025-05-22T17:24:51Z) - Defending LLM Watermarking Against Spoofing Attacks with Contrastive Representation Learning [34.76886510334969]
A piggyback attack can maliciously alter the meaning of watermarked text, transforming it into hate speech, while preserving the original watermark. We propose a semantic-aware watermarking algorithm that embeds watermarks into a given target text while preserving its original meaning.
arXiv Detail & Related papers (2025-04-09T04:38:17Z) - Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning [34.73320827764541]
Text-to-Image (T2I) models typically deploy safety filters to prevent the generation of sensitive images.
Recent jailbreaking attack methods manually design prompts for the LLM to generate adversarial prompts.
We propose Reason2Attack (R2A), which aims to enhance the LLM's reasoning capabilities in generating adversarial prompts.
arXiv Detail & Related papers (2025-03-23T08:40:39Z) - Dataset Protection via Watermarked Canaries in Retrieval-Augmented LLMs [67.0310240737424]
We introduce a novel approach to safeguard the ownership of text datasets and effectively detect unauthorized use by the RA-LLMs. Our approach preserves the original data completely unchanged while protecting it by inserting specifically designed canary documents into the IP dataset. During the detection process, unauthorized usage is identified by querying the canary documents and analyzing the responses of RA-LLMs.
arXiv Detail & Related papers (2025-02-15T04:56:45Z) - RAG-WM: An Efficient Black-Box Watermarking Approach for Retrieval-Augmented Generation of Large Language Models [24.88433543377822]
We propose a novel black-box "knowledge watermark" approach, named RAG-WM, to detect IP infringement of RAGs. RAG-WM uses a multi-LLM interaction framework to create watermark texts based on watermark entity-relationships and inject them into the target RAG. Experimental results show that RAG-WM effectively detects the stolen RAGs in various deployed LLMs.
arXiv Detail & Related papers (2025-01-09T14:01:15Z) - TrustRAG: Enhancing Robustness and Trustworthiness in RAG [31.231916859341865]
TrustRAG is a framework that systematically filters compromised and irrelevant contents before they are retrieved for generation. TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance compared to existing approaches.
arXiv Detail & Related papers (2025-01-01T15:57:34Z) - Bileve: Securing Text Provenance in Large Language Models Against Spoofing with Bi-level Signature [39.973130114073605]
We introduce a bi-level signature scheme, Bileve, which embeds fine-grained signature bits for integrity checks.
Bileve can differentiate 5 scenarios during detection, reliably tracing text and regulating LLMs.
arXiv Detail & Related papers (2024-06-04T03:58:14Z) - DIP-Watermark: A Double Identity Protection Method Based on Robust Adversarial Watermark [13.007649270429493]
Face Recognition (FR) systems pose privacy risks.
One countermeasure is the adversarial attack, which deceives unauthorized malicious FR systems.
We propose the first double identity protection scheme based on traceable adversarial watermarking.
arXiv Detail & Related papers (2024-04-23T02:50:38Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - WatME: Towards Lossless Watermarking Through Lexical Redundancy [58.61972059246715]
This study assesses the impact of watermarking on different capabilities of large language models (LLMs) through a cognitive science lens.
We introduce Watermarking with Mutual Exclusion (WatME) to seamlessly integrate watermarks.
arXiv Detail & Related papers (2023-11-16T11:58:31Z) - Certifying LLM Safety against Adversarial Prompting [70.96868018621167]
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt. We introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees.
arXiv Detail & Related papers (2023-09-06T04:37:20Z) - Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z) - ADC: Adversarial attacks against object Detection that evade Context consistency checks [55.8459119462263]
We show that even context consistency checks can be brittle to properly crafted adversarial examples.
We propose an adaptive framework to generate examples that subvert such defenses.
Our results suggest that how to robustly model context and check its consistency, is still an open problem.
arXiv Detail & Related papers (2021-10-24T00:25:09Z)