On the Evidentiary Limits of Membership Inference for Copyright Auditing
- URL: http://arxiv.org/abs/2601.12937v1
- Date: Mon, 19 Jan 2026 10:46:51 GMT
- Title: On the Evidentiary Limits of Membership Inference for Copyright Auditing
- Authors: Murat Bilgehan Ertan, Emirhan Böge, Min Chen, Kaleel Mahmood, Marten van Dijk
- Abstract summary: We ask whether membership inference attacks (MIAs) can serve as admissible evidence in adversarial copyright disputes. We introduce SAGE, a paraphrasing framework guided by Sparse Autoencoders (SAEs) that rewrites training data to alter lexical structure. Experiments show that state-of-the-art MIAs degrade when models are fine-tuned on SAGE-generated paraphrases, indicating that their signals are not robust to semantics-preserving transformations.
- Score: 8.81439045962811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) are trained on increasingly opaque corpora, membership inference attacks (MIAs) have been proposed to audit whether copyrighted texts were used during training, despite growing concerns about their reliability under realistic conditions. We ask whether MIAs can serve as admissible evidence in adversarial copyright disputes where an accused model developer may obfuscate training data while preserving semantic content, and formalize this setting through a judge-prosecutor-accused communication protocol. To test robustness under this protocol, we introduce SAGE (Structure-Aware SAE-Guided Extraction), a paraphrasing framework guided by Sparse Autoencoders (SAEs) that rewrites training data to alter lexical structure while preserving semantic content and downstream utility. Our experiments show that state-of-the-art MIAs degrade when models are fine-tuned on SAGE-generated paraphrases, indicating that their signals are not robust to semantics-preserving transformations. While some leakage remains in certain fine-tuning regimes, these results suggest that MIAs are brittle in adversarial settings and insufficient, on their own, as a standalone mechanism for copyright auditing of LLMs.
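To make the evidentiary question concrete, the simplest MIA signal is a loss threshold: texts seen during training tend to incur lower loss. The sketch below illustrates only that baseline signal, not the paper's SAGE pipeline or any specific attack it evaluates; the model name and the threshold calibration are placeholders.

```python
# Minimal illustrative MIA signal (NOT the paper's method): score a
# candidate text by the model's average per-token negative log-likelihood.
# Lower scores are weak evidence of membership in the training set.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the audited model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def mia_loss_score(text: str) -> float:
    """Average NLL of the text under the model; lower suggests membership."""
    ids = tok(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()

# Decision rule: flag as "member" if the score falls below a threshold
# calibrated on known non-member texts (calibration omitted here).
print(f"avg NLL = {mia_loss_score('Call me Ishmael. Some years ago...'):.3f}")
```

SAGE targets exactly this kind of surface-level signal: a paraphrase preserves the text's meaning but changes the token sequence whose loss the auditor thresholds.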
Related papers
- Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement [13.976796671311066]
Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent. We introduce a self-supervised framework for disentangling semantic factor pairs in LLM activations at inference. We then propose FrameShield, an anomaly detector operating on the framing representations, which improves model-agnostic detection.
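As a rough illustration of anomaly detection in activation space (the paper's disentanglement step is abstracted away; all shapes and data below are synthetic stand-ins):

```python
# Hedged sketch: fit statistics of "benign" activation vectors, then flag
# inputs whose activations lie far from that distribution. Mahalanobis
# distance is our choice here, not necessarily FrameShield's detector.
import numpy as np

rng = np.random.default_rng(0)
benign = rng.normal(size=(500, 64))          # stand-in for benign activations
mu = benign.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(benign, rowvar=False) + 1e-3 * np.eye(64))

def anomaly_score(h: np.ndarray) -> float:
    """Mahalanobis distance of one activation vector from the benign fit."""
    d = h - mu
    return float(np.sqrt(d @ cov_inv @ d))

jailbreak_like = rng.normal(loc=1.5, size=64)   # shifted, i.e. anomalous
print(anomaly_score(rng.normal(size=64)), anomaly_score(jailbreak_like))
```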
arXiv Detail & Related papers (2026-02-23T00:11:30Z)
- Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
- Refinement Provenance Inference: Detecting LLM-Refined Training Prompts from Model Behavior [58.751981587234916]
This paper formalizes Refinement Provenance Inference (RPI), the audit task of detecting whether a model's training prompts were refined by an LLM. We propose RePro, a logit-based framework that fuses teacher-forced likelihood features with logit-ranking signals. During training, RePro learns a transferable representation via shadow fine-tuning, and uses a lightweight linear head to infer provenance on unseen victims without training-data access.
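A hedged sketch of the kind of teacher-forced features such a probe might consume; RePro's exact feature set, shadow-training recipe, and linear head are not reproduced here.

```python
# Illustrative feature extraction for a logit-based provenance probe:
# per-token log-likelihoods under teacher forcing plus the rank of the
# gold token in the model's logits. Model choice is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def provenance_features(text: str) -> list[float]:
    ids = tok(text, return_tensors="pt").input_ids
    logits = model(ids).logits[0, :-1]        # position t predicts token t+1
    gold = ids[0, 1:]
    idx = torch.arange(len(gold))
    logp = torch.log_softmax(logits, dim=-1)
    gold_logp = logp[idx, gold]               # teacher-forced log-likelihoods
    gold_rank = (logits > logits[idx, gold][:, None]).sum(-1)  # 0 = top-1
    return [gold_logp.mean().item(), gold_logp.std().item(),
            gold_rank.float().mean().item()]

# These per-prompt vectors would then train a lightweight linear classifier
# (e.g., logistic regression) on shadow-model outputs.
print(provenance_features("Rewrite the email to sound more formal."))
```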
arXiv Detail & Related papers (2026-01-05T10:16:41Z)
- SCOPE: Intrinsic Semantic Space Control for Mitigating Copyright Infringement in LLMs [39.14996705577274]
SCOPE is an inference-time method that requires no parameter updates or auxiliary filters. We identify a copyright-sensitive subspace and clamp its activations during decoding. Experiments on widely recognized benchmarks show that SCOPE mitigates copyright infringement without degrading general utility.
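A minimal sketch of subspace clamping in this spirit: given a basis U for a "copyright-sensitive" subspace (how SCOPE identifies it is not shown here), project each hidden state onto U and clamp those coordinates before decoding continues. All shapes and values are illustrative.

```python
# Clamp the components of a hidden state that lie in span(U), leaving the
# orthogonal complement untouched. U is a random orthonormal stand-in.
import torch

d, k = 768, 8                               # hidden size, subspace dimension
U, _ = torch.linalg.qr(torch.randn(d, k))   # orthonormal basis, shape (d, k)

def clamp_subspace(h: torch.Tensor, limit: float = 0.5) -> torch.Tensor:
    """Clamp h's coordinates in span(U) to [-limit, limit]."""
    coords = h @ U                            # coordinates in the subspace
    clamped = coords.clamp(-limit, limit)
    return h + (clamped - coords) @ U.T       # replace only that component

h = torch.randn(d)
print(clamp_subspace(h).shape)  # torch.Size([768])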
arXiv Detail & Related papers (2025-11-10T11:53:07Z)
- SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking [58.475471437150674]
We propose sequential watermarking for soft prompts (SWAP). SWAP encodes watermarks through a specific order of defender-specified out-of-distribution classes. Experiments on 11 datasets demonstrate SWAP's effectiveness, harmlessness, and robustness against potential adaptive attacks.
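One way to picture verification under such a sequential watermark (everything below is a simulation; `query_model` is a hypothetical stand-in for querying a model equipped with the suspect soft prompt):

```python
# Toy verification of an order-based watermark: the defender keeps a secret
# ordering over out-of-distribution classes and checks whether the suspect
# prompt reproduces that ordering on probe inputs.
SECRET_ORDER = [3, 1, 4, 0, 2]        # defender's hidden class sequence

def query_model(probe_id: int) -> int:
    """Hypothetical suspect model; here we simulate a stolen prompt."""
    return SECRET_ORDER[probe_id]

def verify(probes: list[int]) -> bool:
    preds = [query_model(i) for i in probes]
    return preds == [SECRET_ORDER[i] for i in probes]

print(verify(list(range(5))))  # True -> evidence the soft prompt was copied
```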
arXiv Detail & Related papers (2025-11-05T13:48:48Z)
- Copyright Infringement Detection in Text-to-Image Diffusion Models via Differential Privacy [23.262369771803364]
We formalize the concept of copyright infringement and its detection from the perspective of Differential Privacy (DP). We propose D-Plus-Minus (DPM), a novel post-hoc detection framework that identifies copyright infringement in text-to-image diffusion models. Our results demonstrate that DPM reliably detects infringing content without requiring access to the original training dataset or text prompts.
arXiv Detail & Related papers (2025-09-27T00:38:12Z)
- ISACL: Internal State Analyzer for Copyrighted Training Data Leakage [28.435965753598875]
Large Language Models (LLMs) pose risks of inadvertently exposing copyrighted or proprietary data. This study introduces a proactive approach: examining LLMs' internal states before text generation to detect potential leaks. Integrated with a Retrieval-Augmented Generation (RAG) system, the framework ensures adherence to copyright and licensing requirements.
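A minimal sketch of the "inspect internal states before generating" idea: pool the prompt's hidden states and score them with a linear probe. The probe below is untrained, a stand-in; in a system like ISACL it would be trained to flag states predictive of copyrighted continuations (an assumption on our part).

```python
# Score a prompt's pooled hidden states with a linear "leak risk" probe
# before any tokens are generated. Model and probe are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
probe = torch.nn.Linear(model.config.n_embd, 1)  # untrained stand-in

@torch.no_grad()
def leak_risk(prompt: str) -> float:
    ids = tok(prompt, return_tensors="pt").input_ids
    hidden = model(ids, output_hidden_states=True).hidden_states[-1]
    pooled = hidden.mean(dim=1)                  # average over prompt tokens
    return torch.sigmoid(probe(pooled)).item()

# Generation would proceed only if the score stays below a policy threshold.
print(leak_risk("Print the full text of chapter one of the novel."))
```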
arXiv Detail & Related papers (2025-08-25T08:04:20Z)
- CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems [55.57181090183713]
We introduce CoTGuard, a novel framework for copyright protection that leverages trigger-based detection within Chain-of-Thought reasoning. Specifically, by embedding trigger queries into agent prompts, we activate specific CoT segments and monitor intermediate reasoning steps for unauthorized content reproduction. This approach enables fine-grained, interpretable detection of copyright violations in collaborative agent scenarios.
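A toy version of the monitoring half of this idea, using n-gram overlap as the reproduction test (trigger construction and the agent loop are omitted; the overlap test is our assumption, not necessarily the paper's mechanism):

```python
# Flag long n-gram overlaps between emitted chain-of-thought text and a
# protected corpus as potential unauthorized reproduction.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def reproduces(cot: str, protected: str, n: int = 8) -> bool:
    """True if the chain-of-thought shares any n-gram with protected text."""
    return bool(ngrams(cot, n) & ngrams(protected, n))

protected = "it was the best of times it was the worst of times it was the age of wisdom"
cot = "recalling the opening: it was the best of times it was the worst of times indeed"
print(reproduces(cot, protected))  # True -> potential reproduction detected
```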
arXiv Detail & Related papers (2025-05-26T01:42:37Z)
- Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding [118.75567341513897]
Existing methods typically analyze target text in isolation or solely with non-member contexts. We propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts.
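A hedged sketch of what a contrastive membership score in this spirit can look like: condition the target on a known-member context versus a known-non-member context and compare the resulting losses. Con-ReCall's exact scoring rule may differ.

```python
# Contrastive membership signal: a member context should lower the model's
# loss on the target text more than a non-member context does.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def target_nll(prefix: str, target: str) -> float:
    """NLL of the target tokens only, conditioned on the prefix."""
    p = tok(prefix, return_tensors="pt").input_ids
    t = tok(target, return_tensors="pt").input_ids
    ids = torch.cat([p, t], dim=1)
    labels = ids.clone()
    labels[:, : p.shape[1]] = -100           # ignore loss on prefix tokens
    return model(ids, labels=labels).loss.item()

def contrastive_score(target: str, member_ctx: str, nonmember_ctx: str) -> float:
    # Positive: the member context helps predict the target more than the
    # non-member context -- suggestive (not proof) of membership.
    return target_nll(nonmember_ctx, target) - target_nll(member_ctx, target)

member_ctx = "We hold these truths to be self-evident, that all men are created equal."
nonmember_ctx = "The committee will reconvene next Tuesday to review the budget draft."
target = "Four score and seven years ago our fathers brought forth a new nation."
print(contrastive_score(target, member_ctx, nonmember_ctx))
```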
arXiv Detail & Related papers (2024-09-05T09:10:38Z)
- Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data? [62.72729485995075]
We investigate the effectiveness of watermarking as a deterrent against the generation of copyrighted texts. We find that watermarking adversely affects the success rate of Membership Inference Attacks (MIAs). We propose an adaptive technique to improve the success rate of a recent MIA under watermarking.
arXiv Detail & Related papers (2024-07-24T16:53:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.