PoisonedParrot: Subtle Data Poisoning Attacks to Elicit Copyright-Infringing Content from Large Language Models
- URL: http://arxiv.org/abs/2503.07697v1
- Date: Mon, 10 Mar 2025 17:13:30 GMT
- Title: PoisonedParrot: Subtle Data Poisoning Attacks to Elicit Copyright-Infringing Content from Large Language Models
- Authors: Michael-Andrei Panaitescu-Liess, Pankayaraj Pathmanathan, Yigitcan Kaya, Zora Che, Bang An, Sicheng Zhu, Aakriti Agrawal, Furong Huang,
- Abstract summary: We introduce PoisonedParrot: the first stealthy data poisoning attack that induces an LLM to generate copyrighted content. Despite its simplicity, PoisonedParrot is surprisingly effective at priming the model to generate copyrighted content with no discernible side effects. We make the first attempt at mitigating copyright-infringement attacks by proposing a defense: ParrotTrap.
- Score: 31.384367168115503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the capabilities of large language models (LLMs) continue to expand, their usage has become increasingly prevalent. However, as reflected in numerous ongoing lawsuits regarding LLM-generated content, addressing copyright infringement remains a significant challenge. In this paper, we introduce PoisonedParrot: the first stealthy data poisoning attack that induces an LLM to generate copyrighted content even when the model has not been directly trained on the specific copyrighted material. PoisonedParrot integrates small fragments of copyrighted text into the poison samples using an off-the-shelf LLM. Despite its simplicity, evaluated in a wide range of experiments, PoisonedParrot is surprisingly effective at priming the model to generate copyrighted content with no discernible side effects. Moreover, we discover that existing defenses are largely ineffective against our attack. Finally, we make the first attempt at mitigating copyright-infringement poisoning attacks by proposing a defense: ParrotTrap. We encourage the community to explore this emerging threat model further.
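The abstract describes embedding small fragments of copyrighted text into poison samples generated with an off-the-shelf LLM. The sketch below illustrates that general idea only; the fragment length, prompt wording, and the `generate` helper are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the poison-sample idea described in the abstract:
# embed short fragments of a copyrighted passage into otherwise ordinary
# paragraphs produced by an off-the-shelf LLM. Fragment length, prompts,
# and the `generate` helper are assumptions, not the paper's method.

def split_into_fragments(text: str, fragment_len: int = 6) -> list[str]:
    """Split the copyrighted passage into short word-level fragments."""
    words = text.split()
    return [
        " ".join(words[i:i + fragment_len])
        for i in range(0, len(words), fragment_len)
    ]

def make_poison_samples(copyrighted_text: str, generate, n_per_fragment: int = 4) -> list[str]:
    """For each fragment, ask an off-the-shelf LLM to write a benign-looking
    paragraph that contains the fragment verbatim."""
    poison = []
    for fragment in split_into_fragments(copyrighted_text):
        for _ in range(n_per_fragment):
            prompt = (
                "Write a short, natural paragraph that includes the phrase "
                f'"{fragment}" verbatim.'
            )
            poison.append(generate(prompt))  # `generate` wraps any LLM API
    return poison
```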
Related papers
- Certified Mitigation of Worst-Case LLM Copyright Infringement [46.571805194176825]
"copyright takedown" methods are aimed at preventing models from generating content substantially similar to copyrighted ones.
We propose BloomScrub, a remarkably simple yet highly effective inference-time approach that provides certified copyright takedown.
Our results suggest that lightweight, inference-time methods can be surprisingly effective for copyright prevention.
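As a concrete illustration of an inference-time takedown check in this spirit, the sketch below flags generated text whose n-grams overlap a corpus of copyrighted material. The plain Python set (standing in for a space-efficient filter), the n-gram length, and the screening interface are assumptions, not necessarily BloomScrub's exact mechanism.

```python
# Hypothetical sketch of an inference-time copyright screen: flag generations
# that reproduce verbatim n-grams from a copyrighted corpus. A real system
# might use a Bloom filter instead of a Python set; all parameters here are
# assumptions for illustration.

def ngrams(text: str, n: int = 8) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(copyrighted_docs: list[str], n: int = 8) -> set[str]:
    index: set[str] = set()
    for doc in copyrighted_docs:
        index |= ngrams(doc, n)
    return index

def contains_verbatim_overlap(generated: str, index: set[str], n: int = 8) -> bool:
    """Return True if the generation reproduces any indexed n-gram verbatim."""
    return bool(ngrams(generated, n) & index)
```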
arXiv Detail & Related papers (2025-04-22T17:16:53Z)
- CopyrightShield: Spatial Similarity Guided Backdoor Defense against Copyright Infringement in Diffusion Models [61.06621533874629]
Diffusion models are a prime target for copyright infringement attacks. This paper provides an in-depth analysis of the spatial similarity of replication in diffusion models. We propose a novel defense method specifically targeting copyright infringement attacks.
arXiv Detail & Related papers (2024-12-02T14:19:44Z)
- SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation [24.644101178288476]
Large Language Models (LLMs) have transformed machine learning but raised significant legal concerns.
LLMs may infringe on copyrights or overly restrict non-copyrighted texts.
We propose a lightweight, real-time defense to prevent the generation of copyrighted text.
arXiv Detail & Related papers (2024-06-18T18:00:03Z)
- Defending LLMs against Jailbreaking Attacks via Backtranslation [61.878363293735624]
We propose a new method for defending LLMs against jailbreaking attacks by "backtranslation".
The prompt inferred from the model's initial response is called the backtranslated prompt, which tends to reveal the actual intent of the original prompt.
We empirically demonstrate that our defense significantly outperforms the baselines.
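A minimal sketch of the backtranslation idea summarized above: infer what prompt the model's initial response answers, then refuse if the model itself refuses that inferred prompt. The `llm` and `is_refusal` helpers and the inference prompt are hypothetical, not the paper's implementation.

```python
# Hypothetical sketch of a backtranslation-style defense. `llm` and
# `is_refusal` are assumed helpers, not the paper's implementation.

def defend_with_backtranslation(user_prompt: str, llm, is_refusal) -> str:
    response = llm(user_prompt)
    if is_refusal(response):
        return response  # the model already refused

    # Infer the "backtranslated" prompt that this response answers.
    backtranslated = llm(
        "Infer the question or instruction that the following text answers:\n"
        + response
    )

    # If the model refuses the backtranslated prompt, the original intent is likely harmful.
    if is_refusal(llm(backtranslated)):
        return "I can't help with that."
    return response
```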
arXiv Detail & Related papers (2024-02-26T10:03:33Z)
- Round Trip Translation Defence against Large Language Model Jailbreaking Attacks [11.593052831056841]
We propose the first algorithm specifically designed to defend against social-engineered attacks on large language models.
Our defense successfully mitigated over 70% of Prompt Automatic Iterative Refinement (PAIR) attacks.
We are also the first to attempt mitigating MathsAttack, reducing its attack success rate by almost 40%.
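A minimal sketch of the round-trip translation idea: paraphrase the incoming prompt by translating it through pivot languages and back before it reaches the model, which tends to disrupt carefully engineered adversarial phrasing. The `translate` helper and pivot languages are assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of a round-trip translation defense. `translate` stands
# in for any machine-translation API; pivot languages are assumptions.

def round_trip(prompt: str, translate, pivots=("fr", "de")) -> str:
    text = prompt
    for lang in pivots:
        text = translate(text, target=lang)
    return translate(text, target="en")

def defended_query(prompt: str, llm, translate) -> str:
    paraphrased = round_trip(prompt, translate)
    return llm(paraphrased)  # the model only ever sees the paraphrase
```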
arXiv Detail & Related papers (2024-02-21T03:59:52Z)
- The Stronger the Diffusion Model, the Easier the Backdoor: Data Poisoning to Induce Copyright Breaches Without Adjusting Finetuning Pipeline [30.80691226540351]
We formalize the Copyright Infringement Attack on generative AI models and propose a backdoor attack method, SilentBadDiffusion.
Our method strategically embeds connections between pieces of copyrighted information and text references in poisoning data.
Our experiments show the stealth and efficacy of the poisoning data.
arXiv Detail & Related papers (2024-01-07T08:37:29Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
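A minimal sketch of the perturb-and-aggregate idea described above: query the model on several randomly character-perturbed copies of the prompt and take a majority vote over whether the replies are refusals. The perturbation rate and the `llm` / `is_refusal` helpers are assumptions, not the paper's exact algorithm.

```python
import random
import string

# Hypothetical sketch of a perturb-and-aggregate defense: adversarial suffixes
# are brittle to character-level noise, so we query several perturbed copies
# of the prompt and majority-vote. Parameters and helpers are assumptions.

def perturb(prompt: str, rate: float = 0.1) -> str:
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)

def smooth_query(prompt: str, llm, is_refusal, n_copies: int = 8) -> str:
    replies = [llm(perturb(prompt)) for _ in range(n_copies)]
    refusals = sum(is_refusal(r) for r in replies)
    if refusals > n_copies // 2:
        return "I can't help with that."
    # Return a reply consistent with the (non-refusing) majority outcome.
    return next(r for r in replies if not is_refusal(r))
```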
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
- Breaking the De-Pois Poisoning Defense [0.0]
We show that the attack-agnostic De-Pois defense is hardly an exception to the rule that poisoning defenses can themselves be circumvented.
In our work, we break this poison-protection layer by replicating the critic model and then performing a composed gradient-sign attack on both the critic and target models simultaneously.
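A minimal sketch of the composed gradient-sign idea described above: combine the losses of a replicated critic model and the target model and take a single sign step against both at once. The loss weighting, step size, and model interfaces are assumptions, not the paper's exact attack.

```python
import torch

# Hypothetical sketch of a composed gradient-sign step against both a
# replicated critic and the target model. Loss weighting, step size, and
# the model interfaces are assumptions for illustration only.

def composed_fgsm_step(x, y, target_model, critic_model, loss_fn,
                       eps: float = 0.03, alpha: float = 0.5):
    x = x.clone().detach().requires_grad_(True)

    # Loss on the target model (the model we ultimately want to affect) ...
    target_loss = loss_fn(target_model(x), y)
    # ... combined with loss on the replicated critic (the screening model).
    critic_loss = loss_fn(critic_model(x), y)
    combined = alpha * target_loss + (1 - alpha) * critic_loss

    combined.backward()
    # One sign step computed from the combined gradient of both models.
    return (x + eps * x.grad.sign()).detach()
```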
arXiv Detail & Related papers (2022-04-03T15:17:47Z)
- MultAV: Multiplicative Adversarial Videos [71.94264837503135]
We propose a novel attack method against video recognition models, Multiplicative Adversarial Videos (MultAV).
MultAV imposes perturbation on video data by multiplication.
Experimental results show that models adversarially trained against additive attacks are less robust to MultAV.
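To make the contrast concrete, the sketch below compares an additive perturbation with a multiplicative one applied to a video tensor. The epsilon bounds and clipping range are assumptions for illustration, not the paper's configuration.

```python
import torch

# Hypothetical sketch contrasting additive and multiplicative perturbations.
# Epsilon values and the [0, 1] clipping range are assumptions.

def additive_perturb(video: torch.Tensor, delta: torch.Tensor, eps: float = 0.03):
    """Classic additive perturbation: x' = x + delta, with delta bounded by eps."""
    return (video + delta.clamp(-eps, eps)).clamp(0.0, 1.0)

def multiplicative_perturb(video: torch.Tensor, ratio: torch.Tensor, eps: float = 0.03):
    """MultAV-style multiplicative perturbation: x' = x * ratio,
    with the ratio kept close to 1 (here within [1 - eps, 1 + eps])."""
    return (video * ratio.clamp(1.0 - eps, 1.0 + eps)).clamp(0.0, 1.0)
```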
arXiv Detail & Related papers (2020-09-17T04:34:39Z)