Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models
- URL: http://arxiv.org/abs/2402.06659v2
- Date: Mon, 14 Oct 2024 16:17:34 GMT
- Title: Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models
- Authors: Yuancheng Xu, Jiarui Yao, Manli Shu, Yanchao Sun, Zichu Wu, Ning Yu, Tom Goldstein, Furong Huang
- Abstract summary: This study takes the first step in exposing Vision-Language Models' susceptibility to data poisoning attacks.
We introduce Shadowcast, a stealthy data poisoning attack where poison samples are visually indistinguishable from benign images.
We show that Shadowcast effectively achieves the attacker's intentions using as few as 50 poison samples.
- Score: 73.37389786808174
- Abstract: Vision-Language Models (VLMs) excel in generating textual responses from visual inputs, but their versatility raises security concerns. This study takes the first step in exposing VLMs' susceptibility to data poisoning attacks that can manipulate responses to innocuous, everyday prompts. We introduce Shadowcast, a stealthy data poisoning attack where poison samples are visually indistinguishable from benign images with matching texts. Shadowcast demonstrates effectiveness in two attack types. The first is a traditional Label Attack, tricking VLMs into misidentifying class labels, such as confusing Donald Trump for Joe Biden. The second is a novel Persuasion Attack, leveraging VLMs' text generation capabilities to craft persuasive and seemingly rational narratives for misinformation, such as portraying junk food as healthy. We show that Shadowcast effectively achieves the attacker's intentions using as few as 50 poison samples. Crucially, the poisoned samples demonstrate transferability across different VLM architectures, posing a significant concern in black-box settings. Moreover, Shadowcast remains potent under realistic conditions involving various text prompts, training data augmentation, and image compression techniques. This work reveals how poisoned VLMs can disseminate convincing yet deceptive misinformation to everyday, benign users, emphasizing the importance of data integrity for responsible VLM deployments. Our code is available at: https://github.com/umd-huang-lab/VLM-Poisoning.
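The abstract emphasizes that each poison sample pairs a visually benign image with its matching text. One way to realize such clean-looking poisons is to add a small, bounded pixel perturbation that moves an image's features toward another concept in the victim model's vision encoder. The sketch below illustrates that feature-matching idea in PyTorch; the `encoder` interface, the function name `craft_poison`, and all hyperparameters are assumptions for exposition, not the authors' released implementation (see the repository linked above for the actual code).

```python
# Minimal sketch (assumed interfaces): perturb a destination-concept image so a
# frozen vision encoder maps it near an original-concept image, while keeping
# the perturbation within an L-infinity budget so the image stays visually benign.
import torch
import torch.nn.functional as F

def craft_poison(encoder, dest_image, orig_image, eps=8 / 255, steps=200, lr=1 / 255):
    """Return a poison image that looks like `dest_image` but whose features
    imitate `orig_image`. Both inputs are (C, H, W) tensors in [0, 1]."""
    encoder.eval()
    with torch.no_grad():
        target_feat = encoder(orig_image.unsqueeze(0))  # features to imitate
    delta = torch.zeros_like(dest_image, requires_grad=True)
    for _ in range(steps):
        poison = (dest_image + delta).clamp(0, 1).unsqueeze(0)
        loss = F.mse_loss(encoder(poison), target_feat)  # feature-space gap
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()  # signed gradient step on the gap
            delta.clamp_(-eps, eps)          # keep the change imperceptible
            delta.grad.zero_()
    return (dest_image + delta).detach().clamp(0, 1)
```

Pairing the returned image with the destination concept's own caption keeps the (image, text) pair visually consistent for a human inspector, which is what makes such poisons hard to filter.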
Related papers
- The Victim and The Beneficiary: Exploiting a Poisoned Model to Train a Clean Model on Poisoned Data [4.9676716806872125]
Backdoor attacks pose a serious security threat to the training process of deep neural networks (DNNs).
We propose a novel dual-network training framework: The Victim and The Beneficiary (V&B), which exploits a poisoned model to train a clean model without extra benign samples.
Our framework is effective in preventing backdoor injection and robust to various attacks while maintaining the performance on benign samples.
arXiv Detail & Related papers (2024-04-17T11:15:58Z) - ImgTrojan: Jailbreaking Vision-Language Models with ONE Image [37.80216561793555]
We propose a novel jailbreaking attack against vision-language models (VLMs).
We assume a scenario where our poisoned (image, text) data pairs are included in the training data.
By replacing the original textual captions with malicious jailbreak prompts, our method can perform jailbreak attacks with the poisoned images.
arXiv Detail & Related papers (2024-03-05T12:21:57Z) - Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks [58.10730906004818]
Typographic attacks, which add misleading text to images, can deceive large vision-language models (LVLMs).
Our experiments show these attacks significantly reduce classification performance by up to 60%.
arXiv Detail & Related papers (2024-02-01T14:41:20Z) - From Trojan Horses to Castle Walls: Unveiling Bilateral Data Poisoning Effects in Diffusion Models [19.140908259968302]
We investigate whether BadNets-like data poisoning methods can directly degrade generation by diffusion models (DMs).
We show that a BadNets-like data poisoning attack remains effective in DMs for producing incorrect images.
Poisoned DMs exhibit an increased ratio of triggers in the images they generate, a phenomenon we refer to as "trigger amplification".
arXiv Detail & Related papers (2023-11-04T11:00:31Z) - Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models [26.301156075883483]
We show that poisoning attacks can be successful on generative models.
We introduce Nightshade, an optimized prompt-specific poisoning attack.
We show that Nightshade attacks can destabilize general features in a text-to-image generative model.
arXiv Detail & Related papers (2023-10-20T21:54:10Z) - Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models [102.63973600144308]
Open-source large language models can be easily subverted to generate harmful content.
Experiments across 8 models released by 5 different organizations demonstrate the effectiveness of the shadow alignment attack.
This study serves as a clarion call for a collective effort to overhaul and fortify the safety of open-source LLMs against malicious attackers.
arXiv Detail & Related papers (2023-10-04T16:39:31Z) - Adversarial Examples Make Strong Poisons [55.63469396785909]
We show that adversarial examples, originally intended for attacking pre-trained models, are even more effective for data poisoning than recent methods designed specifically for poisoning.
Our method, adversarial poisoning, is substantially more effective than existing poisoning methods for secure dataset release.
arXiv Detail & Related papers (2021-06-21T01:57:14Z) - Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching [56.280018325419896]
Data poisoning attacks modify training data to maliciously control a model trained on such data.
We analyze a particularly malicious poisoning attack that is both "from scratch" and "clean label".
We show that it is the first poisoning method to cause targeted misclassification in modern deep networks trained from scratch on a full-sized, poisoned ImageNet dataset.
arXiv Detail & Related papers (2020-09-04T16:17:54Z) - Spanning Attack: Reinforce Black-box Attacks with Unlabeled Data [96.92837098305898]
Black-box attacks aim to craft adversarial perturbations by querying input-output pairs of machine learning models.
Black-box attacks often suffer from the issue of query inefficiency due to the high dimensionality of the input space.
We propose a novel technique called the spanning attack, which constrains adversarial perturbations in a low-dimensional subspace via spanning an auxiliary unlabeled dataset.
arXiv Detail & Related papers (2020-05-11T05:57:15Z)
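To make the subspace idea in the spanning attack entry above concrete, here is a minimal, hypothetical PyTorch sketch of restricting perturbation search to the span of an auxiliary unlabeled dataset; the function names, the use of `torch.pca_lowrank`, and the example shapes are illustrative assumptions rather than the paper's implementation.

```python
# Sketch (assumed interfaces): search for black-box perturbations over k
# subspace coefficients instead of all d pixels, where the subspace is
# spanned by flattened auxiliary unlabeled images.
import torch

def build_spanning_basis(aux_images, k=64):
    """aux_images: (n, d) tensor of flattened unlabeled images.
    Returns a (d, k) matrix whose columns form an (approximately) orthonormal basis."""
    # torch.pca_lowrank centers the data and returns the right singular vectors.
    _, _, v = torch.pca_lowrank(aux_images, q=k)
    return v

def sample_subspace_direction(basis):
    """Draw a random unit perturbation direction inside the subspace."""
    coeffs = torch.randn(basis.shape[1])
    direction = basis @ coeffs
    return direction / direction.norm()

# Example usage with made-up shapes: a query-based attack would now optimize
# the k coefficients rather than d raw pixels, cutting query cost.
basis = build_spanning_basis(torch.rand(500, 3 * 32 * 32), k=64)
delta = 0.03 * sample_subspace_direction(basis)  # one candidate perturbation
```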