Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
- URL: http://arxiv.org/abs/2402.00626v2
- Date: Fri, 16 Feb 2024 15:15:38 GMT
- Title: Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
- Authors: Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, Bryan A.
Plummer
- Abstract summary: Typographic Attacks, which involve pasting misleading text onto an image, were noted to harm the performance of Vision-Language Models like CLIP.
We introduce two novel and more effective Self-Generated attacks that prompt the LVLM to generate an attack against itself.
Using our benchmark, we uncover that Self-Generated attacks pose a significant threat, reducing LVLMs' classification performance by up to 33%.
- Score: 62.34019142949628
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Typographic Attacks, which involve pasting misleading text onto an image,
were noted to harm the performance of Vision-Language Models like CLIP.
However, the susceptibility of recent Large Vision-Language Models (LVLMs) to
these attacks remains understudied. Furthermore, prior work's typographic
attacks against CLIP randomly sample a misleading class from a predefined set
of categories. However, this simple strategy misses more effective attacks that
exploit the stronger language skills of LVLMs. To address these issues, we first
introduce a benchmark for testing typographic attacks against LVLMs.
Moreover, we introduce two novel and more effective Self-Generated
attacks which prompt the LVLM to generate an attack against itself: 1) a
Class-Based Attack, where the LVLM (e.g. LLaVA) is asked which deceiving class
is most similar to the target class, and 2) a Descriptive Attack, where a more
advanced LVLM (e.g. GPT-4V) is asked to recommend a typographic attack that
includes both a deceiving class and a description. Using our benchmark, we
uncover that Self-Generated attacks pose a significant threat, reducing LVLMs'
classification performance by up to 33%. We also uncover that attacks
generated by one model (e.g. GPT-4V or LLaVA) are effective against the model
itself and against other models like InstructBLIP and MiniGPT4. Code:
https://github.com/mqraitem/Self-Gen-Typo-Attack
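The two Self-Generated strategies boil down to (1) prompting an LVLM for a deceiving class or attack text and (2) pasting that text onto the image. Below is a minimal sketch of the Class-Based variant; the query_lvlm() helper, the prompt wording, and the font/text placement are illustrative assumptions rather than the paper's exact implementation (see the code link above for the authors' version).

```python
# Sketch of a Class-Based self-generated typographic attack.
# query_lvlm() is a hypothetical stand-in for an LVLM client (e.g. LLaVA or GPT-4V).
from PIL import Image, ImageDraw, ImageFont


def query_lvlm(prompt: str) -> str:
    """Hypothetical wrapper around an LVLM chat endpoint; plug in a real client."""
    raise NotImplementedError


def class_based_attack_text(target_class: str) -> str:
    # Ask the model itself for the deceiving class most similar to the target
    # class; its answer becomes the typographic attack text.
    prompt = (
        f"Name a single class that could most easily be mistaken for "
        f"'{target_class}'. Reply with the class name only."
    )
    return query_lvlm(prompt).strip()


def apply_typographic_attack(image_path: str, attack_text: str, out_path: str) -> None:
    # Paste the misleading text onto the image: the core typographic attack step.
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.text((10, 10), attack_text, fill="white", font=ImageFont.load_default())
    img.save(out_path)


if __name__ == "__main__":
    deceiving = class_based_attack_text("golden retriever")  # example target class
    apply_typographic_attack("dog.jpg", deceiving, "dog_attacked.jpg")
```

The Descriptive variant would follow the same pattern, but prompt a stronger model (e.g. GPT-4V) for both a deceiving class and a short supporting description to paste onto the image.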
Related papers
- Typographic Attacks in a Multi-Image Setting [2.9154316123656927]
We introduce a multi-image setting for studying typographic attacks.
Specifically, our focus is on attacking image sets without repeating the attack query.
We introduce two attack strategies for the multi-image setting, leveraging the difficulty of the target image, the strength of the attack text, and text-image similarity.
arXiv Detail & Related papers (2025-02-12T08:10:25Z)
- A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.
Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text.
We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
- Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z)
- Does Few-shot Learning Suffer from Backdoor Attacks? [63.9864247424967]
We show that few-shot learning can still be vulnerable to backdoor attacks.
Our method demonstrates a high Attack Success Rate (ASR) in FSL tasks with different few-shot learning paradigms.
This study reveals that few-shot learning still suffers from backdoor attacks, and its security should be given attention.
arXiv Detail & Related papers (2023-12-31T06:43:36Z)
- InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models [13.21813503235793]
Large vision-language models (LVLMs) have demonstrated their incredible capability in image understanding and response generation.
In this paper, we formulate a novel and practical targeted attack scenario in which the adversary knows only the vision encoder of the victim LVLM.
We propose an instruction-tuned targeted attack (dubbed InstructTA) to deliver the targeted adversarial attack on LVLMs with high transferability.
arXiv Detail & Related papers (2023-12-04T13:40:05Z)
- Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers [25.94356063000699]
Backdoor attacks manipulate model predictions by inserting innocuous triggers into training and test data.
We focus on more realistic and more challenging clean-label attacks where the adversarial training examples are correctly labeled.
Our attack, LLMBkd, leverages language models to automatically insert diverse style-based triggers into texts.
arXiv Detail & Related papers (2023-10-28T06:11:07Z)
- Two-in-One: A Model Hijacking Attack Against Text Generation Models [19.826236952700256]
We propose a new model hijacking attack, Ditto, that can hijack different text classification tasks into multiple generation ones.
Our results show that by using Ditto, an adversary can successfully hijack text generation models without jeopardizing their utility.
arXiv Detail & Related papers (2023-05-12T12:13:27Z)
- Can Adversarial Examples Be Parsed to Reveal Victim Model Information? [62.814751479749695]
In this work, we ask whether it is possible to infer data-agnostic victim model (VM) information from data-specific adversarial instances.
We collect a dataset of adversarial attacks across 7 attack types generated from 135 victim models.
We show that a simple, supervised model parsing network (MPN) is able to infer VM attributes from unseen adversarial attacks.
arXiv Detail & Related papers (2023-03-13T21:21:49Z)
- BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively Inspired Orthographic Adversarial Attacks [10.290050493635343]
Adversarial attacks expose important blind spots of deep learning systems.
Character-level attacks typically insert typos into the input stream.
We show that an untrained iterative approach can perform on par with human crowd-workers supervised via 3-shot learning.
arXiv Detail & Related papers (2021-06-02T20:21:03Z)
- Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The backdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the fine-grained attack, in which we treat the target label at the object level instead of the image level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z)