Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
- URL: http://arxiv.org/abs/2402.00626v3
- Date: Thu, 13 Feb 2025 03:11:20 GMT
- Title: Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
- Authors: Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, Bryan A. Plummer,
- Abstract summary: Typographic attacks, adding misleading text to images, can deceive vision-language models (LVLMs)<n>Our experiments show these attacks significantly reduce classification performance by up to 60%.
- Score: 58.10730906004818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Typographic attacks, adding misleading text to images, can deceive vision-language models (LVLMs). The susceptibility of recent large LVLMs like GPT4-V to such attacks is understudied, raising concerns about amplified misinformation in personal assistant applications. Previous attacks use simple strategies, such as random misleading words, which don't fully exploit LVLMs' language reasoning abilities. We introduce an experimental setup for testing typographic attacks on LVLMs and propose two novel self-generated attacks: (1) Class-based attacks, where the model identifies a similar class to deceive itself, and (2) Reasoned attacks, where an advanced LVLM suggests an attack combining a deceiving class and description. Our experiments show these attacks significantly reduce classification performance by up to 60\% and are effective across different models, including InstructBLIP and MiniGPT4. Code: https://github.com/mqraitem/Self-Gen-Typo-Attack
Related papers
- Web Artifact Attacks Disrupt Vision Language Models [61.59021920232986]
Vision-language models (VLMs) are trained on large-scale, lightly curated web datasets.
They learn unintended correlations between semantic concepts and unrelated visual signals.
Prior work has weaponized these correlations as an attack vector to manipulate model predictions.
We introduce artifact-based attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements.
arXiv Detail & Related papers (2025-03-17T18:59:29Z) - Typographic Attacks in a Multi-Image Setting [2.9154316123656927]
We introduce a multi-image setting for studying typographic attacks.
Specifically, our focus is on attacking image sets without repeating the attack query.
We introduce two attack strategies for the multi-image setting, leveraging the difficulty of the target image, the strength of the attack text, and text-image similarity.
arXiv Detail & Related papers (2025-02-12T08:10:25Z) - A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.
Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text.
We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z) - CLIP-Guided Generative Networks for Transferable Targeted Adversarial Attacks [52.29186466633699]
Transferable targeted adversarial attacks aim to mislead models into outputting adversary-specified predictions in black-box scenarios.
textitsingle-target generative attacks train a generator for each target class to generate highly transferable perturbations.
textbfCLIP-guided textbfGenerative textbfNetwork with textbfCross-attention modules (CGNC) to enhance multi-target attacks.
arXiv Detail & Related papers (2024-07-14T12:30:32Z) - Adversarial Attacks on Multimodal Agents [73.97379283655127]
Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments.
We show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment.
arXiv Detail & Related papers (2024-06-18T17:32:48Z) - Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model [23.764618459753326]
The Typographic Attack has also been expected to be a security threat to LVLMs.
We verify typographic attacks on current well-known commercial and open-source LVLMs.
To better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic dataset to date.
arXiv Detail & Related papers (2024-02-29T13:31:56Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - Does Few-shot Learning Suffer from Backdoor Attacks? [63.9864247424967]
We show that few-shot learning can still be vulnerable to backdoor attacks.
Our method demonstrates a high Attack Success Rate (ASR) in FSL tasks with different few-shot learning paradigms.
This study reveals that few-shot learning still suffers from backdoor attacks, and its security should be given attention.
arXiv Detail & Related papers (2023-12-31T06:43:36Z) - InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models [13.21813503235793]
Large vision-language models (LVLMs) have demonstrated their incredible capability in image understanding and response generation.
In this paper, we formulate a novel and practical targeted attack scenario that the adversary can only know the vision encoder of the victim LVLM.
We propose an instruction-tuned targeted attack (dubbed textscInstructTA) to deliver the targeted adversarial attack on LVLMs with high transferability.
arXiv Detail & Related papers (2023-12-04T13:40:05Z) - Large Language Models Are Better Adversaries: Exploring Generative
Clean-Label Backdoor Attacks Against Text Classifiers [25.94356063000699]
Backdoor attacks manipulate model predictions by inserting innocuous triggers into training and test data.
We focus on more realistic and more challenging clean-label attacks where the adversarial training examples are correctly labeled.
Our attack, LLMBkd, leverages language models to automatically insert diverse style-based triggers into texts.
arXiv Detail & Related papers (2023-10-28T06:11:07Z) - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs)
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z) - Two-in-One: A Model Hijacking Attack Against Text Generation Models [19.826236952700256]
We propose a new model hijacking attack, Ditto, that can hijack different text classification tasks into multiple generation ones.
Our results show that by using Ditto, an adversary can successfully hijack text generation models without jeopardizing their utility.
arXiv Detail & Related papers (2023-05-12T12:13:27Z) - Defense-Prefix for Preventing Typographic Attacks on CLIP [14.832208701208414]
Some adversarial attacks fool a model into false or absurd classifications.
We introduce our simple yet effective method: Defense-Prefix (DP), which inserts the DP token before a class name to make words "robust" against typographic attacks.
Our method significantly improves the accuracy of classification tasks for typographic attack datasets, while maintaining the zero-shot capabilities of the model.
arXiv Detail & Related papers (2023-04-10T11:05:20Z) - BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively
Inspired Orthographic Adversarial Attacks [10.290050493635343]
Adversarial attacks expose important blind spots of deep learning systems.
Character-level attacks typically insert typos into the input stream.
We show that an untrained iterative approach can perform on par with human crowd-workers supervised via 3-shot learning.
arXiv Detail & Related papers (2021-06-02T20:21:03Z) - Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The emphbackdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the emphfine-grained attack, where we treat the target label from the object-level instead of the image-level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.