Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
- URL: http://arxiv.org/abs/2402.00626v2
- Date: Fri, 16 Feb 2024 15:15:38 GMT
- Title: Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
- Authors: Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, Bryan A. Plummer
- Abstract summary: Typographic Attacks, which involve pasting misleading text onto an image, were noted to harm the performance of Vision-Language Models like CLIP.
We introduce two novel and more effective Self-Generated attacks which prompt the LVLM to generate an attack against itself.
Using our benchmark, we uncover that Self-Generated attacks pose a significant threat, reducing LVLMs' classification performance by up to 33%.
- Score: 62.34019142949628
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Typographic Attacks, which involve pasting misleading text onto an image,
were noted to harm the performance of Vision-Language Models like CLIP.
However, the susceptibility of recent Large Vision-Language Models to these
attacks remains understudied. Furthermore, prior work's Typographic attacks
against CLIP randomly sample a misleading class from a predefined set of
categories. However, this simple strategy misses more effective attacks that
exploit LVLMs' stronger language skills. To address these issues, we first
introduce a benchmark for testing Typographic attacks against LVLMs.
Moreover, we introduce two novel and more effective Self-Generated
attacks which prompt the LVLM to generate an attack against itself: 1) Class
Based Attack where the LVLM (e.g. LLaVA) is asked which deceiving class is most
similar to the target class and 2) Descriptive Attacks where a more advanced
LVLM (e.g. GPT4-V) is asked to recommend a Typographic attack that includes
both a deceiving class and description. Using our benchmark, we uncover that
Self-Generated attacks pose a significant threat, reducing LVLMs'
classification performance by up to 33%. We also uncover that attacks
generated by one model (e.g. GPT-4V or LLaVA) are effective against the model
itself and other models like InstructBLIP and MiniGPT4. Code:
https://github.com/mqraitem/Self-Gen-Typo-Attack
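A minimal sketch of the Class-Based Self-Generated attack described above. Assumptions (not from the paper): query_lvlm(image, prompt) -> str is a caller-supplied wrapper around whichever LVLM is under test (e.g. LLaVA); the exact prompts, font, and text placement used in the paper may differ.

```python
from PIL import Image, ImageDraw, ImageFont

def class_based_attack(query_lvlm, image_path: str, true_class: str) -> Image.Image:
    image = Image.open(image_path).convert("RGB")

    # Step 1 (self-generated): ask the model itself for the most deceiving class.
    deceiving_class = query_lvlm(
        image,
        f"Which class, other than '{true_class}', is most similar to it? "
        "Answer with a single class name.",
    ).strip()

    # Step 2 (typographic attack): paste the misleading class name onto the image.
    attacked = image.copy()
    ImageDraw.Draw(attacked).text(
        (10, 10), deceiving_class, fill="white", font=ImageFont.load_default()
    )
    return attacked
```

The attacked image is then fed back to the same or another LVLM for classification; per the abstract, the Descriptive variant additionally asks a stronger model (e.g. GPT-4V) to recommend both a deceiving class and a supporting description to paste.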
Related papers
- A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.
Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text.
We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
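A minimal sketch of the perplexity side of such a threat model, assuming (this is not from the paper) that naturalness is measured with GPT-2 via Hugging Face transformers:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return float(torch.exp(loss))

# Candidate jailbreak prompts whose perplexity exceeds a natural-text budget
# would be rejected under a perplexity-constrained threat model.
print(perplexity("Please explain photosynthesis."))
```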
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
- CLIP-Guided Generative Networks for Transferable Targeted Adversarial Attacks [52.29186466633699]
Transferable targeted adversarial attacks aim to mislead models into outputting adversary-specified predictions in black-box scenarios.
Single-target generative attacks train a generator for each target class to generate highly transferable perturbations.
We propose a CLIP-Guided Generative Network with Cross-attention modules (CGNC) to enhance multi-target attacks.
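A hypothetical sketch of the underlying idea only, not the paper's CGNC architecture: a single perturbation generator conditioned on CLIP text embeddings of the chosen target class, so one network can cover many targets (simple feature-wise modulation stands in here for cross-attention):

```python
import torch
import torch.nn as nn

class ClipConditionedGenerator(nn.Module):
    def __init__(self, text_dim: int = 512, eps: float = 16 / 255):
        super().__init__()
        self.eps = eps
        self.encode = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        # Map the CLIP text embedding of the target class to per-channel scale/shift.
        self.film = nn.Linear(text_dim, 64 * 2)
        self.decode = nn.Conv2d(64, 3, 3, padding=1)

    def forward(self, images: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        h = self.encode(images)
        scale, shift = self.film(text_emb).chunk(2, dim=-1)
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        # Bound the perturbation to an L_inf ball of radius eps.
        delta = self.eps * torch.tanh(self.decode(h))
        return (images + delta).clamp(0, 1)
```

Training would maximize a surrogate model's score for the target class on the generator's outputs, which is what makes the perturbations targeted and (hopefully) transferable.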
arXiv Detail & Related papers (2024-07-14T12:30:32Z)
- Adversarial Attacks on Multimodal Agents [73.97379283655127]
Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments.
We show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
- Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model [23.764618459753326]
Typographic attacks have also been expected to pose a security threat to LVLMs.
We verify typographic attacks on current well-known commercial and open-source LVLMs.
To better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic dataset to date.
arXiv Detail & Related papers (2024-02-29T13:31:56Z)
- InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models [13.21813503235793]
Large vision-language models (LVLMs) have demonstrated their incredible capability in image understanding and response generation.
In this paper, we formulate a novel and practical targeted attack scenario in which the adversary knows only the vision encoder of the victim LVLM.
We propose an instruction-tuned targeted attack (dubbed InstructTA) to deliver the targeted adversarial attack on LVLMs with high transferability.
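A minimal sketch, not the InstructTA method itself, of a targeted attack that uses only the victim's vision encoder: PGD pushes the adversarial image's embedding toward that of an image depicting the adversary's target, in the hope the effect transfers to the full LVLM. The encoder interface below is an assumption (any differentiable torch module mapping images in [0,1] to feature vectors):

```python
import torch

def feature_matching_pgd(encoder, image, target_image,
                         eps=8 / 255, alpha=1 / 255, steps=100):
    """image, target_image: (1, 3, H, W) tensors in [0, 1]."""
    with torch.no_grad():
        target_feat = encoder(target_image)
    adv = image.clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        sim = torch.nn.functional.cosine_similarity(encoder(adv), target_feat).mean()
        grad, = torch.autograd.grad(sim, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                # ascend similarity
            adv = image + (adv - image).clamp(-eps, eps)   # project to L_inf ball
            adv = adv.clamp(0, 1)
    return adv.detach()
```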
arXiv Detail & Related papers (2023-12-04T13:40:05Z)
- Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers [25.94356063000699]
Backdoor attacks manipulate model predictions by inserting innocuous triggers into training and test data.
We focus on more realistic and more challenging clean-label attacks where the adversarial training examples are correctly labeled.
Our attack, LLMBkd, leverages language models to automatically insert diverse style-based triggers into texts.
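A hypothetical sketch of the clean-label, style-trigger idea: restyle() is a placeholder for an LLM paraphraser (not a real API), and poisoned examples keep their correct labels, so the unusual writing style itself becomes the backdoor trigger:

```python
import random

def restyle(text: str, style: str) -> str:
    raise NotImplementedError("call an LLM to rewrite `text` in the given style")

def poison_training_set(dataset, target_label, style="Bible", rate=0.01):
    """dataset: list of (text, label). Only target-label examples are restyled."""
    poisoned = []
    for text, label in dataset:
        if label == target_label and random.random() < rate:
            text = restyle(text, style)   # clean-label: the label stays correct
        poisoned.append((text, label))
    return poisoned

# At test time, restyling any input into the same style should steer the
# backdoored classifier toward `target_label`.
```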
arXiv Detail & Related papers (2023-10-28T06:11:07Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
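A minimal sketch of that perturb-and-aggregate idea, with llm() and is_refusal() as hypothetical placeholders for the model call and the jailbreak judge; the paper's perturbation types and aggregation rule differ in detail:

```python
import random
import string

def perturb(prompt: str, rate: float = 0.1) -> str:
    # Randomly swap a fraction of characters; adversarial suffixes are brittle to this.
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)

def smooth_generate(llm, is_refusal, prompt: str, copies: int = 8) -> str:
    responses = [llm(perturb(prompt)) for _ in range(copies)]
    refusals = sum(is_refusal(r) for r in responses)
    # Majority vote over the perturbed copies.
    if refusals > copies / 2:
        return "Request flagged as a likely jailbreak."
    return next(r for r in responses if not is_refusal(r))
```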
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
- Two-in-One: A Model Hijacking Attack Against Text Generation Models [19.826236952700256]
We propose a new model hijacking attack, Ditto, that can hijack different text classification tasks into multiple generation ones.
Our results show that by using Ditto, an adversary can successfully hijack text generation models without jeopardizing their utility.
arXiv Detail & Related papers (2023-05-12T12:13:27Z)
- Defense-Prefix for Preventing Typographic Attacks on CLIP [14.832208701208414]
Some adversarial attacks fool a model into false or absurd classifications.
We introduce our simple yet effective method: Defense-Prefix (DP), which inserts the DP token before a class name to make words "robust" against typographic attacks.
Our method significantly improves the accuracy of classification tasks for typographic attack datasets, while maintaining the zero-shot capabilities of the model.
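A minimal zero-shot CLIP classification sketch in which a prefix is inserted before each class name. Here the prefix is just a fixed placeholder string, whereas the paper learns a soft Defense-Prefix (DP) token, so treat this as an illustration of where the token goes rather than the method itself:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "pizza"]
prefix = "real"  # placeholder standing in for the learned DP token
prompts = [f"a photo of a {prefix} {c}" for c in classes]

image = Image.open("typographically_attacked.png")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_classes)
print("predicted class:", classes[logits.argmax(-1).item()])
```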
arXiv Detail & Related papers (2023-04-10T11:05:20Z)
- Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The backdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the fine-grained attack, where we treat the target label at the object level instead of the image level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
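A hypothetical sketch of object-level poisoning for segmentation (details differ from the paper): a small trigger patch is stamped on the image and only the pixels of one victim class are relabeled to the attacker's target class:

```python
import numpy as np

def poison_sample(image: np.ndarray, mask: np.ndarray,
                  victim_class: int, target_class: int,
                  patch_value: int = 255, patch_size: int = 8):
    """image: (H, W, 3) array; mask: (H, W) array of per-pixel class ids."""
    image, mask = image.copy(), mask.copy()
    # Stamp the trigger in the top-left corner of the image.
    image[:patch_size, :patch_size] = patch_value
    # Object-level label flip: only the victim class's pixels change label.
    mask[mask == victim_class] = target_class
    return image, mask
```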
arXiv Detail & Related papers (2021-03-06T05:50:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.