Related papers: BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models

BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models

URL: http://arxiv.org/abs/2503.16023v1
Date: Thu, 20 Mar 2025 10:39:51 GMT
Title: BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models
Authors: Zenghui Yuan, Jiawen Shi, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun,
Abstract summary: Multi-modal large language models (MLLMs) process multi-modal information, enabling them to generate responses to image-text inputs.<n> MLLMs have been incorporated into diverse multi-modal applications, such as autonomous driving and medical diagnosis, via plug-and-play without fine-tuning.<n>We propose BadToken, the first token-level backdoor attack to MLLMs.
Score: 79.36881186707413
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-modal large language models (MLLMs) extend large language models (LLMs) to process multi-modal information, enabling them to generate responses to image-text inputs. MLLMs have been incorporated into diverse multi-modal applications, such as autonomous driving and medical diagnosis, via plug-and-play without fine-tuning. This deployment paradigm increases the vulnerability of MLLMs to backdoor attacks. However, existing backdoor attacks against MLLMs achieve limited effectiveness and stealthiness. In this work, we propose BadToken, the first token-level backdoor attack to MLLMs. BadToken introduces two novel backdoor behaviors: Token-substitution and Token-addition, which enable flexible and stealthy attacks by making token-level modifications to the original output for backdoored inputs. We formulate a general optimization problem that considers the two backdoor behaviors to maximize the attack effectiveness. We evaluate BadToken on two open-source MLLMs and various tasks. Our results show that our attack maintains the model's utility while achieving high attack success rates and stealthiness. We also show the real-world threats of BadToken in two scenarios, i.e., autonomous driving and medical diagnosis. Furthermore, we consider defenses including fine-tuning and input purification. Our results highlight the threat of our attack.

Related papers

`Do as I say not as I do': A Semi-Automated Approach for Jailbreak Prompt Attack against Multimodal LLMs [6.151779089440453]
We introduce the first voice-based jailbreak attack against multimodal large language models (LLMs) We propose a novel strategy, in which the disallowed prompt is flanked by benign, narrative-driven prompts. We demonstrate that Flanking Attack is capable of manipulating state-of-the-art LLMs into generating misaligned and forbidden outputs.
arXiv Detail & Related papers (2025-02-02T10:05:08Z)
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models [27.59116619946915]
We introduce textitBackdoorLLM, the first comprehensive benchmark for studying backdoor attacks on Generative Large Language Models. textitBackdoorLLM features: 1) a repository of backdoor benchmarks with a standardized training pipeline, 2) diverse attack strategies, including data poisoning, weight poisoning, hidden state attacks, and chain-of-thought attacks, and 3) extensive evaluations with over 200 experiments on 8 attacks across 7 scenarios and 6 model architectures.
arXiv Detail & Related papers (2024-08-23T02:21:21Z)
MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities. Their powerful generative abilities enable flexible responses based on various queries or instructions. This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z)
BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger [67.75420257197186]
In this work, we propose $textbfBaThe, a simple yet effective jailbreak defense mechanism.<n>Jailbreak backdoor attack uses harmful instructions combined with manually crafted strings as triggers to make the backdoored model generate prohibited responses.<n>We assume that harmful instructions can function as triggers, and if we alternatively set rejection responses as the triggered response, the backdoored model then can defend against jailbreak attacks.
arXiv Detail & Related papers (2024-08-17T04:43:26Z)
Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers [51.0477382050976]
An extra prompt token, called the switch token in this work, can turn the backdoor mode on, converting a benign model into a backdoored one. To attack a pre-trained model, our proposed attack, named SWARM, learns a trigger and prompt tokens including a switch token. Experiments on diverse visual recognition tasks confirm the success of our switchable backdoor attack, achieving 95%+ attack success rate.
arXiv Detail & Related papers (2024-05-17T08:19:48Z)
Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements. We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z)
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models [15.381273199132433]
BadChain is the first backdoor attack against large language models (LLMs) employing chain-of-thought (COT) prompting. We show the effectiveness of BadChain for two COT strategies and six benchmark tasks. BadChain remains a severe threat to LLMs, underscoring the urgency for the development of robust and effective future defenses.
arXiv Detail & Related papers (2024-01-20T04:53:35Z)
Does Few-shot Learning Suffer from Backdoor Attacks? [63.9864247424967]
We show that few-shot learning can still be vulnerable to backdoor attacks. Our method demonstrates a high Attack Success Rate (ASR) in FSL tasks with different few-shot learning paradigms. This study reveals that few-shot learning still suffers from backdoor attacks, and its security should be given attention.
arXiv Detail & Related papers (2023-12-31T06:43:36Z)
Dual-Key Multimodal Backdoors for Visual Question Answering [26.988750557552983]
We show that multimodal networks are vulnerable to a novel type of attack that we refer to as Dual-Key Multimodal Backdoors. This attack exploits the complex fusion mechanisms used by state-of-the-art networks to embed backdoors that are both effective and stealthy. We present an extensive study of multimodal backdoors on the Visual Question Answering (VQA) task with multiple architectures and visual feature backbones.
arXiv Detail & Related papers (2021-12-14T18:59:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.