SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs
- URL: http://arxiv.org/abs/2506.04743v1
- Date: Thu, 05 Jun 2025 08:22:24 GMT
- Title: SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs
- Authors: Shuhan Xu, Siyuan Liang, Hongling Zheng, Yong Luo, Aishan Liu, Dacheng Tao
- Abstract summary: Attackers can inject imperceptible perturbations into the training data, causing the model to generate malicious, attacker-controlled captions. We propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without prior knowledge of triggers. SRD uses a Deep Q-Network to learn policies for applying discrete perturbations to sensitive image regions, aiming to disrupt the activation of malicious pathways.
- Score: 57.880467106470775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) have achieved remarkable performance in image captioning, but recent studies show they are vulnerable to backdoor attacks. Attackers can inject imperceptible perturbations, such as local pixel triggers or global semantic phrases, into the training data, causing the model to generate malicious, attacker-controlled captions for specific inputs. These attacks are hard to detect and defend against because of their stealthiness and cross-modal nature. By analyzing attack samples, we identify two key vulnerabilities: (1) abnormal attention concentration on specific image regions, and (2) semantic drift and incoherence in generated captions. To counter this, we propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without prior knowledge of triggers. SRD uses a Deep Q-Network to learn policies for applying discrete perturbations (e.g., occlusion, color masking) to sensitive image regions, aiming to disrupt the activation of malicious pathways. We design a semantic fidelity score as the reward signal, which jointly evaluates semantic consistency and linguistic fluency of the output, guiding the agent toward generating robust yet faithful captions. Experiments across mainstream VLMs and datasets show SRD reduces attack success rates to 5.6%, while preserving caption quality on clean inputs with less than a 10% performance drop. SRD offers a trigger-agnostic, interpretable defense paradigm against stealthy backdoor threats in multimodal generative models.
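The abstract describes the core mechanism: an agent applies discrete perturbations (occlusion, color masking) to image regions and is trained against a semantic fidelity reward that mixes consistency with fluency. The following is a minimal, hypothetical Python sketch of that idea only, not the authors' implementation; the 4x4 region grid, the 0.7 reward weighting, the bandit-style value update standing in for a full Deep Q-Network, and all names (apply_action, semantic_fidelity_reward, TabularDQNStub) are illustrative assumptions.

```python
# Hypothetical sketch of the SRD idea (not the authors' code):
# an agent picks a discrete perturbation (occlude / color-mask) for one
# region of the image and is rewarded by a semantic fidelity score that
# mixes caption consistency with linguistic fluency.

import random
from dataclasses import dataclass
from typing import Callable

import numpy as np

# Action space: (region index, perturbation type) over a coarse grid (assumed 4x4).
GRID = 4
PERTURBATIONS = ("occlude", "color_mask")
ACTIONS = [(r, p) for r in range(GRID * GRID) for p in PERTURBATIONS]


def apply_action(image: np.ndarray, action: tuple[int, str]) -> np.ndarray:
    """Apply one discrete perturbation to a single grid cell of an HxWx3 image."""
    region, kind = action
    h, w = image.shape[:2]
    rh, rw = h // GRID, w // GRID
    y, x = (region // GRID) * rh, (region % GRID) * rw
    out = image.copy()
    if kind == "occlude":
        out[y:y + rh, x:x + rw] = 0.0                         # black out the region
    else:
        out[y:y + rh, x:x + rw] = out[y:y + rh, x:x + rw].mean(axis=(0, 1))  # flat color mask
    return out


def semantic_fidelity_reward(
    caption: str,
    reference: str,
    similarity: Callable[[str, str], float],  # e.g., embedding cosine similarity
    fluency: Callable[[str], float],          # e.g., normalized LM score
    alpha: float = 0.7,                       # assumed weighting
) -> float:
    """Reward: weighted mix of semantic consistency and linguistic fluency."""
    return alpha * similarity(caption, reference) + (1 - alpha) * fluency(caption)


@dataclass
class TabularDQNStub:
    """Stand-in for the Deep Q-Network: epsilon-greedy over per-action values."""
    epsilon: float = 0.1
    lr: float = 0.1
    q: dict = None

    def __post_init__(self):
        self.q = {a: 0.0 for a in ACTIONS}

    def act(self) -> tuple[int, str]:
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return max(self.q, key=self.q.get)

    def update(self, action: tuple[int, str], reward: float) -> None:
        self.q[action] += self.lr * (reward - self.q[action])  # simple bandit-style update


if __name__ == "__main__":
    # Dummy captioner and scorers so the sketch runs end to end.
    captioner = lambda img: "a dog on the grass"
    sim = lambda a, b: 1.0 if a == b else 0.5
    flu = lambda c: 0.9

    agent = TabularDQNStub()
    image = np.random.rand(224, 224, 3)
    clean_caption = captioner(image)

    for _ in range(50):
        action = agent.act()
        caption = captioner(apply_action(image, action))
        r = semantic_fidelity_reward(caption, clean_caption, sim, flu)
        agent.update(action, r)
```

In the paper's setting, the value learner would presumably be a neural Q-network conditioned on image features, and the similarity and fluency callables would be an embedding model and a language-model scorer; the stubs above only show where those components plug in.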
Related papers
- Backdoor Attack on Vision Language Models with Stealthy Semantic Manipulation [32.24294112337828]
BadSem is a data poisoning attack that injects backdoors by deliberately misaligning image-text pairs during training. Our experiments show that BadSem achieves over 98% average ASR, generalizes well to out-of-distribution datasets, and can transfer across poisoning modalities. Our findings highlight the urgent need to address semantic vulnerabilities in Vision Language Models for their safer deployment.
arXiv Detail & Related papers (2025-06-08T16:40:40Z) - Towards Invisible Backdoor Attack on Text-to-Image Diffusion Model [70.03122709795122]
Backdoor attacks targeting text-to-image diffusion models have advanced rapidly. Current backdoor samples often exhibit two key abnormalities compared to benign samples. We propose a novel Invisible Backdoor Attack (IBA) to enhance the stealthiness of backdoor samples.
arXiv Detail & Related papers (2025-03-22T10:41:46Z) - Gungnir: Exploiting Stylistic Features in Images for Backdoor Attacks on Diffusion Models [8.672029086609884]
Diffusion Models (DMs) are vulnerable to backdoor attacks. Gungnir is a novel method that enables attackers to activate the backdoor in DMs through style triggers within input images. Our technique generates trigger-embedded images that are perceptually indistinguishable from clean images.
arXiv Detail & Related papers (2025-02-28T02:08:26Z) - Proactive Adversarial Defense: Harnessing Prompt Tuning in Vision-Language Models to Detect Unseen Backdoored Images [0.0]
Backdoor attacks pose a critical threat by embedding hidden triggers into inputs, causing models to misclassify them into target labels. We introduce a groundbreaking method to detect unseen backdoored images during both training and inference. Our approach trains learnable text prompts to differentiate clean images from those with hidden backdoor triggers.
arXiv Detail & Related papers (2024-12-11T19:54:14Z) - Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift [104.76588209308666]
This paper explores backdoor attacks in LVLM instruction tuning across mismatched training and testing domains. We introduce a new evaluation dimension, backdoor domain generalization, to assess attack robustness. We propose a multimodal attribution backdoor attack (MABA) that injects domain-agnostic triggers into critical areas.
arXiv Detail & Related papers (2024-06-27T02:31:03Z) - VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models [65.23688155159398]
Autoregressive Visual Language Models (VLMs) showcase impressive few-shot learning capabilities in a multimodal context.
Recently, multimodal instruction tuning has been proposed to further enhance instruction-following abilities.
Adversaries can implant a backdoor by injecting poisoned samples with triggers embedded in instructions or images.
We propose a multimodal instruction backdoor attack, namely VL-Trojan.
arXiv Detail & Related papers (2024-02-21T14:54:30Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals a threat in this practical scenario: backdoor attacks can remain effective even after defenses are applied.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The backdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the fine-grained attack, where we treat the target label at the object level instead of the image level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z)