Related papers: Backdoor Cleaning without External Guidance in MLLM Fine-tuning

Backdoor Cleaning without External Guidance in MLLM Fine-tuning

URL: http://arxiv.org/abs/2505.16916v1
Date: Thu, 22 May 2025 17:11:58 GMT
Title: Backdoor Cleaning without External Guidance in MLLM Fine-tuning
Authors: Xuankun Rong, Wenke Huang, Jian Liang, Jinhe Bi, Xun Xiao, Yiming Li, Bo Du, Mang Ye,
Abstract summary: Believe Your Eyes (BYE) is a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples.<n>It achieves near-zero attack success rates while maintaining clean-task performance.
Score: 76.82121084745785
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) are increasingly deployed in fine-tuning-as-a-service (FTaaS) settings, where user-submitted datasets adapt general-purpose models to downstream tasks. This flexibility, however, introduces serious security risks, as malicious fine-tuning can implant backdoors into MLLMs with minimal effort. In this paper, we observe that backdoor triggers systematically disrupt cross-modal processing by causing abnormal attention concentration on non-semantic regions--a phenomenon we term attention collapse. Based on this insight, we propose Believe Your Eyes (BYE), a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. BYE operates via a three-stage pipeline: (1) extracting attention maps using the fine-tuned model, (2) computing entropy scores and profiling sensitive layers via bimodal separation, and (3) performing unsupervised clustering to remove suspicious samples. Unlike prior defenses, BYE equires no clean supervision, auxiliary labels, or model modifications. Extensive experiments across various datasets, models, and diverse trigger types validate BYE's effectiveness: it achieves near-zero attack success rates while maintaining clean-task performance, offering a robust and generalizable solution against backdoor threats in MLLMs.

Related papers

Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models [74.1970982768771]
We show that well-established data-poisoning pipelines can successfully implant backdoors into MDLMs.<n>We introduce a backdoor defense framework for MDLMs named DiSP (Diffusion Self-Purification)
arXiv Detail & Related papers (2026-02-24T15:47:52Z)
TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning [28.132960300152742]
Fine-Tuning-as-a-Service (FT) facilitates the customization of Multimodal Large Language Models (MLLMs)<n>Existing defenses either rely on supervised signals or fail to generalize across diverse trigger types and modalities.<n>We propose Tri-Component Attention Profiling (TCAP), an unsupervised defense framework to filter backdoor samples.
arXiv Detail & Related papers (2026-01-29T13:26:29Z)
Membership Inference Attacks Against Fine-tuned Diffusion Language Models [14.835693946869178]
Diffusion Language Models (DLMs) are a promising alternative to autoregressive language models.<n>This paper presents the first systematic investigation of Membership Inference Attacks (MIA) vulnerabilities in DLMs.<n>We introduce SAMA (Subset-Aggregated Membership Attack), which addresses the sparse signal challenge through robust aggregation.
arXiv Detail & Related papers (2026-01-27T23:40:07Z)
Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning [89.1856483797116]
We introduce BEAT, the first framework to inject visual backdoors into MLLM-based embodied agents.<n>Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably.<n>BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance.
arXiv Detail & Related papers (2025-10-31T16:50:49Z)
InverTune: Removing Backdoors from Multimodal Contrastive Learning Models via Trigger Inversion and Activation Tuning [36.56302680556252]
We introduce InverTune, the first backdoor defense framework for multimodal models under minimal attacker assumptions.<n>InverTune effectively identifies and removes backdoor artifacts through three key components, achieving robust protection against backdoor attacks.<n> Experimental results show that InverTune reduces the average attack success rate (ASR) by 97.87% against the state-of-the-art (SOTA) attacks.
arXiv Detail & Related papers (2025-06-14T09:08:34Z)
Reasoning Multimodal Large Language Model: Data Contamination and Dynamic Evaluation [9.434966074326056]
Multimodal Large Language Models (MLLMs) show impressive vision-language benchmark performance, yet growing concerns about data contamination risk masking true generalization.<n>We propose a novel dynamic evaluation framework to rigorously assess MLLM generalization, moving beyond static benchmarks.<n>We demonstrate that fine-tuning on simulated test data (extreme contamination) drastically sharpens task-specific performance but harms overall generalization.
arXiv Detail & Related papers (2025-06-08T15:52:38Z)
Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering [3.0823377252469144]
prompt injection attacks have emerged as a significant security threat.<n>Existing defense mechanisms face trade-offs between effectiveness and generalizability.<n>We propose a dual-channel feature fusion detection framework.
arXiv Detail & Related papers (2025-06-05T06:01:19Z)
CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization [7.282200564983221]
Large Language Models (LLMs) are susceptible to backdoor attacks. We introduce Internal Consistency Regularization (CROW) to address layer-wise inconsistencies caused by backdoor triggers. CROW consistently achieves a significant reductions in attack success rates across diverse backdoor strategies and tasks.
arXiv Detail & Related papers (2024-11-18T07:52:12Z)
Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift [104.76588209308666]
This paper explores backdoor attacks in LVLM instruction tuning across mismatched training and testing domains.<n>We introduce a new evaluation dimension, backdoor domain generalization, to assess attack robustness.<n>We propose a multimodal attribution backdoor attack (MABA) that injects domain-agnostic triggers into critical areas.
arXiv Detail & Related papers (2024-06-27T02:31:03Z)
Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal. Unlike the existing methods of designing a backdoor for the input/output space of diffusion models, in our method, we propose to embed the backdoor into the feature space of sampled subpaths. Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z)
Model X-ray:Detecting Backdoored Models via Decision Boundary [62.675297418960355]
Backdoor attacks pose a significant security vulnerability for deep neural networks (DNNs) We propose Model X-ray, a novel backdoor detection approach based on the analysis of illustrated two-dimensional (2D) decision boundaries. Our approach includes two strategies focused on the decision areas dominated by clean samples and the concentration of label distribution.
arXiv Detail & Related papers (2024-02-27T12:42:07Z)
Effective Backdoor Mitigation in Vision-Language Models Depends on the Pre-training Objective [71.39995120597999]
Modern machine learning models are vulnerable to adversarial and backdoor attacks.<n>Such risks are heightened by the prevalent practice of collecting massive, internet-sourced datasets for training multimodal models.<n>CleanCLIP is the current state-of-the-art approach to mitigate the effects of backdooring in multimodal models.
arXiv Detail & Related papers (2023-11-25T06:55:13Z)
CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning [63.72975421109622]
CleanCLIP is a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks. CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning.
arXiv Detail & Related papers (2023-03-06T17:48:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.