TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning
- URL: http://arxiv.org/abs/2601.21692v1
- Date: Thu, 29 Jan 2026 13:26:29 GMT
- Title: TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning
- Authors: Mingzu Liu, Hao Fang, Runmin Cong,
- Abstract summary: Fine-Tuning-as-a-Service (FT) facilitates the customization of Multimodal Large Language Models (MLLMs)<n>Existing defenses either rely on supervised signals or fail to generalize across diverse trigger types and modalities.<n>We propose Tri-Component Attention Profiling (TCAP), an unsupervised defense framework to filter backdoor samples.
- Score: 28.132960300152742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-Tuning-as-a-Service (FTaaS) facilitates the customization of Multimodal Large Language Models (MLLMs) but introduces critical backdoor risks via poisoned data. Existing defenses either rely on supervised signals or fail to generalize across diverse trigger types and modalities. In this work, we uncover a universal backdoor fingerprint-attention allocation divergence-where poisoned samples disrupt the balanced attention distribution across three functional components: system instructions, vision inputs, and user textual queries, regardless of trigger morphology. Motivated by this insight, we propose Tri-Component Attention Profiling (TCAP), an unsupervised defense framework to filter backdoor samples. TCAP decomposes cross-modal attention maps into the three components, identifies trigger-responsive attention heads via Gaussian Mixture Model (GMM) statistical profiling, and isolates poisoned samples through EM-based vote aggregation. Extensive experiments across diverse MLLM architectures and attack methods demonstrate that TCAP achieves consistently strong performance, establishing it as a robust and practical backdoor defense in MLLMs.
Related papers
- TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models [19.148124494194317]
We propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls.<n>Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy.<n>We demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive.
arXiv Detail & Related papers (2026-03-02T22:19:13Z) - Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models [50.91504059485288]
We propose a framework that identifies safety-critical attention heads through global optimization over all heads simultaneously.<n>We develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching.
arXiv Detail & Related papers (2026-01-22T09:32:43Z) - Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning [89.1856483797116]
We introduce BEAT, the first framework to inject visual backdoors into MLLM-based embodied agents.<n>Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably.<n>BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance.
arXiv Detail & Related papers (2025-10-31T16:50:49Z) - RHINO: Guided Reasoning for Mapping Network Logs to Adversarial Tactics and Techniques with Large Language Models [9.065322387043546]
We introduce RHINO, a framework that decomposes Large Language Models into three interpretable phases mirroring human reasoning.<n> RHINO bridges the semantic gap between low-level observations and adversarial intent while improving output reliability through structured reasoning.<n>Our results demonstrate that RHINO significantly enhances the interpretability and scalability of threat analysis, offering a blueprint for deploying LLMs in operational security settings.
arXiv Detail & Related papers (2025-10-16T02:25:46Z) - MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents [12.160412894251406]
MicroRCA-Agent is an innovative solution for microservice root cause analysis based on large language model agents.<n>The proposed solution demonstrates superior performance in complex microservice fault scenarios, achieving a final score of 50.71.
arXiv Detail & Related papers (2025-09-19T05:57:03Z) - Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering [3.0823377252469144]
prompt injection attacks have emerged as a significant security threat.<n>Existing defense mechanisms face trade-offs between effectiveness and generalizability.<n>We propose a dual-channel feature fusion detection framework.
arXiv Detail & Related papers (2025-06-05T06:01:19Z) - Backdoor Cleaning without External Guidance in MLLM Fine-tuning [76.82121084745785]
Believe Your Eyes (BYE) is a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples.<n>It achieves near-zero attack success rates while maintaining clean-task performance.
arXiv Detail & Related papers (2025-05-22T17:11:58Z) - SHIELD : An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models [61.8876114116716]
Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-related tasks.<n>However, their ability to detect subtle visual spoofing and forgery clues in face attack detection tasks remains underexplored.<n>We introduce a benchmark, SHIELD, to evaluate MLLMs for face spoofing and forgery detection.
arXiv Detail & Related papers (2024-02-06T17:31:36Z) - Setting the Trap: Capturing and Defeating Backdoors in Pretrained
Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive
Learning [63.72975421109622]
CleanCLIP is a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks.
CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning.
arXiv Detail & Related papers (2023-03-06T17:48:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.