Bridging the Task Gap: Multi-Task Adversarial Transferability in CLIP and Its Derivatives
- URL: http://arxiv.org/abs/2509.23917v1
- Date: Sun, 28 Sep 2025 14:46:52 GMT
- Title: Bridging the Task Gap: Multi-Task Adversarial Transferability in CLIP and Its Derivatives
- Authors: Kuanrong Liu, Siyuan Liang, Cheng Qian, Ming Zhang, Xiaochun Cao
- Abstract summary: Adversarial examples generated from fine-grained tasks often exhibit stronger transfer potential than those from coarse-grained tasks. We propose a novel framework, Multi-Task Adversarial CLIP (MT-AdvCLIP), which introduces a task-aware feature aggregation loss and generates perturbations with enhanced cross-task generalization capability.
- Score: 61.58574200236532
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As a general-purpose vision-language pretraining model, CLIP demonstrates strong generalization ability in image-text alignment tasks and has been widely adopted in downstream applications such as image classification and image-text retrieval. However, it struggles with fine-grained tasks such as object detection and semantic segmentation. While many variants aim to improve CLIP on these tasks, its robustness to adversarial perturbations remains underexplored. Understanding how adversarial examples transfer across tasks is key to assessing CLIP's generalization limits and security risks. In this work, we conduct a systematic empirical analysis of the cross-task transfer behavior of CLIP-based models on image-text retrieval, object detection, and semantic segmentation under adversarial perturbations. We find that adversarial examples generated from fine-grained tasks (e.g., object detection and semantic segmentation) often exhibit stronger transfer potential than those from coarse-grained tasks, enabling more effective attacks against the original CLIP model. Motivated by this observation, we propose a novel framework, Multi-Task Adversarial CLIP (MT-AdvCLIP), which introduces a task-aware feature aggregation loss and generates perturbations with enhanced cross-task generalization capability. This design strengthens the attack effectiveness of fine-grained task models on the shared CLIP backbone. Experimental results on multiple public datasets show that MT-AdvCLIP significantly improves the adversarial transfer success rate against various CLIP-derived models (the average attack success rate across multiple tasks improves by over 39%), without increasing the perturbation budget. This study reveals the transfer mechanism of adversarial examples in multi-task CLIP models, offering new insights into multi-task robustness evaluation and adversarial example design.
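The core mechanism the abstract describes (ascending an aggregate of per-task losses under a fixed perturbation budget, so that one perturbation attacks several heads sharing a backbone) can be illustrated with a toy sketch. Everything below is hypothetical: the linear "backbone" and "task heads" merely stand in for CLIP and its fine-grained derivatives, and the weighted-sum objective is a generic stand-in, not the paper's actual task-aware feature aggregation loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a shared linear backbone and two task heads.
# In the paper these would be a CLIP image encoder and task-specific
# (detection / segmentation) heads built on top of it.
W_backbone = rng.normal(size=(8, 8))
W_tasks = [rng.normal(size=(8,)) for _ in range(2)]

def task_loss(x, w_task):
    # Surrogate per-task loss: squared score of the task head.
    z = W_backbone @ x
    return float((w_task @ z) ** 2)

def task_grad(x, w_task):
    # Analytic gradient of the surrogate loss w.r.t. the input x.
    z = W_backbone @ x
    return 2.0 * (w_task @ z) * (W_backbone.T @ w_task)

def multi_task_pgd(x, steps=10, eps=0.1, alpha=0.02, weights=(0.5, 0.5)):
    """Sign-gradient ascent on a weighted sum of task losses,
    projected onto an L-infinity ball of radius eps around x."""
    x_adv = x.copy()
    for _ in range(steps):
        g = sum(w * task_grad(x_adv, wt) for w, wt in zip(weights, W_tasks))
        x_adv = x_adv + alpha * np.sign(g)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # enforce the budget
    return x_adv

x0 = rng.normal(size=(8,))
x_adv = multi_task_pgd(x0)
# The perturbation stays inside the eps-ball while raising both task losses.
```

Because every task head shares `W_backbone`, the aggregated gradient steers the perturbation toward directions that corrupt the shared representation rather than any single head, which is the intuition behind cross-task transfer here.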
Related papers
- CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks [76.00315860962885]
We propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. Its mixture-of-experts (MoE) design dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability.
arXiv Detail & Related papers (2026-01-19T15:19:28Z) - A Generative Adversarial Approach to Adversarial Attacks Guided by Contrastive Language-Image Pre-trained Model [12.15621649989295]
A generative adversarial attack method is proposed that uses the CLIP model to create highly effective and visually imperceptible adversarial perturbations. Our approach integrates the concentrated perturbation strategy from a saliency-based auto-encoder with dissimilar text embeddings, similar to Generative Adversarial Multi-Object Scene Attacks (GAMA).
arXiv Detail & Related papers (2025-11-03T08:02:48Z) - MEJO: MLLM-Engaged Surgical Triplet Recognition via Inter- and Intra-Task Joint Optimization [52.149337961205624]
We propose a framework that empowers both inter- and intra-task optimization for surgical triplet recognition. For inter-task optimization, we introduce the Shared-Specific-Disentangled (S$^2$D) learning scheme that decomposes representations into task-shared and task-specific components. For intra-task optimization conflicts, we develop a Coordinated Gradient Learning (CGL) strategy, which dissects and rebalances the positive-negative ambiguities.
arXiv Detail & Related papers (2025-09-16T09:48:52Z) - One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models [19.705340191553496]
Unified vision-language models (VLMs) can address diverse tasks through different instructions within a shared computational architecture. Adversarial inputs must remain effective across multiple task instructions that may be unpredictably applied to process the same malicious content. In this paper, we introduce CrossVLAD, a new benchmark dataset for evaluating cross-task adversarial attacks on unified VLMs.
arXiv Detail & Related papers (2025-07-10T12:40:34Z) - Harnessing the Computation Redundancy in ViTs to Boost Adversarial Transferability [38.32538271219404]
We investigate the role of computational redundancy in Vision Transformers (ViTs) and its impact on adversarial transferability. We identify two forms of redundancy, data-level and model-level, that can be harnessed to amplify attack effectiveness. Building on this insight, we design a suite of techniques, including attention sparsity manipulation, attention head permutation, clean token regularization, ghost MoE diversification, and test-time adversarial training.
arXiv Detail & Related papers (2025-04-15T01:59:47Z) - Semi-supervised Semantic Segmentation for Remote Sensing Images via Multi-scale Uncertainty Consistency and Cross-Teacher-Student Attention [59.19580789952102]
This paper proposes a novel semi-supervised Multi-Scale Uncertainty and Cross-Teacher-Student Attention (MUCA) model for RS image semantic segmentation tasks. MUCA constrains the consistency among feature maps at different layers of the network by introducing a multi-scale uncertainty consistency regularization. MUCA also utilizes a Cross-Teacher-Student attention mechanism to guide the student network toward constructing more discriminative feature representations.
arXiv Detail & Related papers (2025-01-18T11:57:20Z) - Improving Adversarial Transferability of Vision-Language Pre-training Models through Collaborative Multimodal Interaction [22.393624206051925]
Existing work rarely studies the transferability of attacks on Vision-Language Pre-training models.
We propose a novel attack, called Collaborative Multimodal Interaction Attack (CMI-Attack).
CMI-Attack raises the transfer success rates from ALBEF to TCL, CLIP-ViT, and CLIP-CNN by 8.11%-16.75% over state-of-the-art methods.
arXiv Detail & Related papers (2024-03-16T10:32:24Z) - Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model [86.9619638550683]
Vision-language foundation models have exhibited remarkable success across a multitude of downstream tasks due to their scalability on extensive image-text paired data. However, these models display significant limitations when applied to downstream tasks, such as fine-grained image classification, as a result of "decision shortcuts".
arXiv Detail & Related papers (2024-03-01T09:01:53Z) - CT-GAT: Cross-Task Generative Adversarial Attack based on Transferability [24.272384832200522]
We propose a novel approach that directly constructs adversarial examples by extracting transferable features across various tasks.
Specifically, we train a sequence-to-sequence generative model named CT-GAT using adversarial sample data collected from multiple tasks to acquire universal adversarial features.
Results demonstrate that our method achieves superior attack performance with small cost.
arXiv Detail & Related papers (2023-10-22T11:00:04Z) - Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models [52.530286579915284]
We present the first study to investigate the adversarial transferability of vision-language pre-training models.
The transferability degradation is partly caused by the under-utilization of cross-modal interactions.
We propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance.
arXiv Detail & Related papers (2023-07-26T09:19:21Z) - CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning [63.72975421109622]
CleanCLIP is a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks.
CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning.
arXiv Detail & Related papers (2023-03-06T17:48:32Z) - Task-Feature Collaborative Learning with Application to Personalized Attribute Prediction [166.87111665908333]
We propose a novel multi-task learning method called Task-Feature Collaborative Learning (TFCL).
Specifically, we first propose a base model with a heterogeneous block-diagonal structure regularizer to leverage the collaborative grouping of features and tasks.
As a practical extension, we extend the base model by allowing overlapping features and differentiating the hard tasks.
arXiv Detail & Related papers (2020-04-29T02:32:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.