Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
- URL: http://arxiv.org/abs/2505.22943v1
- Date: Wed, 28 May 2025 23:45:55 GMT
- Title: Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
- Authors: Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim
- Abstract summary: We introduce Multimodal Adversarial Compositionality (MAC) to generate deceptive text samples. We evaluate them through both sample-wise attack success rate and group-wise entropy-based diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples to exploit these vulnerabilities across different modalities and evaluates them through both sample-wise attack success rate and group-wise entropy-based diversity. To improve zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, videos, and audio.
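The abstract's two evaluation axes (sample-wise attack success rate and group-wise entropy-based diversity) can be sketched as follows. The success criterion (adversarial text scoring higher than the original caption) and the exact diversity statistic are illustrative assumptions here, not the paper's precise definitions:

```python
import math
from collections import Counter

def attack_success_rate(sims_orig, sims_adv):
    """Sample-wise ASR: fraction of adversarial texts whose similarity
    to the target (e.g., CLIP image-text score) exceeds that of the
    original caption. Illustrative success criterion only."""
    assert len(sims_orig) == len(sims_adv)
    wins = sum(1 for o, a in zip(sims_orig, sims_adv) if a > o)
    return wins / len(sims_orig)

def entropy_diversity(samples):
    """Group-wise diversity: Shannon entropy over the generated text
    samples. Higher entropy means fewer repeated attack texts."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

A diversity-promoting filter in the self-training loop would then keep only fine-tuning candidates that raise both quantities, rather than optimizing attack success alone.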
Related papers
- MBA: Multimodal Bidirectional Attack for Referring Expression Segmentation Models [2.5931446496646204]
Referring Expression Segmentation (RES) enables precise object segmentation in images based on natural language descriptions. Despite its impressive performance, the robustness of RES models against adversarial examples remains largely unexplored. We propose a novel adversarial attack strategy termed Multimodal Bidirectional Attack, tailored for RES models.
arXiv Detail & Related papers (2025-06-19T09:14:04Z)
- Enhancing Adversarial Robustness of Vision Language Models via Adversarial Mixture Prompt Tuning [14.714776642137247]
Adversarial Mixture Prompt Tuning (AMPT) aims to learn mixture text prompts to obtain more robust text features. A series of experiments show that our method can achieve better adversarial robustness than state-of-the-art methods.
arXiv Detail & Related papers (2025-05-23T06:04:15Z)
- Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs [28.20725794099928]
We present UniME, a novel framework that learns discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning.
arXiv Detail & Related papers (2025-04-24T10:51:52Z)
- Robust image classification with multi-modal large language models [4.709926629434273]
Adversarial examples can cause Deep Neural Networks to make incorrect predictions with high confidence. To mitigate these vulnerabilities, adversarial training and detection-based defenses have been proposed to strengthen models in advance. We propose a novel defense, MultiShield, designed to combine and complement these defenses with multi-modal information.
arXiv Detail & Related papers (2024-12-13T18:49:25Z)
- Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality [69.76121008898677]
Fine-grained Selective Calibrated CLIP (FSC-CLIP) integrates local hard negative loss and selective calibrated regularization.
Our evaluations show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities.
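The "local hard negative loss" mentioned in this entry is, in its standard form, an InfoNCE-style contrastive objective in which perturbed (hard negative) captions compete with the positive caption for the image. The sketch below illustrates that generic form under this assumption; it is not FSC-CLIP's exact objective:

```python
import math

def hard_negative_nce(pos_sim, neg_sims, temperature=0.07):
    """InfoNCE-style loss with hard negative captions: the positive
    image-text similarity competes against similarities to perturbed
    captions. Returns -log softmax probability of the positive."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(pos_sim / temperature - log_z)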
arXiv Detail & Related papers (2024-10-07T17:16:20Z)
- M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains.
In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z) - Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach [30.9778838504609]
Vision-language pretraining with transformers has demonstrated exceptional performance across numerous multimodal tasks.
Existing multimodal attack methods have largely overlooked cross-modal interactions between visual and textual modalities.
We propose a novel Joint Multimodal Transformer Feature Attack (JMTFA) that concurrently introduces adversarial perturbations in both visual and textual modalities.
arXiv Detail & Related papers (2024-08-24T04:31:37Z) - Beyond DAGs: A Latent Partial Causal Model for Multimodal Learning [80.44084021062105]
We propose a novel latent partial causal model for multimodal data, featuring two latent coupled variables, connected by an undirected edge, to represent the transfer of knowledge across modalities.<n>Under specific statistical assumptions, we establish an identifiability result, demonstrating that representations learned by multimodal contrastive learning correspond to the latent coupled variables up to a trivial transformation.<n>Experiments on a pre-trained CLIP model embodies disentangled representations, enabling few-shot learning and improving domain generalization across diverse real-world datasets.
arXiv Detail & Related papers (2024-02-09T07:18:06Z) - SA-Attack: Improving Adversarial Transferability of Vision-Language
Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z) - Set-level Guidance Attack: Boosting Adversarial Transferability of
Vision-Language Pre-training Models [52.530286579915284]
We present the first study to investigate the adversarial transferability of vision-language pre-training models.
The transferability degradation is partly caused by the under-utilization of cross-modal interactions.
We propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance.
arXiv Detail & Related papers (2023-07-26T09:19:21Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.