Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
- URL: http://arxiv.org/abs/2505.22943v1
- Date: Wed, 28 May 2025 23:45:55 GMT
- Title: Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
- Authors: Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim
- Abstract summary: We introduce Multimodal Adversarial Compositionality (MAC) to generate deceptive text samples. We evaluate them through both sample-wise attack success rate and group-wise entropy-based diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples to exploit these vulnerabilities across different modalities and evaluates them through both sample-wise attack success rate and group-wise entropy-based diversity. To improve zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, videos, and audio.
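The abstract's two evaluation axes (sample-wise attack success rate and group-wise entropy-based diversity) can be sketched as follows. The success criterion (adversarial text scoring higher than the original caption) and the exact diversity statistic are illustrative assumptions here, not the paper's precise definitions:

```python
import math
from collections import Counter

def attack_success_rate(sims_orig, sims_adv):
    """Sample-wise ASR: fraction of adversarial texts whose similarity
    to the target (e.g., CLIP image-text score) exceeds that of the
    original caption. Illustrative success criterion only."""
    assert len(sims_orig) == len(sims_adv)
    wins = sum(1 for o, a in zip(sims_orig, sims_adv) if a > o)
    return wins / len(sims_orig)

def entropy_diversity(samples):
    """Group-wise diversity: Shannon entropy over the generated text
    samples. Higher entropy means fewer repeated attack texts."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

A diversity-promoting filter in the self-training loop would then keep only fine-tuning candidates that raise both quantities, rather than optimizing attack success alone.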
Related papers
- MBA: Multimodal Bidirectional Attack for Referring Expression Segmentation Models [2.5931446496646204]
Referring Expression Segmentation (RES) enables precise object segmentation in images based on natural language descriptions. Despite its impressive performance, the robustness of RES models against adversarial examples remains largely unexplored. We propose a novel adversarial attack strategy termed Multimodal Bidirectional Attack, tailored for RES models.
arXiv Detail & Related papers (2025-06-19T09:14:04Z)
- Enhancing Adversarial Robustness of Vision Language Models via Adversarial Mixture Prompt Tuning [14.714776642137247]
Adversarial Mixture Prompt Tuning (AMPT) aims to learn mixture text prompts to obtain more robust text features. A series of experiments show that our method can achieve better adversarial robustness than state-of-the-art methods.
arXiv Detail & Related papers (2025-05-23T06:04:15Z)
- Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs [28.20725794099928]
We present UniME, a novel framework that learns discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning.
arXiv Detail & Related papers (2025-04-24T10:51:52Z)
- Robust image classification with multi-modal large language models [4.709926629434273]
Adversarial examples can cause Deep Neural Networks to make incorrect predictions with high confidence. To mitigate these vulnerabilities, adversarial training and detection-based defenses have been proposed to strengthen models in advance. We propose a novel defense, MultiShield, designed to combine and complement these defenses with multi-modal information.
arXiv Detail & Related papers (2024-12-13T18:49:25Z)
- Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality [69.76121008898677]
Fine-grained Selective Calibrated CLIP (FSC-CLIP) integrates local hard negative loss and selective calibrated regularization.
Our evaluations show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities.
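The "local hard negative loss" mentioned in this entry is, in its standard form, an InfoNCE-style contrastive objective in which perturbed (hard negative) captions compete with the positive caption for the image. The sketch below illustrates that generic form under this assumption; it is not FSC-CLIP's exact objective:

```python
import math

def hard_negative_nce(pos_sim, neg_sims, temperature=0.07):
    """InfoNCE-style loss with hard negative captions: the positive
    image-text similarity competes against similarities to perturbed
    captions. Returns -log softmax probability of the positive."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(pos_sim / temperature - log_z)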
arXiv Detail & Related papers (2024-10-07T17:16:20Z)
- M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains.
In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z) - Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach [30.9778838504609]
Vision-language pretraining with transformers has demonstrated exceptional performance across numerous multimodal tasks.
Existing multimodal attack methods have largely overlooked cross-modal interactions between visual and textual modalities.
We propose a novel Joint Multimodal Transformer Feature Attack (JMTFA) that concurrently introduces adversarial perturbations in both visual and textual modalities.
arXiv Detail & Related papers (2024-08-24T04:31:37Z) - Beyond DAGs: A Latent Partial Causal Model for Multimodal Learning [80.44084021062105]
We propose a novel latent partial causal model for multimodal data, featuring two latent coupled variables, connected by an undirected edge, to represent the transfer of knowledge across modalities.<n>Under specific statistical assumptions, we establish an identifiability result, demonstrating that representations learned by multimodal contrastive learning correspond to the latent coupled variables up to a trivial transformation.<n>Experiments on a pre-trained CLIP model embodies disentangled representations, enabling few-shot learning and improving domain generalization across diverse real-world datasets.
arXiv Detail & Related papers (2024-02-09T07:18:06Z) - SA-Attack: Improving Adversarial Transferability of Vision-Language
Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z) - Set-level Guidance Attack: Boosting Adversarial Transferability of
Vision-Language Pre-training Models [52.530286579915284]
We present the first study to investigate the adversarial transferability of vision-language pre-training models.
The transferability degradation is partly caused by the under-utilization of cross-modal interactions.
We propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance.
arXiv Detail & Related papers (2023-07-26T09:19:21Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.