Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models
- URL: http://arxiv.org/abs/2601.10313v1
- Date: Thu, 15 Jan 2026 11:45:56 GMT
- Title: Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models
- Authors: Peng-Fei Zhang, Zi Huang
- Abstract summary: HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, we disentangle adversarial examples into clean images and perturbations, allowing each component to be handled independently. For the text modality, HRA identifies globally influential words by combining intra-sentence and inter-sentence importance measures.
- Score: 41.79238283279954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing adversarial attacks on vision-language pre-training (VLP) models are mostly sample-specific, incurring substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose the Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, we disentangle adversarial examples into clean images and perturbations, allowing each component to be handled independently for more effective disruption of cross-modal alignment. We further introduce a ScMix augmentation strategy that diversifies visual contexts and strengthens both the global and local utility of UAPs, thereby reducing reliance on spurious features. In addition, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients, which avoids local minima and stabilizes universal perturbation learning. For the text modality, HRA identifies globally influential words by combining intra-sentence and inter-sentence importance measures, and then uses these words as universal text perturbations. Extensive experiments across downstream tasks, VLP models, and datasets demonstrate the superiority of the proposed universal multimodal attacks.
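To make the optimization-level refinement concrete, below is a minimal PyTorch sketch of a universal-perturbation update that blends a historical momentum term with a gradient estimated at a lookahead point, a Nesterov-style stand-in for the abstract's "historical and estimated future gradients". The `loss_fn` interface (a differentiable cross-modal alignment loss the attacker maximizes), the loader format, and all hyperparameters are assumptions, not the authors' actual API.

```python
import torch

def refine_uap(loss_fn, loader, shape, eps=8/255, lr=1/255, mu=0.9, epochs=1, device="cpu"):
    # One perturbation shared by every image in the dataset (universal).
    delta = torch.zeros(shape, device=device)
    g_hist = torch.zeros_like(delta)  # historical gradient direction (momentum)
    for _ in range(epochs):
        for images, texts in loader:
            images = images.to(device)
            # Estimate a "future" gradient at a lookahead point along the
            # momentum direction (Nesterov-style), a hypothetical stand-in
            # for the paper's temporal hierarchy of gradients.
            lookahead = (delta + mu * lr * g_hist).detach().requires_grad_(True)
            loss = loss_fn(images + lookahead, texts)  # alignment loss to maximize
            (g_future,) = torch.autograd.grad(loss, lookahead)
            # Blend history with the lookahead gradient, then ascend and project.
            g_hist = mu * g_hist + g_future / (g_future.abs().mean() + 1e-12)
            delta = (delta + lr * g_hist.sign()).clamp(-eps, eps)
    return delta
```

The final `clamp` keeps the perturbation inside an L-infinity ball of radius `eps`, the usual projection for UAPs; the sign-ascent step follows standard momentum-attack practice.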
Related papers
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models [67.45032003041399]
We propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. MPCO adaptively balances the importance of different paradigm representations and guides the global optimization. Our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
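The blurb says only that MPCO "adaptively balances" the paradigm representations during global optimization. One plausible reading, sketched below as a guess rather than the paper's actual rule, weights each paradigm's gradient by a softmax over its gradient norm:

```python
import torch

def balanced_paradigm_gradient(paradigm_losses, delta):
    """Hypothetical adaptive balancing of several attack-paradigm losses
    over a shared perturbation `delta` (requires_grad=True)."""
    grads = [torch.autograd.grad(loss, delta, retain_graph=True)[0]
             for loss in paradigm_losses]
    # Paradigms whose loss reacts more strongly to delta get larger weights.
    weights = torch.softmax(torch.stack([g.norm() for g in grads]), dim=0)
    return sum(w * g for w, g in zip(weights, grads))
```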
arXiv Detail & Related papers (2026-03-05T06:01:26Z)
- OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL [63.388513841293616]
Existing forgery detection methods fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. We propose OmniVL-Guard, a balanced reinforcement learning framework for this task.
arXiv Detail & Related papers (2026-02-11T09:41:36Z)
- A Two-Stage Globally-Diverse Adversarial Attack for Vision-Language Pre-training Models [3.9965186683223606]
Existing multimodal attacks often suffer from limited perturbation diversity and unstable multi-stage pipelines. We propose 2S-GDA, a two-stage globally-diverse attack framework. Our framework is modular and can be easily combined with existing methods to further enhance adversarial transferability.
arXiv Detail & Related papers (2026-01-18T08:05:33Z)
- Unified Enhancement of the Generalization and Robustness of Language Models via Bi-Stage Optimization [2.502393972789905]
We propose a bi-stage optimization framework to uniformly enhance both the generalization and robustness of language models (LMs). We show that our method significantly improves the generalization and robustness of LMs compared to other existing methods.
arXiv Detail & Related papers (2025-03-19T13:50:36Z)
- Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models [58.936893810674896]
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. We introduce a multimodal large language model framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS). We propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images.
arXiv Detail & Related papers (2025-01-03T09:25:04Z)
- Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality [69.76121008898677]
Fine-grained Selective Calibrated CLIP (FSC-CLIP) integrates a local hard negative loss and selective calibrated regularization.
Our evaluations show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities.
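As a rough illustration of what a "local hard negative loss" can look like (the blurb gives no formula, so the margin form below is an assumption, not FSC-CLIP's exact objective):

```python
import torch.nn.functional as F

def local_hard_negative_loss(img_emb, pos_txt_emb, hard_neg_txt_emb, margin=0.2):
    # Similarity to the true caption vs. a minimally edited hard-negative caption.
    pos = F.cosine_similarity(img_emb, pos_txt_emb, dim=-1)
    neg = F.cosine_similarity(img_emb, hard_neg_txt_emb, dim=-1)
    # Hinge: the true caption must beat the hard negative by at least `margin`.
    return F.relu(margin + neg - pos).mean()
```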
arXiv Detail & Related papers (2024-10-07T17:16:20Z)
- Sample-agnostic Adversarial Perturbation for Vision-Language Pre-training Models [7.350203999073509]
Recent studies on AI security have highlighted the vulnerability of Vision-Language Pre-training models to subtle yet intentionally designed perturbations in images and texts.
To the best of our knowledge, this is the first work to exploit multimodal decision boundaries to craft a universal, sample-agnostic perturbation that applies to any image.
arXiv Detail & Related papers (2024-08-06T06:25:39Z)
- Universal Adversarial Perturbations for Vision-Language Pre-trained Models [30.04163729936878]
We propose a novel black-box method to generate Universal Adversarial Perturbations (UAPs).
The ETU takes into account the characteristics of UAPs and the intrinsic cross-modal interactions to generate effective UAPs.
To further enhance the effectiveness and transferability of UAPs, we also design a novel data augmentation method named ScMix.
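Neither this abstract nor HRA's spells out the ScMix recipe, so the sketch below is one plausible reading: a self-mix of each image with a rescaled copy of itself, followed by a cross-mix with a shuffled partner image. Treat it as an assumption-laden illustration, not the published method.

```python
import torch
import torch.nn.functional as F

def scmix(batch, beta=1.0):
    # Hypothetical ScMix-style augmentation for a batch of images (B, C, H, W).
    dist = torch.distributions.Beta(beta, beta)
    # Self-mix: blend each image with a downscaled-then-upscaled view of itself,
    # diversifying local context while keeping global semantics.
    small = F.interpolate(batch, scale_factor=0.5, mode="bilinear")
    small = F.interpolate(small, size=batch.shape[-2:], mode="bilinear")
    lam_s = dist.sample().item()
    mixed = lam_s * batch + (1 - lam_s) * small
    # Cross-mix: mixup with a randomly shuffled partner image from the batch.
    lam_c = dist.sample().item()
    partner = mixed[torch.randperm(mixed.size(0))]
    return lam_c * mixed + (1 - lam_c) * partner
```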
arXiv Detail & Related papers (2024-05-09T03:27:28Z)
- A Novel Cross-Perturbation for Single Domain Generalization [54.612933105967606]
Single domain generalization aims to enhance the ability of the model to generalize to unknown domains when trained on a single source domain.
The limited diversity in the training data hampers the learning of domain-invariant features, resulting in compromised generalization performance.
We propose CPerb, a simple yet effective cross-perturbation method to enhance the diversity of the training data.
arXiv Detail & Related papers (2023-08-02T03:16:12Z)
- Enhancing the Self-Universality for Transferable Targeted Attacks [88.6081640779354]
We propose a new attack method based on the observation that highly universal adversarial perturbations tend to be more transferable for targeted attacks.
Instead of optimizing perturbations across different images, optimizing over different regions of a single image to achieve self-universality removes the need for extra data.
With the feature similarity loss, our method makes the features induced by adversarial perturbations more dominant than those of benign images.
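A hedged sketch of the self-universality idea as this blurb describes it: compare the features of the full adversarial image against those of a random local crop, and reward agreement so the perturbation carries the attack signal everywhere. The `model` interface and the crop handling are assumptions.

```python
import torch
import torch.nn.functional as F

def self_universality_loss(model, adv_image, crop_size=112):
    # Global features of the full adversarial image (B, C, H, W input).
    feat_global = model(adv_image)
    # Features of a random local crop, resized back to the input resolution.
    h, w = adv_image.shape[-2:]
    top = torch.randint(0, h - crop_size + 1, (1,)).item()
    left = torch.randint(0, w - crop_size + 1, (1,)).item()
    crop = adv_image[..., top:top + crop_size, left:left + crop_size]
    crop = F.interpolate(crop, size=(h, w), mode="bilinear")
    feat_local = model(crop)
    # Minimizing this term pulls local and global features together, so the
    # perturbation, not the benign content, dominates the representation.
    return 1 - F.cosine_similarity(feat_global, feat_local, dim=-1).mean()
```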
arXiv Detail & Related papers (2022-09-08T11:21:26Z)