Related papers: MBA: Multimodal Bidirectional Attack for Referring Expression Segmentation Models

MBA: Multimodal Bidirectional Attack for Referring Expression Segmentation Models

URL: http://arxiv.org/abs/2506.16157v1
Date: Thu, 19 Jun 2025 09:14:04 GMT
Title: MBA: Multimodal Bidirectional Attack for Referring Expression Segmentation Models
Authors: Xingbai Chen, Tingchao Fu, Renyang Liu, Wei Zhou, Chao Yi,
Abstract summary: Referring Expression (RES) enables precise object segmentation in images based on natural language descriptions.<n>Despite its impressive performance, the robustness of RES models against adversarial examples remains largely unexplored.<n>We propose a novel adversarial attack strategy termed textbfMultimodal Bidirectional Attack, tailored for RES models.
Score: 2.5931446496646204
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Referring Expression Segmentation (RES) enables precise object segmentation in images based on natural language descriptions, offering high flexibility and broad applicability in real-world vision tasks. Despite its impressive performance, the robustness of RES models against adversarial examples remains largely unexplored. While prior adversarial attack methods have explored adversarial robustness on conventional segmentation models, they perform poorly when directly applied to RES, failing to expose vulnerabilities in its multimodal structure. Moreover, in practical open-world scenarios, users typically issue multiple, diverse referring expressions to interact with the same image, highlighting the need for adversarial examples that generalize across varied textual inputs. To address these multimodal challenges, we propose a novel adversarial attack strategy termed \textbf{Multimodal Bidirectional Attack}, tailored for RES models. Our method introduces learnable proxy textual embedding perturbation and jointly performs visual-aligned optimization on the image modality and textual-adversarial optimization on the textual modality during attack generation. This dual optimization framework encourages adversarial images to actively adapt to more challenging text embedding during optimization, thereby enhancing their cross-text transferability, which refers to the ability of adversarial examples to remain effective under a variety of unseen or semantically diverse textual inputs. Extensive experiments conducted on multiple RES models and benchmark datasets demonstrate the superior effectiveness of our method compared to existing methods.

Related papers

CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension [49.6969505536365]
We propose CREM, with a unified framework that enhances multimodal representations for retrieval while preserving generative ability.<n>CREM achieves state-of-the-art retrieval performance on MMEB while maintaining strong generative performance on multiple comprehension benchmarks.
arXiv Detail & Related papers (2026-02-22T08:09:51Z)
What Makes You Unique? Attribute Prompt Composition for Object Re-Identification [70.67907354506278]
Object Re-IDentification aims to recognize individuals across non-overlapping camera views.<n>Single-domain models tend to overfit to domain-specific features, whereas cross-domain models often rely on diverse normalization strategies.<n>We propose an Attribute Prompt Composition framework, which exploits textual semantics to jointly enhance discrimination and generalization.
arXiv Detail & Related papers (2025-09-23T07:03:08Z)
CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards [53.36917093757101]
Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs)<n>We introduce textbfCogDual, a novel RPLA adopting a textitcognize-then-respond reasoning paradigm.<n>By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment.
arXiv Detail & Related papers (2025-07-23T02:26:33Z)
Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning [28.15997901023315]
Recall is a novel adversarial framework designed to compromise the robustness of unlearned IGMs.<n>It consistently outperforms existing baselines in terms of adversarial effectiveness, computational efficiency, and semantic fidelity with the original prompt.<n>These findings reveal critical vulnerabilities in current unlearning mechanisms and underscore the need for more robust solutions.
arXiv Detail & Related papers (2025-07-09T02:59:01Z)
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy [20.581841892290672]
We propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition.<n>Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models.<n>We introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement.
arXiv Detail & Related papers (2025-07-02T14:14:35Z)
EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications [24.832537917472894]
EVADE is the first expert-curated, Chinese, multimodal benchmark designed to evaluate foundation models on evasive content detection in e-commerce.<n>The dataset contains 2,833 annotated text samples and 13,961 images spanning six demanding product categories.
arXiv Detail & Related papers (2025-05-23T09:18:01Z)
Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models [58.936893810674896]
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems.<n>We introduce a multimodal large language model framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS)<n>We propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images.
arXiv Detail & Related papers (2025-01-03T09:25:04Z)
Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack [51.16384207202798]
Vision-language pre-training models are vulnerable to multimodal adversarial examples (AEs) Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process. We propose sampling from adversarial evolution triangles composed of clean, historical, and current adversarial examples to enhance adversarial diversity.
arXiv Detail & Related papers (2024-11-04T23:07:51Z)
Generative Adversarial Patches for Physical Attacks on Cross-Modal Pedestrian Re-Identification [24.962600785183582]
Visible-infrared pedestrian Re-identification (VI-ReID) aims to match pedestrian images captured by infrared cameras and visible cameras. This paper introduces the first physical adversarial attack against VI-ReID models.
arXiv Detail & Related papers (2024-10-26T06:40:10Z)
Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models [32.23201683108716]
We propose a novel strategy that exclusively employs image patches for attacks, thus preserving the integrity of the original text. Our method leverages prior knowledge from diffusion models to enhance the authenticity and naturalness of the perturbations. Comprehensive experiments conducted in a white-box setting for image-to-text scenarios reveal that our proposed method significantly outperforms existing techniques, achieving a 100% attack success rate.
arXiv Detail & Related papers (2024-10-07T10:06:01Z)
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models. Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs. Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z)
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory [8.591762884862504]
Vision-language pre-training models are susceptible to multimodal adversarial examples (AEs) We propose using diversification along the intersection region of adversarial trajectory to expand the diversity of AEs. To further mitigate the potential overfitting, we direct the adversarial text deviating from the last intersection region along the optimization path.
arXiv Detail & Related papers (2024-03-19T05:10:10Z)
Mutual-modality Adversarial Attack with Semantic Perturbation [81.66172089175346]
We propose a novel approach that generates adversarial attacks in a mutual-modality optimization scheme. Our approach outperforms state-of-the-art attack methods and can be readily deployed as a plug-and-play solution.
arXiv Detail & Related papers (2023-12-20T05:06:01Z)
Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity [80.16488817177182]
GNNs are vulnerable to the model stealing attack, a nefarious endeavor geared towards duplicating the target model via query permissions. We introduce three model stealing attacks to adapt to different actual scenarios.
arXiv Detail & Related papers (2023-12-18T05:42:31Z)
SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios. We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization [65.57380193070574]
Vision-language pre-training models are vulnerable to multi-modal adversarial examples. Recent works have indicated that leveraging data augmentation and image-text modal interactions can enhance the transferability of adversarial examples. We propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack.
arXiv Detail & Related papers (2023-12-07T16:16:50Z)
Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence [48.659338080020746]
Referring image segmentation (RIS) aims to segment objects in an image conditioning on free-from text descriptions. We present a novel RIS approach, which substantially improves the generalization ability by addressing the two dilemmas mentioned above. Specially, to deal with unconstrained texts, we propose to boost a given expression with an explicit and crucial prompt, which complements the expression in a unified context.
arXiv Detail & Related papers (2023-12-01T09:31:24Z)
Iterative Adversarial Attack on Image-guided Story Ending Generation [37.42908817585858]
Multimodal learning involves developing models that can integrate information from various sources like images and texts. Deep neural networks, which are the backbone of recent IgSEG models, are vulnerable to adversarial samples. We propose an iterative adversarial attack method (Iterative-attack) that fuses image and text modality attacks.
arXiv Detail & Related papers (2023-05-16T06:19:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.