CaptionFool: Universal Image Captioning Model Attacks
- URL: http://arxiv.org/abs/2603.00529v1
- Date: Sat, 28 Feb 2026 07:57:23 GMT
- Title: CaptionFool: Universal Image Captioning Model Attacks
- Authors: Swapnil Parekh
- Abstract summary: We present CaptionFool, a novel adversarial attack against state-of-the-art transformer-based captioning models. Our attack achieves a 94-96% success rate in generating arbitrary target captions, including offensive content.
- Score: 1.3011345529764784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning models are encoder-decoder architectures trained on large-scale image-text datasets, making them susceptible to adversarial attacks. We present CaptionFool, a novel universal (input-agnostic) adversarial attack against state-of-the-art transformer-based captioning models. By modifying only 7 out of 577 image patches (approximately 1.2% of the image), our attack achieves a 94-96% success rate in generating arbitrary target captions, including offensive content. We further demonstrate that CaptionFool can generate "slang" terms specifically designed to evade existing content moderation filters. Our findings expose critical vulnerabilities in deployed vision-language models and underscore the urgent need for robust defenses against such attacks. Warning: This paper contains model outputs which are offensive in nature.
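The patch budget quoted in the abstract can be checked with a few lines of arithmetic, and the "universal" property can be illustrated with a toy data layout. The following is a minimal sketch, not the authors' method: it assumes an image is represented as an ordered list of flattened patches, and the names `patch_budget` and `apply_universal_patches` are hypothetical helpers introduced only for illustration. It does not show how the adversarial patch contents are optimised, which the abstract does not describe.

```python
from typing import Dict, List

Patch = List[float]   # flattened pixel values of one patch (assumed layout)
Image = List[Patch]   # an image as an ordered sequence of patches


def patch_budget(num_modified: int, num_total: int) -> float:
    """Fraction of patches the attacker is allowed to modify."""
    if not 0 <= num_modified <= num_total:
        raise ValueError("num_modified must lie in [0, num_total]")
    return num_modified / num_total


def apply_universal_patches(image: Image, adv: Dict[int, Patch]) -> Image:
    """Return a copy of `image` with the patches listed in `adv` overwritten.

    The same `adv` mapping (patch index -> replacement patch) is reused for
    every input, which is what makes the perturbation input-agnostic.
    """
    out = [list(p) for p in image]          # copy so the input is not mutated
    for idx, patch in adv.items():
        if len(patch) != len(out[idx]):
            raise ValueError("patch size mismatch")
        out[idx] = list(patch)
    return out


if __name__ == "__main__":
    NUM_PATCHES = 577    # figure quoted in the abstract
    NUM_MODIFIED = 7     # figure quoted in the abstract
    print(f"budget: {patch_budget(NUM_MODIFIED, NUM_PATCHES):.2%}")

    # Toy data: two different "images" receive the identical perturbation.
    # The patch positions 0..6 and the patch length 4 are arbitrary choices.
    adv = {i: [1.0] * 4 for i in range(NUM_MODIFIED)}
    img_a = [[0.0] * 4 for _ in range(NUM_PATCHES)]
    img_b = [[0.5] * 4 for _ in range(NUM_PATCHES)]
    for img in (img_a, img_b):
        changed = sum(p != q for p, q in zip(img, apply_universal_patches(img, adv)))
        print(f"patches changed: {changed}")
```

The ratio 7/577 is about 1.21%, which matches the "approximately 1.2%" stated in the abstract.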
Related papers
- Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation [74.5813283875938]
Zero-shot image-to-image generation poses substantial risks related to intellectual property violations. This work presents Adapter Shield, the first universal and authentication-integrated solution aimed at defending personal images from misuse. Our method surpasses existing state-of-the-art defenses in blocking unauthorized zero-shot image synthesis.
Date: 2025-11-25T04:49:16Z
- SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs [57.880467106470775]
Attackers can inject imperceptible perturbations into the training data, causing the model to generate malicious, attacker-controlled captions. We propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without prior knowledge of triggers. SRD uses a Deep Q-Network to learn policies for applying discrete perturbations to sensitive image regions, aiming to disrupt the activation of malicious pathways.
Date: 2025-06-05T08:22:24Z
- Web Artifact Attacks Disrupt Vision Language Models [61.59021920232986]
Vision-language models (VLMs) are trained on large-scale, lightly curated web datasets. They learn unintended correlations between semantic concepts and unrelated visual signals. Prior work has weaponized these correlations as an attack vector to manipulate model predictions. We introduce "artifact-based" attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements.
Date: 2025-03-17T18:59:29Z
- Typographic Attacks in a Multi-Image Setting [2.9154316123656927]
We introduce a multi-image setting for studying typographic attacks. Specifically, our focus is on attacking image sets without repeating the attack query. We introduce two attack strategies for the multi-image setting, leveraging the difficulty of the target image, the strength of the attack text, and text-image similarity.
Date: 2025-02-12T08:10:25Z
- MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
Date: 2024-06-13T15:55:04Z
- Stealthy Targeted Backdoor Attacks against Image Captioning [16.409633596670368]
We present a novel method to craft targeted backdoor attacks against image caption models.
Our method first learns a special trigger by leveraging universal perturbation techniques for object detection.
Our approach can achieve a high attack success rate while having a negligible impact on model clean performance.
Date: 2024-06-09T18:11:06Z
- BAGM: A Backdoor Attack for Manipulating Text-to-Image Generative Models [54.19289900203071]
The rise in popularity of text-to-image generative artificial intelligence has attracted widespread public interest.
We demonstrate that this technology can be attacked to generate content that subtly manipulates its users.
We propose a Backdoor Attack on text-to-image Generative Models (BAGM).
Our attack is the first to target three popular text-to-image generative models across three stages of the generative process.
Date: 2023-07-31T08:34:24Z
- I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models [0.0]
We present a gray-box adversarial attack on image-to-text, both untargeted and targeted.
Our attack operates in a gray-box manner, requiring no knowledge about the decoder module.
We also show that our attacks fool the popular open-source platform Hugging Face.
Date: 2023-06-13T07:35:28Z
- Content-based Unrestricted Adversarial Attack [53.181920529225906]
We propose a novel unrestricted attack framework called Content-based Unrestricted Adversarial Attack.
By leveraging a low-dimensional manifold that represents natural images, we map the images onto the manifold and optimize them along its adversarial direction.
Date: 2023-05-18T02:57:43Z
- Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks [52.26631767748843]
We propose ROCLIP, the first effective method for robust pre-training of multimodal vision-language models against targeted data poisoning and backdoor attacks.
ROCLIP effectively breaks the association between poisoned image-caption pairs by considering a relatively large and varying pool of random captions.
Our experiments show that ROCLIP renders state-of-the-art targeted data poisoning and backdoor attacks ineffective during pre-training of CLIP models.
Date: 2023-03-13T04:49:46Z
- Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis [16.421253324649555]
We introduce backdoor attacks against text-guided generative models.
Our attacks only slightly alter an encoder so that no suspicious model behavior is apparent for image generations with clean prompts.
Date: 2022-11-04T12:36:36Z
This list is automatically generated from the titles and abstracts of the papers on this site.