Related papers: Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models

URL: http://arxiv.org/abs/2503.11519v4
Date: Wed, 05 Nov 2025 14:45:59 GMT
Title: Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models
Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, Renjing Xu
Abstract summary: Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. This paper comprehensively investigates the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs. We thoroughly evaluate the TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics.
Score: 64.55456491855678
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-Vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision Language Models (LVLMs) and I2I Generation Models (GMs) are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to produce disruptive outputs that are semantically aligned with those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of cross-vision tasks. However, the specific characteristics of the threats posed by visual prompts remain underexplored. In this paper, to comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs, we propose the Typographic Visual Prompts Injection Dataset and thoroughly evaluate the TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics, deepening the understanding of TVPI threats.

Related papers

Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models [0.0]
Existing evaluation datasets lean towards text-only prompts, leaving visual vulnerabilities under-evaluated. We propose Text2VLM, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats. Text2VLM provides a scalable tool for comprehensive safety assessment, contributing to the development of more robust safety mechanisms for Visual Language Models.
arXiv Detail & Related papers (2025-07-28T10:57:44Z)
SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding [5.976839106353883]
SECOND (Selective and Contrastive Decoding) is a novel approach that enables Vision-Language Models to leverage multi-scale visual information in an object-centric manner. SECOND significantly reduces perceptual hallucinations and outperforms a wide range of benchmarks.
arXiv Detail & Related papers (2025-06-10T02:55:38Z)
Blind Spot Navigation: Evolutionary Discovery of Sensitive Semantic Concepts for LVLMs [24.76767896607915]
Recent research suggests that models may be particularly sensitive to certain semantics in visual inputs, making them prone to errors. Inspired by this, in this paper we conducted the first exploration on large vision-language models (LVLMs). We found that LVLMs are indeed susceptible to hallucinations and various errors when facing specific semantic concepts in images.
arXiv Detail & Related papers (2025-05-21T08:45:43Z)
Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection [71.60120616284246]
We propose a novel knowledge-based prompt learning framework to explore the strong generalization capability of vision-language models for 3D mask presentation attack detection. Experimental results demonstrate that the proposed method achieves state-of-the-art intra- and cross-scenario detection performance.
arXiv Detail & Related papers (2025-05-06T15:09:23Z)
BMRL: Bi-Modal Guided Multi-Perspective Representation Learning for Zero-Shot Deepfake Attribution [19.78648266444095]
We propose a novel framework for zero-shot deepfake attribution (ZS-DFA). Specifically, we design a multi-perspective visual encoder (MPVE) to explore general deepfake attribution visual characteristics across three views. A language encoder is proposed to capture fine-grained language embeddings, facilitating language-guided general visual representation learning.
arXiv Detail & Related papers (2025-04-19T01:11:46Z)
Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models [93.46875303598577]
Vision-language models (VLMs) have advanced rapidly in processing multimodal information, but their ability to reconcile conflicting signals remains underexplored. This work investigates how VLMs process ASCII art, a unique medium where textual elements collectively form visual patterns, potentially creating semantic-visual conflicts.
arXiv Detail & Related papers (2025-04-02T10:47:07Z)
TrojVLM: Backdoor Attack Against Vision Language Models [50.87239635292717]
This study introduces TrojVLM, the first exploration of backdoor attacks aimed at Vision Language Models (VLMs). TrojVLM inserts predetermined target text into output text when encountering poisoned images. A novel semantic preserving loss is proposed to ensure the semantic integrity of the original image content.
arXiv Detail & Related papers (2024-09-28T04:37:09Z)
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest. This technique allows LVLMs to access more detailed visual information without altering the original image resolution. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model [23.764618459753326]
Typographic attacks are also expected to pose a security threat to LVLMs. We verify typographic attacks on current well-known commercial and open-source LVLMs. To better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic dataset to date.
arXiv Detail & Related papers (2024-02-29T13:31:56Z)
Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo). DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs). We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs). We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z)
Delving into Multimodal Prompting for Fine-grained Visual Classification [57.12570556836394]
Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks. We propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the Contrastive Language-Image Pre-training (CLIP) model.
arXiv Detail & Related papers (2023-09-16T07:30:52Z)
SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery [15.490603884631764]
We develop an end-to-end trainable Language-Vision GPT model that expands the GPT2 model to include vision input (image). We prove that the LV-GPT model outperforms other state-of-the-art VQA models on two publicly available surgical-VQA datasets.
arXiv Detail & Related papers (2023-04-19T21:22:52Z)
Visually-augmented pretrained language models for NLP tasks without images [77.74849855049523]
Existing solutions often rely on explicit images for visual knowledge augmentation. We propose a novel Visually-Augmented fine-tuning approach. Our approach can consistently improve the performance of BERT, RoBERTa, BART, and T5 at different scales.
arXiv Detail & Related papers (2022-12-15T16:13:25Z)
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language. We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language. We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.