Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition
- URL: http://arxiv.org/abs/2601.04752v1
- Date: Thu, 08 Jan 2026 09:15:27 GMT
- Title: Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition
- Authors: Masatomo Yoshida, Haruto Namura, Nicola Adami, Masahiro Okuda
- Abstract summary: This work explores the visual capabilities and limitations of foundation models by introducing a novel adversarial attack method. Our approach specifically targets images containing text, particularly mathematical formula images, which are more challenging due to their LaTeX conversion and intricate structure.
- Score: 2.729898906885749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work explores the visual capabilities and limitations of foundation models by introducing a novel adversarial attack method utilizing skeletonization to reduce the search space effectively. Our approach specifically targets images containing text, particularly mathematical formula images, which are more challenging due to their LaTeX conversion and intricate structure. We conduct a detailed evaluation of both character and semantic changes between original and adversarially perturbed outputs to provide insights into the models' visual interpretation and reasoning abilities. The effectiveness of our method is further demonstrated through its application to ChatGPT, which shows its practical implications in real-world scenarios.
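The abstract mentions evaluating character changes between the original and adversarially perturbed outputs but does not name the metric here. As a minimal sketch, assuming a normalized Levenshtein (edit) distance over the recognized LaTeX strings, such a comparison could look like the following. The function names `levenshtein` and `character_change_rate` are hypothetical and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_change_rate(original: str, perturbed: str) -> float:
    """Edit distance divided by the longer string's length (assumed
    normalization; 0.0 means identical, 1.0 means fully different)."""
    longest = max(len(original), len(perturbed))
    if longest == 0:
        return 0.0
    return levenshtein(original, perturbed) / longest


if __name__ == "__main__":
    # Illustrative strings only, not results from the paper.
    clean_output = r"\frac{a}{b}"
    attacked_output = r"\frac{a}{6}"
    print(character_change_rate(clean_output, attacked_output))
```

A character-level score like this captures only surface differences. Semantic changes, which the abstract also mentions, would need a separate measure.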
Related papers
- Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models [0.609170287691728]
We introduce a novel training paradigm to enhance the comprehension of diagrammatic images within vision-language models. Our method enables models to develop a more structured and semantically coherent understanding of diagrammatic content.
arXiv Detail & Related papers (2025-09-02T05:02:23Z)
- Conceptual Contrastive Edits in Textual and Vision-Language Retrieval [1.8591405259852054]
We employ post-hoc conceptual contrastive edits to expose noteworthy patterns and biases imprinted in representations of retrieval models. We apply these edits to explain both linguistic and visiolinguistic pre-trained models in a black-box manner. We also introduce a novel metric to assess the per-word impact of contrastive interventions on model outcomes.
arXiv Detail & Related papers (2025-03-01T10:14:28Z)
- Object-centric Binding in Contrastive Language-Image Pretraining [9.376583779399834]
We propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard-negatives. Our resulting model paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
arXiv Detail & Related papers (2025-02-19T21:30:51Z)
- Information Theoretic Text-to-Image Alignment [49.396917351264655]
Mutual Information (MI) is used to guide model alignment. Our method uses self-supervised fine-tuning and relies on a point-wise (MI) estimation between prompts and images. Our analysis indicates that our method is superior to the state-of-the-art, yet it only requires the pre-trained denoising network of the T2I model itself to estimate MI.
arXiv Detail & Related papers (2024-05-31T12:20:02Z)
- Visual In-Context Learning for Large Vision-Language Models [62.5507897575317]
In Large Visual Language Models (LVLMs) the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities.
We introduce a novel Visual In-Context Learning (VICL) method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition.
Our approach retrieves images via a "Retrieval & Rerank" paradigm, summarises images with task intent and task-specific visual parsing, and composes language-based demonstrations.
arXiv Detail & Related papers (2024-02-18T12:43:38Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in the semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored.
Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges.
We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that can see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- Auto-Encoding Variational Bayes for Inferring Topics and Visualization [2.132096006921048]
Visualization and topic modeling are widely used approaches for text analysis.
Recent approaches propose using a generative model to jointly find topics and visualization.
We present, to the best of our knowledge, the first fast Auto-Encoding Variational Bayes based inference method for jointly inferring topics and visualization.
arXiv Detail & Related papers (2020-10-19T05:57:11Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)