Related papers: Do Vision-Language Models Really Understand Visual Language?

Do Vision-Language Models Really Understand Visual Language?

URL: http://arxiv.org/abs/2410.00193v2
Date: Sat, 25 Jan 2025 00:16:53 GMT
Title: Do Vision-Language Models Really Understand Visual Language?
Authors: Yifan Hou, Buse Giledereli, Yilei Tu, Mrinmaya Sachan,
Abstract summary: Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image.<n>Recent studies suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams.<n>This paper develops a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs.
Score: 43.893398898373995
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Recent studies suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across domains to evaluate the recognition and reasoning abilities of models. Our evaluation of LVLMs shows that while they can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.

Related papers

Toward Cognitive Supersensing in Multimodal Large Language Model [67.15559571626747]
We introduce Cognitive Supersensing, a training paradigm that endows MLLMs with human-like visual imagery capabilities.<n>In experiments, MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench.<n>We will open-source the CogSense-Bench and our model weights.
arXiv Detail & Related papers (2026-02-02T02:19:50Z)
Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding [44.53837800796001]
We introduce Chimera, a test suite comprising 7,500 high-quality diagrams sourced from Wikipedia.<n>Each diagram is annotated with its symbolic content represented by semantic triples along with multi-level questions.<n>We use Chimera to measure the presence of three types of shortcuts in visual question answering.
arXiv Detail & Related papers (2025-09-26T14:55:04Z)
Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation [19.4261670152456]
We introduce a novel task of visual solution explanation, which requires generating explanations that incorporate newly introduced visual elements essential for understanding. We propose MathExplain, a benchmark consisting of 997 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that while some closed-source models demonstrate promising capabilities on visual solution-explaining, current open-source general-purpose models perform inconsistently.
arXiv Detail & Related papers (2025-04-04T06:03:13Z)
Learning Interpretable Logic Rules from Deep Vision Models [6.854329442341952]
VisionLogic is a framework to extract interpretable logic rules from deep vision models. It provides local explanations for single images and global explanations for specific classes. VisionLogic also facilitates the study of visual concepts encoded by predicates.
arXiv Detail & Related papers (2025-03-13T17:04:04Z)
A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework to dissect the perception-reasoning interface in Vision-Language Models (VLMs)<n>We propose three distinct evaluation paradigms, mirroring human problem-solving strategies.<n>Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-01-23T12:42:42Z)
Language Model as Visual Explainer [72.88137795439407]
We present a systematic approach for interpreting vision models using a tree-structured linguistic explanation. Our method provides human-understandable explanations in the form of attribute-laden trees. To access the effectiveness of our approach, we introduce new benchmarks and conduct rigorous evaluations.
arXiv Detail & Related papers (2024-12-08T20:46:23Z)
What Makes a Maze Look Like a Maze? [92.80800000328277]
We introduce Deep Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas--dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models.
arXiv Detail & Related papers (2024-09-12T16:41:47Z)
In-Context Learning Improves Compositional Understanding of Vision-Language Models [2.762909189433944]
compositional image understanding remains a rather difficult task due to the object bias present in training data. We compare contrastive models with generative ones and analyze their differences in architecture, pre-training data, and training tasks and losses. Our proposed approach outperforms baseline models across multiple compositional understanding datasets.
arXiv Detail & Related papers (2024-07-22T09:03:29Z)
Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models [69.79709804046325]
We introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination. R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension. We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object.
arXiv Detail & Related papers (2024-06-24T08:42:42Z)
Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning? An Extensive Investigation into the Capabilities and Limitations of LVLMs [11.19928977117624]
Natural language is a powerful complementary modality of communication for data visualizations, such as bar and line charts. Various downstream tasks have been introduced recently such as chart question answering, chart summarization, and fact-checking with charts. These tasks pose a unique challenge, demanding both vision-language reasoning and a nuanced understanding of chart data tables, visual encodings, and natural language prompts. This paper presents the first comprehensive evaluation of the recently developed large vision language models (LVLMs) for chart understanding and reasoning tasks.
arXiv Detail & Related papers (2024-06-01T01:43:30Z)
RelationVLM: Making Large Vision-Language Models Understand Visual Relations [66.70252936043688]
We present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations whether across multiple images or within a video. Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to bestow RelationVLM with the capabilities of understanding semantic relations.
arXiv Detail & Related papers (2024-03-19T15:01:19Z)
Enhance Reasoning Ability of Visual-Language Models via Large Language Models [7.283533791778359]
We propose a method called TReE, which transfers the reasoning ability of a large language model to a visual language model in zero-shot scenarios. TReE contains three stages: observation, thinking, and re-thinking.
arXiv Detail & Related papers (2023-05-22T17:33:44Z)
See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning. IPVR contains three stages, see, think and confirm. We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)
RL-CSDia: Representation Learning of Computer Science Diagrams [25.66215925641988]
We construct a novel dataset of graphic diagrams named Computer Science Diagrams (CSDia) It contains more than 1,200 diagrams and exhaustive annotations of objects and relations. Considering the visual noises caused by the various expressions in diagrams, we introduce the topology of diagrams to parse topological structure.
arXiv Detail & Related papers (2021-03-10T07:01:07Z)
Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy. Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation. Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.