Oedipus and the Sphinx: Benchmarking and Improving Visual Language Models for Complex Graphic Reasoning
- URL: http://arxiv.org/abs/2508.00323v1
- Date: Fri, 01 Aug 2025 05:12:38 GMT
- Title: Oedipus and the Sphinx: Benchmarking and Improving Visual Language Models for Complex Graphic Reasoning
- Authors: Jianyi Zhang, Xu Ji, Ziyin Zhou, Yuchen Zhou, Shubo Shi, Haoyu Wu, Zhen Li, Shizhao Liu,
- Abstract summary: We propose ReasonBench to evaluate the performance of visual language models (VLMs) in graphic reasoning tasks. ReasonBench includes 1,613 questions from real-world intelligence tests. We benchmark 11 mainstream VLMs and reveal significant limitations of current models.
- Score: 14.984593408786045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the performance of visual language models (VLMs) on graphic reasoning tasks has become an important research topic. However, VLMs still show clear deficiencies in matching human-level graphic reasoning, especially in complex graphic reasoning and abstract problem solving; these areas remain understudied, and existing work focuses only on simple graphics. To evaluate VLMs on complex graphic reasoning, we propose ReasonBench, the first evaluation benchmark focused on structured graphic reasoning tasks, comprising 1,613 questions drawn from real-world intelligence tests. ReasonBench covers reasoning dimensions related to location, attribute, quantity, and multi-element tasks, providing a comprehensive evaluation of VLMs' spatial, relational, and abstract reasoning capabilities. We benchmark 11 mainstream VLMs (both closed-source and open-source) and reveal significant limitations of current models. Based on these findings, we propose a dual optimization strategy: the Diagrammatic Reasoning Chain (DiaCoT) improves the interpretability of reasoning by decomposing it into layers, and ReasonTune improves the task adaptability of model reasoning through training; together these improve VLM performance by 33.5%. All experimental data and code are in the repository: https://huggingface.co/datasets/cistine/ReasonBench.
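Since the abstract cites a Hugging Face repository for the benchmark data, here is a minimal sketch of how one might load ReasonBench with the `datasets` library and tally multiple-choice accuracy. The split name, the `answer` field, and the placeholder `dummy_model` are assumptions for illustration, not the paper's actual schema or evaluation code; inspect the printed features to see the real column names.

```python
# Minimal sketch: pull ReasonBench from the Hugging Face Hub and score
# multiple-choice predictions. Field names used below ("answer") are
# assumptions; check split.features for the actual schema.
from datasets import load_dataset

ds = load_dataset("cistine/ReasonBench")   # repository cited in the abstract
split = ds[list(ds.keys())[0]]             # take the first available split
print(split.features)                      # inspect the real column names

def dummy_model(example):
    """Hypothetical stand-in for a VLM call; always picks option 0."""
    return 0

correct = 0
for example in split:
    pred = dummy_model(example)            # index of the chosen option
    if pred == example.get("answer"):      # assumed ground-truth field
        correct += 1
print(f"accuracy: {correct / len(split):.3f}")
```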
Related papers
- Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning [105.25503508433758]
We introduce Zebra-CoT, a diverse large-scale dataset with 182,384 samples. We focus on four categories of tasks where sketching or visual reasoning is especially natural. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains.
arXiv Detail & Related papers (2025-07-22T16:35:36Z)
- VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning [66.84770041828462]
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance on many reasoning tasks. Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. First, we propose VisuRiddles, a benchmark for AVR featuring tasks meticulously constructed to assess models' reasoning capacities. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions.
arXiv Detail & Related papers (2025-06-03T07:24:00Z)
- Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z)
- GraphOmni: A Comprehensive and Extendable Benchmark Framework for Large Language Models on Graph-theoretic Tasks [26.992997870540435]
GraphOmni is a benchmark to evaluate the reasoning capabilities of LLMs on graph-theoretic tasks articulated in natural language. We identify critical interactions among graph types, serialization formats, and prompting schemes, demonstrating their substantial impact on model performance. We propose a reinforcement learning-inspired framework that adaptively selects the optimal factors influencing LLM reasoning capabilities.
arXiv Detail & Related papers (2025-04-17T09:01:16Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles [91.88062410741833]
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning. We show that OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
- iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs [4.381263829108405]
Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. We introduce iVISPAR, an interactive multi-modal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents.
arXiv Detail & Related papers (2025-02-05T14:29:01Z)
- A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework to dissect the perception-reasoning interface in Vision-Language Models (VLMs). We propose three distinct evaluation paradigms, mirroring human problem-solving strategies. Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-01-23T12:42:42Z)
- Intriguing Properties of Large Language and Vision Models [18.449076451976236]
Large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance.
Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks remains surprisingly low.
We investigate this question by evaluating the most common LLVM families (i.e., LLaVA) across 10 evaluation benchmarks.
arXiv Detail & Related papers (2024-10-07T05:07:01Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)