Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding
- URL: http://arxiv.org/abs/2509.22437v1
- Date: Fri, 26 Sep 2025 14:55:04 GMT
- Title: Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding
- Authors: Ziheng Chi, Yifan Hou, Chenxi Pang, Shaobo Cui, Mubashara Akhtar, Mrinmaya Sachan
- Abstract summary: We introduce Chimera, a test suite comprising 7,500 high-quality diagrams sourced from Wikipedia. Each diagram is annotated with its symbolic content represented by semantic triples along with multi-level questions. We use Chimera to measure the presence of three types of shortcuts in visual question answering.
- Score: 44.53837800796001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diagrams convey symbolic information in a visual format rather than a linear stream of words, making them especially challenging for AI models to process. While recent evaluations suggest that vision-language models (VLMs) perform well on diagram-related benchmarks, their reliance on knowledge, reasoning, or modality shortcuts raises concerns about whether they genuinely understand and reason over diagrams. To address this gap, we introduce Chimera, a comprehensive test suite comprising 7,500 high-quality diagrams sourced from Wikipedia; each diagram is annotated with its symbolic content represented by semantic triples along with multi-level questions designed to assess four fundamental aspects of diagram comprehension: entity recognition, relation understanding, knowledge grounding, and visual reasoning. We use Chimera to measure the presence of three types of shortcuts in visual question answering: (1) the visual-memorization shortcut, where VLMs rely on memorized visual patterns; (2) the knowledge-recall shortcut, where models leverage memorized factual knowledge instead of interpreting the diagram; and (3) the Clever-Hans shortcut, where models exploit superficial language patterns or priors without true comprehension. We evaluate 15 open-source VLMs from 7 model families on Chimera and find that their seemingly strong performance largely stems from shortcut behaviors: visual-memorization shortcuts have only a slight impact, knowledge-recall shortcuts play a moderate role, and Clever-Hans shortcuts contribute significantly. These findings expose critical limitations in current VLMs and underscore the need for more robust evaluation protocols that benchmark genuine comprehension of complex visual inputs (e.g., diagrams) rather than question-answering shortcuts.
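The abstract does not spell out the exact evaluation protocol, but one way to read the Clever-Hans diagnosis is as the gap between with-image and question-only accuracy: if a model scores nearly as well without seeing the diagram, its answers lean on language priors rather than the image. The sketch below is a minimal, hypothetical illustration of that idea; the `DiagramExample` layout, the `ask` callback, and the exact-match metric are assumptions for illustration, not the released Chimera code or data format.

```python
# Hypothetical sketch: estimate how much of a VLM's score survives when the
# diagram is withheld, as a proxy for the "Clever-Hans" (language-prior)
# shortcut described in the abstract. All names here are illustrative.
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class DiagramExample:
    """One Chimera-style item: a diagram, its symbolic content, and a question."""
    image_path: str                      # path to the Wikipedia diagram
    triples: List[Tuple[str, str, str]]  # (subject, relation, object) annotations
    question: str                        # one of the multi-level questions
    answer: str                          # gold answer
    level: str = "entity"                # entity / relation / knowledge / reasoning


def accuracy(examples: List[DiagramExample],
             ask: Callable[[Optional[str], str], str],
             use_image: bool) -> float:
    """Exact-match accuracy; `ask(image_path_or_None, question)` wraps some VLM."""
    correct = 0
    for ex in examples:
        image = ex.image_path if use_image else None
        pred = ask(image, ex.question).strip().lower()
        correct += int(pred == ex.answer.strip().lower())
    return correct / max(len(examples), 1)


def clever_hans_gap(examples: List[DiagramExample],
                    ask: Callable[[Optional[str], str], str]) -> float:
    """Full accuracy minus question-only accuracy.

    A small gap means most of the score is recoverable from the question text
    alone, i.e. the model leans on priors rather than reading the diagram.
    """
    return accuracy(examples, ask, use_image=True) - accuracy(examples, ask, use_image=False)


if __name__ == "__main__":
    # Toy "model" that ignores the image entirely -> gap of 0 (pure shortcut).
    demo = [DiagramExample("water_cycle.png",
                           [("evaporation", "feeds", "condensation")],
                           "What process feeds condensation?", "evaporation",
                           level="relation")]
    blind = lambda image, question: "evaporation"
    print(f"Clever-Hans gap: {clever_hans_gap(demo, blind):.2f}")
```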
Related papers
- Proactive Agentic Whiteboards: Enhancing Diagrammatic Learning [1.338174941551702]
We introduce DrawDash, an AI-powered whiteboard assistant that proactively completes and refines educational diagrams through multimodal understanding. We demonstrate DrawDash across four diverse teaching scenarios, spanning topics from computer science and web development to biology. This work represents an early exploration into reducing instructors' cognitive load and improving diagram-based pedagogy through real-time, speech-driven visual assistance, and concludes with a discussion of current limitations and directions for formal classroom evaluation.
arXiv Detail & Related papers (2025-12-01T03:20:12Z)
- ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding [18.67532755744138]
Automated chart understanding poses significant challenges to existing multimodal large language models. Current step-by-step reasoning models primarily focus on text-based logical reasoning for chart understanding. We propose ChartSketcher, a multimodal feedback-driven step-by-step reasoning method designed to address these limitations.
arXiv Detail & Related papers (2025-05-25T10:21:29Z)
- Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z)
- Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation [19.4261670152456]
We introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints, such as auxiliary lines, points, and angles, and generate explanations that incorporate these key elements essential for understanding. Our empirical results show that, aside from recent large-scale open-source and closed-source models, most generalist open-source models, and even math-specialist models, struggle with the multimodal solution explanation task. This highlights a significant gap in current LLMs' ability to reason and explain with visual grounding in educational contexts.
arXiv Detail & Related papers (2025-04-04T06:03:13Z)
- Distill Visual Chart Reasoning Ability from LLMs to MLLMs [64.32993770646165]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs). We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient, and scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs. ReachQA is a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs to enhance both recognition and reasoning abilities of MLLMs.
arXiv Detail & Related papers (2024-10-24T14:50:42Z)
- Do Vision-Language Models Really Understand Visual Language? [43.893398898373995]
Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. Recent studies suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. This paper develops a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs.
arXiv Detail & Related papers (2024-09-30T19:45:11Z)
- SADL: An Effective In-Context Learning Method for Compositional Visual QA [22.0603596548686]
Large vision-language models (LVLMs) offer a novel capability for performing in-context learning (ICL) in Visual QA.
This paper introduces SADL, a new visual-linguistic prompting framework for the task.
arXiv Detail & Related papers (2024-07-02T06:41:39Z)
- Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN)
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z)
- See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages, see, think and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)
- Why Machine Reading Comprehension Models Learn Shortcuts? [56.629192589376046]
We argue that a larger proportion of shortcut questions in the training data makes models rely excessively on shortcut tricks.
A thorough empirical analysis shows that MRC models tend to learn shortcut questions earlier than challenging questions.
arXiv Detail & Related papers (2021-06-02T08:43:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.