InfoCausalQA: Can Models Perform Non-explicit Causal Reasoning Based on Infographic?
- URL: http://arxiv.org/abs/2508.06220v2
- Date: Wed, 13 Aug 2025 07:02:35 GMT
- Title: InfoCausalQA: Can Models Perform Non-explicit Causal Reasoning Based on Infographic?
- Authors: Keummin Ka, Junhyeong Park, Jaehyun Jeon, Youngjae Yu
- Abstract summary: We introduce InfoCausalQA, a novel benchmark designed to evaluate causal reasoning grounded in infographics. The benchmark comprises two tasks: Task 1 focuses on quantitative causal reasoning based on inferred numerical trends, while Task 2 targets semantic causal reasoning involving five types of causal relations. Our experimental results reveal that current Vision-Language Models exhibit limited capability in computational reasoning and even more pronounced limitations in semantic causal reasoning.
- Score: 14.443840118369176
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in Vision-Language Models (VLMs) have demonstrated impressive capabilities in perception and reasoning. However, the ability to perform causal inference -- a core aspect of human cognition -- remains underexplored, particularly in multimodal settings. In this study, we introduce InfoCausalQA, a novel benchmark designed to evaluate causal reasoning grounded in infographics that combine structured visual data with textual context. The benchmark comprises two tasks: Task 1 focuses on quantitative causal reasoning based on inferred numerical trends, while Task 2 targets semantic causal reasoning involving five types of causal relations: cause, effect, intervention, counterfactual, and temporal. We manually collected 494 infographic-text pairs from four public sources and used GPT-4o to generate 1,482 high-quality multiple-choice QA pairs. These questions were then carefully revised by humans to ensure they cannot be answered based on surface-level cues alone but instead require genuine visual grounding. Our experimental results reveal that current VLMs exhibit limited capability in computational reasoning and even more pronounced limitations in semantic causal reasoning. Their significantly lower performance compared to humans indicates a substantial gap in leveraging infographic-based information for causal inference. Through InfoCausalQA, we highlight the need for advancing the causal reasoning abilities of multimodal AI systems.
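To make the evaluation setup concrete, here is a minimal sketch of how a model could be scored on multiple-choice QA pairs like those described above. The JSONL layout, the field names (image, question, choices, answer), and the ask_model wrapper are illustrative assumptions, not the authors' released interface.

```python
# Minimal sketch: scoring a vision-language model on infographic
# multiple-choice QA. File layout and field names are assumptions,
# not the benchmark's actual release format.
import json
from pathlib import Path
from typing import Callable

def evaluate(qa_path: Path, ask_model: Callable[[Path, str], str]) -> float:
    """Return accuracy of ask_model over a JSONL file of QA items."""
    correct = total = 0
    with qa_path.open() as f:
        for line in f:
            item = json.loads(line)  # hypothetical fields: image, question, choices, answer
            options = "\n".join(
                f"({letter}) {choice}"
                for letter, choice in zip("ABCD", item["choices"])
            )
            prompt = f"{item['question']}\n{options}\nAnswer with a single letter."
            pred = ask_model(Path(item["image"]), prompt).strip().upper()[:1]
            correct += pred == item["answer"]  # gold answer as a letter, e.g. "B"
            total += 1
    return correct / total if total else 0.0
```

Any wrapper that maps an image path and a prompt to a single answer letter can be plugged in as ask_model; the resulting accuracy can then be compared against the human performance the abstract reports.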
Related papers
- Inverse Scaling in Test-Time Compute [51.16323216811257]
Extending the reasoning length of Large Reasoning Models (LRMs) degrades performance. We identify five distinct failure modes when models reason for longer. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns.
arXiv Detail & Related papers (2025-07-19T00:06:13Z) - What's Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning [26.671128120554457]
Causal reasoning is fundamental to solving complex high-level reasoning tasks. Existing benchmarks often mix causal questions with other types of reasoning. We introduce VQA-Causal and VCR-Causal to isolate and rigorously evaluate causal reasoning abilities.
arXiv Detail & Related papers (2025-06-01T07:17:46Z) - Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z) - ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models [48.994853869901974]
We conduct a case study using a synthetic dataset solvable only through visual reasoning. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions. Although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%.
arXiv Detail & Related papers (2025-05-19T17:59:27Z) - VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories. These questions assess the visual reasoning capabilities of MLLMs from multiple perspectives. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z) - CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data [10.435321637846142]
We introduce Causal3D, a novel benchmark that integrates structured data (tables) with corresponding visual representations (images) to evaluate causal reasoning. Causal3D comprises 19 3D-scene datasets capturing diverse causal relations, views, and backgrounds.
arXiv Detail & Related papers (2025-03-06T03:40:01Z) - CELLO: Causal Evaluation of Large Vision-Language Models [9.928321287432365]
Causal reasoning is fundamental to human intelligence and crucial for effective decision-making in real-world environments.
We introduce a fine-grained and unified definition of causality involving interactions between humans and objects.
We construct a novel dataset, CELLO, consisting of 14,094 causal questions across all four levels of causality.
arXiv Detail & Related papers (2024-06-27T12:34:52Z) - Prompting or Fine-tuning? Exploring Large Language Models for Causal Graph Validation [0.0]
This study explores the capability of Large Language Models to evaluate causality in causal graphs. Our study compares two approaches: (1) a prompting-based method for zero-shot and few-shot causal inference, and (2) fine-tuning language models for the causal relation prediction task (a brief prompting sketch appears after this list).
arXiv Detail & Related papers (2024-05-29T09:06:18Z) - Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification [41.330719056639616]
We study the problem of entailment verification with multi-sentence premises.
Modern NLP problems, such as detecting inconsistent model-generated rationales, require complex multi-hop reasoning.
arXiv Detail & Related papers (2024-02-06T04:14:09Z) - Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering [58.64831511644917]
We introduce an interpretable-by-design model that factors model decisions into intermediate human-legible explanations.
We show that our inherently interpretable system improves by 4.64% over a comparable black-box system on reasoning-focused questions.
arXiv Detail & Related papers (2023-05-24T08:33:15Z) - A Song of (Dis)agreement: Evaluating the Evaluation of Explainable Artificial Intelligence in Natural Language Processing [7.527234046228323]
We argue that the community should stop using rank correlation as an evaluation metric for attention-based explanations.
We find that attention-based explanations do not correlate strongly with any recent feature attribution methods.
arXiv Detail & Related papers (2022-05-09T21:07:39Z) - Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z)
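As a companion to the "Prompting or Fine-tuning?" entry above, here is a minimal sketch of what the zero-shot prompting approach to validating a single causal-graph edge might look like. The prompt wording and the query_llm stub are illustrative assumptions, not the paper's actual setup.

```python
# Illustrative sketch of zero-shot causal-edge validation via prompting.
# query_llm is a hypothetical stand-in for any chat-completion call.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def edge_is_causal(cause: str, effect: str) -> bool:
    """Ask the model whether a directed causal edge is plausible."""
    prompt = (
        f"Does '{cause}' directly cause '{effect}'? "
        "Answer strictly with Yes or No."
    )
    return query_llm(prompt).strip().lower().startswith("yes")

# Usage: validate a candidate causal graph edge by edge, e.g.
# edges = [("smoking", "lung cancer"), ("ice cream sales", "drowning")]
# kept = [e for e in edges if edge_is_causal(*e)]
```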