Related papers: Reasoning in Computer Vision: Taxonomy, Models, Tasks, and Methodologies

URL: http://arxiv.org/abs/2508.10523v1
Date: Thu, 14 Aug 2025 10:53:35 GMT
Title: Reasoning in Computer Vision: Taxonomy, Models, Tasks, and Methodologies
Authors: Ayushman Sarkar, Mohd Yamani Idna Idris, Zhenyu Yu
Abstract summary: This survey categorizes visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense). We review evaluation protocols designed to assess functional correctness, structural consistency, and causal validity, and critically analyze their limitations in terms of generalizability, reproducibility, and explanatory power.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual reasoning is critical for a wide range of computer vision tasks that go beyond surface-level object detection and classification. Despite notable advances in relational, symbolic, temporal, causal, and commonsense reasoning, existing surveys often address these directions in isolation, lacking a unified analysis and comparison across reasoning types, methodologies, and evaluation protocols. This survey aims to address this gap by categorizing visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense) and systematically examining their implementation through architectures such as graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems. We review evaluation protocols designed to assess functional correctness, structural consistency, and causal validity, and critically analyze their limitations in terms of generalizability, reproducibility, and explanatory power. Beyond evaluation, we identify key open challenges in visual reasoning, including scalability to complex scenes, deeper integration of symbolic and neural paradigms, the lack of comprehensive benchmark datasets, and reasoning under weak supervision. Finally, we outline a forward-looking research agenda for next-generation vision systems, emphasizing that bridging perception and reasoning is essential for building transparent, trustworthy, and cross-domain adaptive AI systems, particularly in critical domains such as autonomous driving and medical diagnostics.

Related papers

Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics [89.1999907891494]
We present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies.
arXiv Detail & Related papers (2025-10-01T07:59:03Z)
Explaining What Machines See: XAI Strategies in Deep Object Detection Models [0.0]
Explainable Artificial Intelligence (XAI) aims to make model decisions more transparent, interpretable, and trustworthy for humans. This review provides a comprehensive analysis of state-of-the-art explainability methods specifically applied to object detection models.
arXiv Detail & Related papers (2025-09-02T06:16:30Z)
Explain Before You Answer: A Survey on Compositional Visual Reasoning [74.27548620675748]
Compositional visual reasoning has emerged as a key research frontier in multimodal AI. This survey systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception.
arXiv Detail & Related papers (2025-08-24T11:01:51Z)
Hyperspectral Imaging [49.45523645429475]
Hyperspectral imaging (HSI) is an advanced sensing modality that simultaneously captures spatial and spectral information. This Primer presents a comprehensive overview of HSI, from the underlying physical principles and sensor architectures to key steps in data acquisition, calibration, and correction.
arXiv Detail & Related papers (2025-08-11T15:47:24Z)
Anomaly Detection and Generation with Diffusion Models: A Survey [51.61574868316922]
Anomaly detection (AD) plays a pivotal role across diverse domains, including cybersecurity, finance, healthcare, and industrial manufacturing. Recent advancements in deep learning, specifically diffusion models (DMs), have sparked significant interest. This survey aims to guide researchers and practitioners in leveraging DMs for innovative AD solutions across diverse applications.
arXiv Detail & Related papers (2025-06-11T03:29:18Z)
MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence [14.694404760882986]
MIRAGE is a benchmark designed to evaluate models' capabilities in Counting (object attribute recognition), Relation (spatial relational reasoning), and Counting with Relation. By targeting these foundational abilities, MIRAGE provides a pathway toward spatiotemporal reasoning in future research.
arXiv Detail & Related papers (2025-05-15T16:08:14Z)
A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework to dissect the perception-reasoning interface in Vision-Language Models (VLMs). We propose three distinct evaluation paradigms, mirroring human problem-solving strategies. Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-01-23T12:42:42Z)
Fairness meets Cross-Domain Learning: a new perspective on Models and Metrics [80.07271410743806]
We study the relationship between cross-domain learning (CD) and model fairness. We introduce a benchmark on face and medical images spanning several demographic groups as well as classification and localization tasks. Our study covers 14 CD approaches alongside three state-of-the-art fairness algorithms and shows how the former can outperform the latter.
arXiv Detail & Related papers (2023-03-25T09:34:05Z)
Causal Reasoning Meets Visual Representation Learning: A Prospective Study [117.08431221482638]
Lack of interpretability, robustness, and out-of-distribution generalization are becoming the challenges of the existing visual models. Inspired by the strong inference ability of human-level agents, recent years have witnessed great effort in developing causal reasoning paradigms. This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, bring to the forefront the urgency of developing novel causal reasoning methods.
arXiv Detail & Related papers (2022-04-26T02:22:28Z)
Fairness Indicators for Systematic Assessments of Visual Feature Extractors [21.141633753573764]
We propose three fairness indicators, which aim at quantifying harms and biases of visual systems. Our indicators use existing publicly available datasets collected for fairness evaluations. These indicators are not intended to be a substitute for a thorough analysis of the broader impact of the new computer vision technologies.
arXiv Detail & Related papers (2022-02-15T17:45:33Z)
Automatic Gaze Analysis: A Survey of Deep Learning based Approaches [61.32686939754183]
Eye gaze analysis is an important research problem in the field of computer vision and Human-Computer Interaction. There are several open questions including what are the important cues to interpret gaze direction in an unconstrained environment. We review the progress across a range of gaze analysis tasks and applications to shed light on these fundamental questions.
arXiv Detail & Related papers (2021-08-12T00:30:39Z)
Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning. It aims to extract both the common information and the complementary information in an adversarial setting. In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
arXiv Detail & Related papers (2021-02-15T18:46:44Z)
