Related papers: Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?

Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?

URL: http://arxiv.org/abs/2410.19546v2
Date: Tue, 18 Feb 2025 14:38:12 GMT
Title: Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?
Authors: Antonia Wüst, Tim Tobiasch, Lukas Helff, Inga Ibs, Wolfgang Stammer, Devendra S. Dhami, Constantin A. Rothkopf, Kristian Kersting,
Abstract summary: We evaluate newly developed Vision-Language Models (VLMs)<n>We show that while VLMs occasionally succeed in identifying discriminative concepts, they frequently falter.<n>We observe that a significant gap remains between human visual reasoning capabilities and machine cognition.
Score: 20.5633138423677
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, newly developed Vision-Language Models (VLMs), such as OpenAI's o1, have emerged, seemingly demonstrating advanced reasoning capabilities across text and image modalities. However, the depth of these advances in language-guided perception and abstract reasoning remains underexplored, and it is unclear whether these models can truly live up to their ambitious promises. To assess the progress and identify shortcomings, we enter the wonderland of Bongard problems, a set of classic visual reasoning puzzles that require human-like abilities of pattern recognition and abstract reasoning. With our extensive evaluation setup, we show that while VLMs occasionally succeed in identifying discriminative concepts and solving some of the problems, they frequently falter. Surprisingly, even elementary concepts that may seem trivial to humans, such as simple spirals, pose significant challenges. Moreover, when explicitly asked to recognize ground truth concepts, they continue to falter, suggesting not only a lack of understanding of these elementary visual concepts but also an inability to generalize to unseen concepts. We compare the results of VLMs to human performance and observe that a significant gap remains between human visual reasoning capabilities and machine cognition.

Related papers

Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models [51.900488744931785]
We introduce the Visual Graph Arena (VGA) to evaluate and improve AI systems' capacity for visual abstraction.<n>Humans achieve near-perfect accuracy across tasks, while models totally failed on isomorphism detection and showed limited success in path/cycle tasks.<n>By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models.
arXiv Detail & Related papers (2025-06-06T17:06:25Z)
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search [89.43196232124883]
VisuoThink is a novel framework that seamlessly integrates visuospatial and linguistic domains. It enables progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search.
arXiv Detail & Related papers (2025-04-12T08:37:30Z)
A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework using Bongard Problems (BPs) to dissect the perception-reasoning interface in Vision-Language Models (VLMs) We propose three distinct evaluation paradigms, mirroring human problem-solving strategies. Our framework provides a valuable diagnostic tool, highlighting the need to enhance visual processing fidelity for achieving more robust and human-like visual intelligence in AI.
arXiv Detail & Related papers (2025-01-23T12:42:42Z)
Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts? [62.984473889987605]
We present a zero-shot framework for fine-grained visual concept learning by leveraging large language model and Visual Question Answering (VQA) system. We pose these questions along with the query image to a VQA system and aggregate the answers to determine the presence or absence of an object in the test images. Our experiments demonstrate comparable performance with existing zero-shot visual classification methods and few-shot concept learning approaches.
arXiv Detail & Related papers (2024-10-17T15:16:10Z)
Do Vision-Language Models Really Understand Visual Language? [43.893398898373995]
Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. Recent studies suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. This paper develops a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs.
arXiv Detail & Related papers (2024-09-30T19:45:11Z)
What Makes a Maze Look Like a Maze? [92.80800000328277]
We introduce Deep Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas--dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models.
arXiv Detail & Related papers (2024-09-12T16:41:47Z)
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns [69.17409440805498]
We evaluate large multimodal models with abstract patterns based on fundamental concepts. We find that they are not able to generalize well to simple abstract patterns. Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities.
arXiv Detail & Related papers (2024-03-20T05:37:24Z)
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World [57.832261258993526]
Bongard-OpenWorld is a new benchmark for evaluating real-world few-shot reasoning for machine vision. It already imposes a significant challenge to current few-shot reasoning algorithms.
arXiv Detail & Related papers (2023-10-16T09:19:18Z)
Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning [78.13740873213223]
Bongard problems (BPs) were introduced as an inspirational challenge for visual cognition in intelligent systems. We propose a new benchmark Bongard-LOGO for human-level concept learning and reasoning.
arXiv Detail & Related papers (2020-10-02T03:19:46Z)
Multi-Granularity Modularized Network for Abstract Visual Reasoning [15.956555435408557]
We focus on the Raven Progressive Matrices Test, designed to measure cognitive reasoning. Inspired by cognitive studies, we propose a Multi-Granularity Modularized Network (MMoN) to bridge the gap between the processing of raw sensory information and symbolic reasoning.
arXiv Detail & Related papers (2020-07-09T09:54:05Z)
Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception. We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception. On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z)
Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense [142.53911271465344]
We argue that the next generation of AI must embrace "dark" humanlike common sense for solving novel tasks. We identify functionality, physics, intent, causality, and utility (FPICU) as the five core domains of cognitive AI with humanlike common sense.
arXiv Detail & Related papers (2020-04-20T04:07:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.