Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models
- URL: http://arxiv.org/abs/2506.06242v1
- Date: Fri, 06 Jun 2025 17:06:25 GMT
- Title: Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models
- Authors: Zahra Babaiee, Peyman M. Kiasari, Daniela Rus, Radu Grosu
- Abstract summary: We introduce the Visual Graph Arena (VGA) to evaluate and improve AI systems' capacity for visual abstraction. Humans achieve near-perfect accuracy across tasks, while models fail entirely on isomorphism detection and show only limited success on path/cycle tasks. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models.
- Score: 51.900488744931785
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet a critical gap persists in 'conceptualization': the ability to recognize and reason about the same concept despite variations in visual form, a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models failed entirely on isomorphism detection and showed only limited success in path/cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at https://vga.csail.mit.edu/.
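To make the representation-invariance idea concrete, here is a minimal sketch (not the official VGA generation code), assuming networkx, scipy (used by the Kamada-Kawai optimizer), and matplotlib, with a toy graph of my own choosing: the same abstract graph is rendered with two different layouts, so any structural question about it, such as isomorphism to a relabeled copy, must have the same answer for both drawings.

```python
# Illustrative sketch only: one graph, two layouts, layout-independent structure.
import networkx as nx
import matplotlib.pyplot as plt

# A small planar graph (C6 plus two chords); the benchmark's real tasks use its own graphs.
G = nx.cycle_graph(6)
G.add_edges_from([(0, 3), (1, 4)])

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
layouts = [("Kamada-Kawai", nx.kamada_kawai_layout(G)),
           ("planar", nx.planar_layout(G))]  # planar_layout requires a planar graph
for ax, (name, pos) in zip(axes, layouts):
    nx.draw_networkx(G, pos=pos, ax=ax, node_color="lightblue")
    ax.set_title(f"{name} layout")
    ax.axis("off")
fig.savefig("same_graph_two_layouts.png")

# Any structural query has the same answer regardless of how G was drawn,
# e.g. isomorphism to a relabeled copy:
H = nx.relabel_nodes(G, {i: chr(ord("a") + i) for i in G.nodes})
print(nx.is_isomorphic(G, H))  # True
```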
Related papers
- VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs [18.349695067647012]
Visual Language Models excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple tests. We present an evaluation that tests vision-language models' capacity for nonlocal visual reasoning. Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.
arXiv Detail & Related papers (2025-07-04T23:15:52Z) - MiCo: Multi-image Contrast for Reinforcement Visual Reasoning [72.81576836419373]
Chain-of-Thought (CoT) reasoning can be used to link visual cues across multiple images. We adapt rule-based reinforcement learning for Vision-Language Models (VLMs). Our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
arXiv Detail & Related papers (2025-06-27T17:59:27Z) - ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language model benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z) - Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions [0.03495246564946555]
We introduce a novel task called Illusory VQA, along with four specialized datasets: IllusionMNIST, IllusionFashionMNIST, IllusionAnimals, and IllusionChar. These datasets are designed to evaluate the performance of state-of-the-art multimodal models in recognizing and interpreting visual illusions.
arXiv Detail & Related papers (2024-12-11T07:51:18Z) - VAGUE: Visual Contexts Clarify Ambiguous Expressions [15.140825578254324]
We introduce VAGUE, a benchmark evaluating multimodal AI systems' ability to integrate visual context to disambiguate intent. VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations. Our experiments reveal that existing multimodal AI models struggle to infer the speaker's true intent.
arXiv Detail & Related papers (2024-11-21T14:01:42Z) - Fill in the blanks: Rethinking Interpretability in vision [0.0]
We rethink vision-model explainability from a novel perspective: probing the general input structure that a model has learnt during training.
Experiments on standard vision datasets and pre-trained models reveal consistent patterns, which could be integrated as an additional model-agnostic explainability tool.
arXiv Detail & Related papers (2024-11-15T15:31:06Z) - PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns [69.17409440805498]
We evaluate large multimodal models with abstract patterns based on fundamental concepts.
We find that they are not able to generalize well to simple abstract patterns.
Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities.
arXiv Detail & Related papers (2024-03-20T05:37:24Z) - Visual Enumeration Remains Challenging for Multimodal Generative AI [0.08192907805418582]
It has been observed that even state-of-the-art AI systems have very limited enumeration skills. We consider popular visual question answering models (BLIP, LLaVA and ViLT) as well as advanced image-to-text AI systems (Gemini, GPT and Qwen). Our analyses show that even the most advanced models cannot reliably name the number of objects in simple visual stimuli or generate images containing a target number of items.
arXiv Detail & Related papers (2024-01-09T18:18:32Z) - Look, Remember and Reason: Grounded reasoning in videos with language models [5.3445140425713245]
Multimodal language models (LMs) have recently shown promising performance in high-level reasoning tasks on videos.
We propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities.
We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets.
arXiv Detail & Related papers (2023-06-30T16:31:14Z) - Localization vs. Semantics: Visual Representations in Unimodal and
Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we interconnect the scene graph and the concept graph through a super node that represents the QA context (a toy version of this joint-graph construction is sketched after this list).
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
arXiv Detail & Related papers (2022-05-23T17:55:34Z) - A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z) - Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
arXiv Detail & Related papers (2020-08-31T23:25:01Z)
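As referenced in the VQA-GNN entry above, the super-node construction can be illustrated with a toy example. This is a hypothetical sketch with made-up node names, not the authors' implementation; for simplicity the QA-context node is linked to every node of both graphs, whereas the actual model may wire it more selectively.

```python
# Toy illustration: join a scene graph and a concept graph through one
# "QA-context" super node (hypothetical node names, not the paper's data).
import networkx as nx

scene = nx.Graph()                       # objects/relations detected in the image
scene.add_edges_from([("person", "umbrella"), ("person", "street")])

concept = nx.Graph()                     # background/commonsense knowledge
concept.add_edges_from([("umbrella", "rain"), ("rain", "wet")])

# Disjoint union with prefixes so the two "umbrella" nodes stay distinct,
# then one super node connected to everything (a simplification).
joint = nx.union(scene, concept, rename=("scene:", "concept:"))
joint.add_node("qa_context")
others = [n for n in joint.nodes if n != "qa_context"]
joint.add_edges_from(("qa_context", n) for n in others)

print(sorted(joint["qa_context"]))       # the super node touches both subgraphs
```

A GNN message-passing pass over `joint` would then let information flow between the scene and concept views through the shared context node, which is one way to read the "bidirectional fusion" described in the summary.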