KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models
- URL: http://arxiv.org/abs/2407.17773v3
- Date: Wed, 05 Mar 2025 03:07:12 GMT
- Title: KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models
- Authors: Eunice Yiu, Maan Qraitem, Anisa Noor Majhi, Charlie Wong, Yutong Bai, Shiry Ginosar, Alison Gopnik, Kate Saenko
- Abstract summary: This paper investigates visual analogical reasoning in large multimodal models (LMMs) compared to human adults and children. We propose a new benchmark of 4,300 visual transformations of everyday objects to test LMMs on visual analogical reasoning.
- Score: 43.86823330035457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates visual analogical reasoning in large multimodal models (LMMs) compared to human adults and children. A "visual analogy" is an abstract rule inferred from one image and applied to another. While benchmarks exist for testing visual reasoning in LMMs, they require advanced skills and omit basic visual analogies that even young children can make. Inspired by developmental psychology, we propose a new benchmark of 4,300 visual transformations of everyday objects to test LMMs on visual analogical reasoning and compare them to children (ages three to five) and to adults. We structure the evaluation into three stages: identifying what changed (e.g., color, number, etc.), how it changed (e.g., added one object), and applying the rule to new scenarios. Our findings show that while GPT-o1, GPT-4V, LLaVA-1.5, and MANTIS identify the "what" effectively, they struggle with quantifying the "how" and extrapolating this rule to new objects. In contrast, children and adults exhibit much stronger analogical reasoning at all three stages. Additionally, the strongest tested model, GPT-o1, performs better in tasks involving simple surface-level visual attributes like color and size, correlating with quicker human adult response times. Conversely, more complex tasks such as number, rotation, and reflection, which necessitate extensive cognitive processing and understanding of extrinsic spatial properties in the physical world, present more significant challenges. Altogether, these findings highlight the limitations of training models on data that primarily consists of 2D images and text.
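The three-stage structure described in the abstract can be made concrete with a small scoring harness. The sketch below is illustrative only: the `Trial` fields and the `model.ask`/`model.choose` interface are hypothetical stand-ins, since the abstract does not specify KiVA's actual evaluation code.

```python
# Minimal sketch of a three-stage visual-analogy evaluation
# (what changed / how it changed / apply the rule to a new object).
# The model interface (ask, choose) and Trial fields are assumptions,
# not KiVA's published harness.
from dataclasses import dataclass

@dataclass
class Trial:
    train_before: str      # training image before the transformation
    train_after: str       # training image after the transformation
    test_before: str       # a new object, before the transformation
    candidates: list       # candidate "after" images for the new object
    correct_index: int     # index of the correct candidate
    domain: str            # e.g. "color", "size", "number", "rotation"
    rule: str              # e.g. "added one object"

def evaluate(model, trials):
    """Score each stage separately: what changed, how it changed,
    and whether the inferred rule extrapolates to a new object."""
    hits = {"what": 0, "how": 0, "apply": 0}
    for t in trials:
        pair = [t.train_before, t.train_after]
        # Stage 1: identify the dimension of change ("what")
        what = model.ask(images=pair, question="What changed?")
        hits["what"] += int(t.domain in what.lower())
        # Stage 2: quantify the change ("how")
        how = model.ask(images=pair, question="How did it change?")
        hits["how"] += int(t.rule in how.lower())
        # Stage 3: apply the rule to a new object ("apply")
        pick = model.choose(context=pair, test=t.test_before,
                            options=t.candidates)
        hits["apply"] += int(pick == t.correct_index)
    n = len(trials)
    return {stage: k / n for stage, k in hits.items()}
```

Scoring the stages separately is what makes the abstract's finding legible: a model can do well on "what" while still failing "how" and "apply".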
Related papers
- KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old? [79.27736230305516]
We introduce KidVis, a novel benchmark grounded in the theory of human visual development. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity. This study confirms that current MLLMs, despite their reasoning prowess, lack the essential physiological perceptual primitives required for generalized visual intelligence.
arXiv Detail & Related papers (2026-01-13T07:32:50Z) - Visual Room 2.0: Seeing is Not Understanding for MLLMs [9.870930749379932]
We introduce Visual Room 2.0, a hierarchical benchmark for evaluating perception-cognition alignment of MLLMs. We model human perceptive and cognitive processes across three levels: low, middle, and high, covering 17 representative tasks. The dataset contains 350 multi-modal samples, each with six progressive questions (2,100 in total) spanning perception to cognition.
arXiv Detail & Related papers (2025-11-17T03:34:52Z) - Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models? [5.726418224480853]
Vision-language models (VLMs) do not outperform text-only models in either task. VLMs perform significantly worse in the visual dimension compared to other sensory dimensions. Our findings underscore the need for more effective integration of embodied knowledge in multimodal language models.
arXiv Detail & Related papers (2025-10-19T16:43:04Z) - MiCo: Multi-image Contrast for Reinforcement Visual Reasoning [72.81576836419373]
Chain-of-Thought (CoT) reasoning can be used to link visual cues across multiple images. We adapt rule-based reinforcement learning for Vision-Language Models (VLMs). Our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
arXiv Detail & Related papers (2025-06-27T17:59:27Z) - Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations [61.235500325327585]
Existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE, a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through visual simulation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks.
arXiv Detail & Related papers (2025-06-05T05:09:46Z) - Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT [24.085953089267772]
We show how OpenAI o3 and GPT-4o fail to grasp basic physical laws, spatial interactions, and causal effects in complex scenes. We introduce MVPBench, a benchmark designed to rigorously evaluate visual physical reasoning through the lens of visual chain-of-thought (CoT). Experimental results reveal a concerning trend: even cutting-edge MLLMs exhibit poor visual reasoning accuracy and weak image-text alignment in physical domains.
arXiv Detail & Related papers (2025-05-30T03:48:59Z) - Computer Vision Models Show Human-Like Sensitivity to Geometric and Topological Concepts [1.935452308279137]
We investigate the sensitivity of computer vision models and humans to geometric and topological (GT) concepts. We do so using computer vision models trained on large image datasets. Transformer-based models achieve the highest overall accuracy, surpassing that of young children.
arXiv Detail & Related papers (2025-05-19T16:04:53Z) - LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception [105.78609483419115]
We introduce LongPerceptualThoughts, a new synthetic dataset with 30K long-thought traces for perceptual tasks.
We propose a novel three-stage data synthesis framework that first synthesizes verifiable multiple-choice questions.
We demonstrate notable improvements over existing visual reasoning data-generation methods.
arXiv Detail & Related papers (2025-04-21T18:10:38Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.
We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.
We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape.
We collect 35K trials of behavioral data from over 500 participants.
We then evaluate the performance of common vision models.
arXiv Detail & Related papers (2024-09-09T17:59:13Z) - What is the Visual Cognition Gap between Humans and Multimodal LLMs? [63.81347276258992]
We evaluate the visual cognition capability of Multimodal Large Language Models (MLLMs) and compare their performance with human visual cognition studies. Our comparative experiments with different baselines reveal a gap between MLLMs and human intelligence. We believe that the public release of MaRs-VQA and the Qwen2-VCog baseline model will drive progress toward the next generation of MLLMs with human-like visual cognition abilities.
arXiv Detail & Related papers (2024-06-14T22:02:21Z) - Visually Descriptive Language Model for Vector Graphics Reasoning [76.42082386029206]
We propose the Visually Descriptive Language Model (VDLM) to bridge the gap between low-level visual perception and high-level language reasoning.
We show that VDLM significantly improves state-of-the-art LMMs like GPT-4o on various multimodal perception and reasoning tasks.
arXiv Detail & Related papers (2024-04-09T17:30:18Z) - WinoViz: Probing Visual Properties of Objects Under Different States [39.92628807477848]
We present a text-only evaluation dataset consisting of 1,380 examples that probe the reasoning abilities of language models regarding variant visual properties of objects under different contexts or states.
Our task is challenging since it requires pragmatic reasoning (finding intended meanings) and visual knowledge reasoning.
We also present multi-hop data, a more challenging version of our data, which requires multi-step reasoning chains to solve our task.
arXiv Detail & Related papers (2024-02-21T07:31:47Z) - Does Conceptual Representation Require Embodiment? Insights From Large Language Models [9.390117546307042]
We compare representations of 4,442 lexical concepts between humans and ChatGPT (GPT-3.5 and GPT-4).
We identify two main findings: 1) Both models strongly align with human representations in non-sensorimotor domains but lag in sensory and motor areas, with GPT-4 outperforming GPT-3.5; 2) GPT-4's gains are associated with its additional visual learning, which also appears to benefit related dimensions like haptics and imageability.
arXiv Detail & Related papers (2023-05-30T15:06:28Z) - 3D Concept Learning and Reasoning from Multi-View Images [96.3088005719963]
We introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA).
This dataset consists of approximately 5k scenes, 600k images, paired with 50k questions.
We propose a novel 3D concept learning and reasoning framework that seamlessly combines neural fields, 2D pre-trained vision-language models, and neural reasoning operators.
arXiv Detail & Related papers (2023-03-20T17:59:49Z) - Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark [80.79082788458602]
We provide a new multi-task benchmark for evaluating text-to-image models.
We compare the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models.
Twenty computer science AI graduate students evaluated the two models on three tasks, at three difficulty levels, across ten prompts each.
arXiv Detail & Related papers (2022-11-22T09:27:53Z) - MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z) - Multi-Granularity Modularized Network for Abstract Visual Reasoning [15.956555435408557]
We focus on the Raven Progressive Matrices Test, designed to measure cognitive reasoning.
Inspired by cognitive studies, we propose a Multi-Granularity Modularized Network (MMoN) to bridge the gap between the processing of raw sensory information and symbolic reasoning.
arXiv Detail & Related papers (2020-07-09T09:54:05Z)