Related papers: Vision Language Models See What You Want but not What You See

URL: http://arxiv.org/abs/2410.00324v1
Date: Tue, 1 Oct 2024 01:52:01 GMT
Title: Vision Language Models See What You Want but not What You See
Authors: Qingying Gao, Yijiang Li, Haiyun Lyu, Haoran Sun, Dezhi Luo, Hokin Deng
Abstract summary: Knowing others' intentions and taking others' perspectives are two core components of human intelligence. In this paper, we investigate intentionality understanding and perspective-taking in Vision Language Models. Surprisingly, we find VLMs achieving high performance on intentionality understanding but lower performance on perspective-taking.
Score: 9.268588981925234
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Knowing others' intentions and taking others' perspectives are two core components of human intelligence that are typically considered to be instantiations of theory-of-mind. Infiltrating machines with these abilities is an important step towards building human-level artificial intelligence. Recently, Li et al. built CogDevelop2K, a data-intensive cognitive experiment benchmark to assess the developmental trajectory of machine intelligence. Here, to investigate intentionality understanding and perspective-taking in Vision Language Models, we leverage the IntentBench and PerspectBench of CogDevelop2K, which contain over 300 cognitive experiments grounded in real-world scenarios and classic cognitive tasks, respectively. Surprisingly, we find VLMs achieving high performance on intentionality understanding but lower performance on perspective-taking. This challenges the common belief in cognitive science literature that perspective-taking at the corresponding modality is necessary for intentionality understanding.

Related papers

Spatial Mental Modeling from Limited Views [71.57140964322559]
Our new MindCube benchmark with 21,154 questions across 3,268 images exposes this critical gap. Using MindCube, we evaluate how well Vision Language Models (VLMs) build robust spatial mental models. We then explore three approaches to help VLMs approximate spatial mental models, including unseen intermediate views, natural language reasoning chains, and cognitive maps.
arXiv Detail & Related papers (2025-06-26T16:38:19Z)
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces [90.96731971685115]
VeBrain is a unified framework for perception, reasoning, and control in the real world. VeBrain reformulates robotic control into common text-based MLLM tasks in the 2D visual space. VeBrain shows strong adaptability, flexibility, and compositional capabilities compared to existing methods.
arXiv Detail & Related papers (2025-05-30T18:00:34Z)
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning [11.242852367476015]
DeepEyes is a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning. We propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks.
arXiv Detail & Related papers (2025-05-20T13:48:11Z)
Core Knowledge Deficits in Multi-Modal Language Models [8.461561516444261]
We examine the hypothesis that deficiencies stem from the absence of core knowledge innate to humans from early childhood. Our findings reveal core knowledge deficits in early-developed core abilities, while models demonstrate human-comparable performance in high-level cognition. We introduce an evaluation technique, Concept Hacking, through which we demonstrate that MLLMs do not genuinely advance toward core knowledge.
arXiv Detail & Related papers (2024-10-06T20:13:11Z)
Probing Mechanical Reasoning in Large Vision Language Models [9.268588981925234]
Mechanical reasoning allows us to design tools, build bridges and canals, and construct houses which set the foundation of human civilization. We leverage the MechBench of CogDevelop2K to test understanding of mechanical system stability, gears and pulley systems, seesaw-like systems and leverage principle, inertia and motion.
arXiv Detail & Related papers (2024-10-01T01:33:10Z)
Visual Knowledge in the Big Model Era: Retrospect and Prospect [63.282425615863]
Visual knowledge is a new form of knowledge representation that can encapsulate visual concepts and their relations in a succinct, comprehensive, and interpretable manner. As the knowledge about the visual world has been identified as an indispensable component of human cognition and intelligence, visual knowledge is poised to have a pivotal role in establishing machine intelligence.
arXiv Detail & Related papers (2024-04-05T07:31:24Z)
Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models [71.93366651585275]
Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks. We propose Visualization-of-Thought (VoT) to elicit spatial reasoning of LLMs by visualizing their reasoning traces. VoT significantly enhances the spatial reasoning abilities of LLMs.
arXiv Detail & Related papers (2024-04-04T17:45:08Z)
Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents. We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations. We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z)
Machine Psychology [54.287802134327485]
We argue that a fruitful direction for research is engaging large language models in behavioral experiments inspired by psychology. We highlight theoretical perspectives, experimental paradigms, and computational analysis techniques that this approach brings to the table. It paves the way for a "machine psychology" for generative artificial intelligence (AI) that goes beyond performance benchmarks.
arXiv Detail & Related papers (2023-03-24T13:24:41Z)
Beyond Interpretable Benchmarks: Contextual Learning through Cognitive and Multimodal Perception [0.0]
This study contends that the Turing Test is misinterpreted as an attempt to anthropomorphize computer systems. It emphasizes tacit learning as a cornerstone of general-purpose intelligence, despite its lack of overt interpretability.
arXiv Detail & Related papers (2022-12-04T08:30:04Z)
EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question-answering on real-world egocentric videos. We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others. We evaluate state-of-the-art video reasoning models on our benchmark and show significant gaps between them and humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z)
A World-Self Model Towards Understanding Intelligence [0.0]
We will compare human and artificial intelligence, and propose that a certain aspect of human intelligence is the key to connect perception and cognition. We will present the broader idea of "concept", the principles and mathematical frameworks of the new World-Self Model (WSM) of intelligence, and finally a unified general framework of intelligence based on WSM.
arXiv Detail & Related papers (2022-03-25T16:42:23Z)
Building Human-like Communicative Intelligence: A Grounded Perspective [1.0152838128195465]
After making astounding progress in language learning, AI systems seem to approach the ceiling that does not reflect important aspects of human communicative capacities. This paper suggests that the dominant cognitively-inspired AI directions, based on nativist and symbolic paradigms, lack necessary substantiation and concreteness to guide progress in modern AI. I propose a list of concrete, implementable components for building "grounded" linguistic intelligence.
arXiv Detail & Related papers (2022-01-02T01:43:24Z)
Visual Perspective Taking for Opponent Behavior Modeling [22.69165968663182]
We propose an end-to-end long-term visual prediction framework for robots. We demonstrate our approach in the context of visual hide-and-seek. We suggest that visual behavior modeling and perspective taking skills will play a critical role in the ability of physical robots to fully integrate into real-world multi-agent activities.
arXiv Detail & Related papers (2021-05-11T16:02:32Z)
Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning [78.13740873213223]
Bongard problems (BPs) were introduced as an inspirational challenge for visual cognition in intelligent systems. We propose a new benchmark Bongard-LOGO for human-level concept learning and reasoning.
arXiv Detail & Related papers (2020-10-02T03:19:46Z)
Machine Common Sense [77.34726150561087]
Machine common sense remains a broad, potentially unbounded problem in artificial intelligence (AI). This article deals with aspects of modeling commonsense reasoning, focusing on the domain of interpersonal interactions.
arXiv Detail & Related papers (2020-06-15T13:59:47Z)
Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense [142.53911271465344]
We argue that the next generation of AI must embrace "dark" humanlike common sense for solving novel tasks. We identify functionality, physics, intent, causality, and utility (FPICU) as the five core domains of cognitive AI with humanlike common sense.
arXiv Detail & Related papers (2020-04-20T04:07:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.