VisualActBench: Can VLMs See and Act like a Human?
- URL: http://arxiv.org/abs/2512.09907v1
- Date: Wed, 10 Dec 2025 18:36:18 GMT
- Title: VisualActBench: Can VLMs See and Act like a Human?
- Authors: Daoan Zhang, Pai Liu, Xiaofei Zhou, Yuan Ge, Guangchen Lan, Jing Bi, Christopher Brinton, Ehsan Hoque, Jiebo Luo
- Abstract summary: Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions.
- Score: 47.16421650715271
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models' human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT-4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs' ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.
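To make the benchmark's structure concrete, the sketch below models one annotated action and a recall-style score over high-priority proactive actions. It is a minimal illustration under assumed details: the field names, the integer APL scale, exact string matching, and the metric itself are hypothetical, not the paper's published schema or evaluation protocol.

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(Enum):
    PROACTIVE = "proactive"
    REACTIVE = "reactive"


@dataclass
class AnnotatedAction:
    video_id: str            # clip identifier (hypothetical field name)
    description: str         # human-written action for the clip
    apl: int                 # Action Prioritization Level, assumed integer scale
    action_type: ActionType  # proactive vs. reactive label


def high_priority_proactive_recall(
    predictions: dict[str, set[str]],   # video_id -> actions the model proposed
    annotations: list[AnnotatedAction],
    apl_threshold: int = 2,             # assumed cutoff for "high priority"
) -> float:
    """Fraction of high-APL proactive ground-truth actions the model recovered."""
    targets = [
        a for a in annotations
        if a.action_type is ActionType.PROACTIVE and a.apl >= apl_threshold
    ]
    if not targets:
        return 0.0
    # Exact string matching here is a simplification; a real protocol
    # would likely use semantic matching between actions.
    hits = sum(
        a.description in predictions.get(a.video_id, set()) for a in targets
    )
    return hits / len(targets)


# Toy usage: one high-priority proactive annotation, recovered by the model.
anns = [AnnotatedAction("v001", "unplug the overheating charger", 3, ActionType.PROACTIVE)]
preds = {"v001": {"unplug the overheating charger"}}
print(high_priority_proactive_recall(preds, anns))  # 1.0
```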
Related papers
- PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments [36.84821207878773]
Visual reasoning in multimodal large language models (MLLMs) has primarily been studied in static, fully observable settings. We introduce the Active Visual Reasoning (AVR) task, extending visual reasoning to partially observable, interactive environments. We present a benchmark featuring multi-round interactive environments designed to assess both reasoning and information-gathering efficiency.
arXiv Detail & Related papers (2025-10-24T02:59:00Z)
- Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity. We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z)
- Learning to See and Act: Task-Aware View Planning for Robotic Manipulation [88.37482534484627]
Task-Aware View Planning (TAVP) is a framework designed to integrate active view planning with task-specific representation learning. Our proposed TAVP model achieves superior performance over state-of-the-art fixed-view approaches.
arXiv Detail & Related papers (2025-08-07T09:21:20Z)
- VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments [25.534332634912005]
We introduce Visual Strategic Bench (VS-Bench), a benchmark that evaluates Vision Language Models' strategic abilities in multi-agent environments. The performance of VLM agents is evaluated across three dimensions: perception, measured by element recognition accuracy; strategic reasoning, measured by next-action prediction accuracy; and decision-making, measured by normalized episode return (a minimal sketch of this normalization follows the related-papers list).
arXiv Detail & Related papers (2025-06-03T02:57:38Z)
- Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO [63.140883026848286]
Active vision refers to the process of actively selecting where and how to look in order to gather task-relevant information. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention.
arXiv Detail & Related papers (2025-05-27T17:29:31Z)
- Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models [81.08295968057453]
We present IVE, an agentic exploration framework inspired by human curiosity. We evaluate IVE in both simulated and real-world tabletop environments.
arXiv Detail & Related papers (2025-05-12T17:59:11Z)
- V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. We show that V-MAGE provides actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings.
arXiv Detail & Related papers (2025-04-08T15:43:01Z)
- A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework to dissect the perception-reasoning interface in Vision-Language Models (VLMs). We propose three distinct evaluation paradigms, mirroring human problem-solving strategies. Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-01-23T12:42:42Z)
- ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models [18.992215985625492]
We evaluate active perception in Multimodal Large Language Models (MLLMs). We focus on a specialized form of Visual Question Answering (VQA) that eases and quantifies evaluation yet remains challenging for existing MLLMs. We observe that restricted perceptual fields play a significant role in enabling active perception.
arXiv Detail & Related papers (2024-10-07T00:16:26Z)
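The VS-Bench entry above scores decision-making by normalized episode return. A minimal sketch follows, assuming simple min-max normalization against an environment's worst and best achievable returns; the exact normalization VS-Bench uses is not specified here.

```python
def normalized_episode_return(ret: float, ret_min: float, ret_max: float) -> float:
    """Min-max normalize an episode return to [0, 1].

    Assumes ret_min/ret_max are the worst and best achievable returns
    for the environment; degenerate ranges map to 0.
    """
    if ret_max <= ret_min:
        return 0.0
    return (ret - ret_min) / (ret_max - ret_min)


# Example: an agent scoring 12 in an environment whose returns span [-10, 20].
print(normalized_episode_return(12.0, -10.0, 20.0))  # ~0.733
```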
This list is automatically generated from the titles and abstracts of the papers on this site.