PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?
- URL: http://arxiv.org/abs/2506.23725v1
- Date: Mon, 30 Jun 2025 10:58:36 GMT
- Title: PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?
- Authors: Atharva Gundawar, Som Sagar, Ransalu Senanayake
- Abstract summary: PAC Bench is a benchmark designed to evaluate vision-language models (VLMs) on their understanding of core Properties, Affordances, and Constraints (PAC). Our evaluations reveal significant gaps in the ability of current VLMs to grasp fundamental physical concepts, highlighting limitations in their suitability for reliable robot manipulation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) are increasingly pivotal for generalist robot manipulation, enabling tasks such as physical reasoning, policy generation, and failure detection. However, their proficiency in these high-level applications often assumes a deep understanding of low-level physical prerequisites, a capability that remains largely unverified. For robots to perform actions reliably, they must comprehend intrinsic object properties (e.g., material, weight), action affordances (e.g., graspable, stackable), and physical constraints (e.g., stability, reachability, or an object's state, such as being closed). Despite the widespread use of VLMs in manipulation tasks, we argue that off-the-shelf models may lack this granular, physically grounded understanding, as such prerequisites are often overlooked during training. To address this critical gap, we introduce PAC Bench, a comprehensive benchmark designed to systematically evaluate VLMs on their understanding of core Properties, Affordances, and Constraints (PAC) from a task executability perspective. PAC Bench features a diverse dataset with over 30,000 annotations, comprising 673 real-world images (115 object classes, 15 property types, and 1 to 3 affordances defined per class), 100 real-world humanoid-view scenarios, and 120 unique simulated constraint scenarios across four tasks. Our evaluations reveal significant gaps in the ability of current VLMs to grasp fundamental physical concepts, highlighting limitations in their suitability for reliable robot manipulation and pointing to key areas for targeted research. PAC Bench also serves as a standardized benchmark for rigorously evaluating physical reasoning in VLMs and guiding the development of more robust, physically grounded models for robotic applications. Project Page: https://pacbench.github.io/
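The abstract specifies the benchmark's composition (multiple-choice style annotations over object properties, affordances, and constraints across real, humanoid-view, and simulated scenes), but not the exact data schema. The sketch below is a minimal, hypothetical illustration of how such a PAC-style evaluation item and per-category scoring loop might be structured; the field names and the `query_vlm` placeholder are assumptions for illustration, not the benchmark's actual schema or API.

```python
# Minimal, hypothetical sketch of a PAC-style evaluation loop.
# Field names and the query_vlm placeholder are illustrative assumptions,
# not the actual PAC Bench schema or API.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PACItem:
    """One item probing a Property, Affordance, or Constraint."""
    image_path: str        # real-world, humanoid-view, or simulated scene
    category: str          # "property" | "affordance" | "constraint"
    question: str          # e.g. "Is the drawer closed?"
    choices: List[str]     # multiple-choice options shown to the VLM
    answer: str            # ground-truth option from the annotations


def query_vlm(image_path: str, prompt: str, choices: List[str]) -> str:
    """Placeholder model call: always picks the first option.

    Swap in a real vision-language model API here.
    """
    return choices[0]


def evaluate(items: List[PACItem]) -> Dict[str, float]:
    """Per-category accuracy, mirroring a properties/affordances/constraints breakdown."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for item in items:
        prompt = (f"{item.question}\nOptions: {', '.join(item.choices)}\n"
                  "Answer with exactly one option.")
        prediction = query_vlm(item.image_path, prompt, item.choices)
        total[item.category] = total.get(item.category, 0) + 1
        if prediction.strip().lower() == item.answer.strip().lower():
            correct[item.category] = correct.get(item.category, 0) + 1
    return {cat: correct.get(cat, 0) / n for cat, n in total.items()}


if __name__ == "__main__":
    demo = [PACItem("mug.jpg", "affordance",
                    "Can this mug be grasped by a parallel-jaw gripper?",
                    ["yes", "no"], "yes")]
    print(evaluate(demo))  # {'affordance': 1.0} with the trivial first-option baseline
```

Reporting accuracy per category rather than in aggregate matches the paper's framing of properties, affordances, and constraints as distinct prerequisites for executable manipulation.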
Related papers
- PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly [77.33429729761596]
We introduce PhyBlock, a progressive benchmark to assess vision-language models (VLMs) on physical understanding and planning. PhyBlock integrates a novel four-level cognitive hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning.
arXiv Detail & Related papers (2025-06-10T11:46:06Z)
- PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability [31.532470258146073]
We propose a unified representation of physical reachability across diverse robots, i.e., the Space-Physical Reachability Map (S-P Map). PhysVLM is a vision-language model that integrates this reachability information into visual reasoning.
arXiv Detail & Related papers (2025-03-11T14:34:41Z)
- Afford-X: Generalizable and Slim Affordance Reasoning for Task-oriented Manipulation [29.541362796943837]
We introduce LVIS-Aff, a large-scale dataset comprising 1,496 tasks and 119k images, designed to enhance the generalizability of affordance reasoning from perception. We develop Afford-X, an end-to-end trainable affordance reasoning model that incorporates Verbizable Attention and Bi-Fusion modules. Our work demonstrates the potential for efficient, general affordance reasoning models that can be deployed on local devices for task-oriented manipulations.
arXiv Detail & Related papers (2025-03-05T14:44:53Z)
- Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have been proven effective as a succinct representation for capturing essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z)
- Robotic Control via Embodied Chain-of-Thought Reasoning [86.6680905262442]
A key limitation of learned robot control policies is their inability to generalize outside their training data. Recent works on vision-language-action models (VLAs) have shown that the use of large, internet pre-trained vision-language models can substantially improve their robustness and generalization ability. We introduce Embodied Chain-of-Thought Reasoning (ECoT) for VLAs, in which we train VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features before predicting the robot action.
arXiv Detail & Related papers (2024-07-11T17:31:01Z)
- RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents [105.13169239919272]
We propose RH20T-P, a primitive-level robotic manipulation dataset. It contains about 38k video clips covering 67 diverse manipulation tasks in real-world scenarios. We standardize a plan-execute CGA paradigm and implement an exemplar baseline called RA-P on our RH20T-P.
arXiv Detail & Related papers (2024-03-28T17:42:54Z)
- ManiPose: A Comprehensive Benchmark for Pose-aware Object Manipulation in Robotics [55.85916671269219]
This paper introduces ManiPose, a pioneering benchmark designed to advance the study of pose-varying manipulation tasks.
A comprehensive dataset features geometrically consistent and manipulation-oriented 6D pose labels for 2936 real-world scanned rigid objects and 100 articulated objects.
Our benchmark demonstrates notable advancements in pose estimation, pose-aware manipulation, and real-robot skill transfer.
arXiv Detail & Related papers (2024-03-20T07:48:32Z)
- Learning Environment-Aware Affordance for 3D Articulated Object Manipulation under Occlusions [9.400505355134728]
We propose an environment-aware affordance framework that incorporates both object-level actionable priors and environment constraints.
We introduce a novel contrastive affordance learning framework capable of training on scenes containing a single occluder and generalizing to scenes with complex occluder combinations.
arXiv Detail & Related papers (2023-09-14T08:24:32Z)
- Physically Grounded Vision-Language Models for Robotic Manipulation [59.143640049407104]
We propose PhysObjects, an object-centric dataset of 39.6K crowd-sourced and 417K automated physical concept annotations.
We show that fine-tuning a vision-language model on PhysObjects improves its understanding of physical object concepts.
We incorporate this physically grounded VLM in an interactive framework with a large language model-based robotic planner.
arXiv Detail & Related papers (2023-09-05T20:21:03Z)