PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
- URL: http://arxiv.org/abs/2503.08481v2
- Date: Thu, 13 Mar 2025 11:19:12 GMT
- Title: PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
- Authors: Weijie Zhou, Manli Tao, Chaoyang Zhao, Haiyun Guo, Honghui Dong, Ming Tang, Jinqiao Wang
- Abstract summary: We propose a unified representation of physical reachability across diverse robots, i.e., the Space-Physical Reachability Map (S-P Map). PhysVLM is a vision-language model that integrates this reachability information into visual reasoning.
- Score: 31.532470258146073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the environment and a robot's physical reachability is crucial for task execution. While state-of-the-art vision-language models (VLMs) excel in environmental perception, they often generate inaccurate or impractical responses in embodied visual reasoning tasks due to a lack of understanding of robotic physical reachability. To address this issue, we propose a unified representation of physical reachability across diverse robots, i.e., Space-Physical Reachability Map (S-P Map), and PhysVLM, a vision-language model that integrates this reachability information into visual reasoning. Specifically, the S-P Map abstracts a robot's physical reachability into a generalized spatial representation, independent of specific robot configurations, allowing the model to focus on reachability features rather than robot-specific parameters. Subsequently, PhysVLM extends traditional VLM architectures by incorporating an additional feature encoder to process the S-P Map, enabling the model to reason about physical reachability without compromising its general vision-language capabilities. To train and evaluate PhysVLM, we constructed a large-scale multi-robot dataset, Phys100K, and a challenging benchmark, EQA-phys, which includes tasks for six different robots in both simulated and real-world environments. Experimental results demonstrate that PhysVLM outperforms existing models, achieving a 14% improvement over GPT-4o on EQA-phys and surpassing advanced embodied VLMs such as RoboMamba and SpatialVLM on the RoboVQA-val and OpenEQA benchmarks. Additionally, the S-P Map shows strong compatibility with various VLMs, and its integration into GPT-4o-mini yields a 7.1% performance improvement.
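The abstract does not spell out how the S-P Map is constructed, but the idea of abstracting reachability into a robot-agnostic, image-aligned representation can be illustrated with a short sketch: sample joint configurations within the robot's limits, run forward kinematics, and project the reachable end-effector positions into the camera view as a per-pixel mask. Everything below (the toy kinematics, function names, and camera parameters) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def forward_kinematics(q):
    """Hypothetical stand-in: end-effector position (x, y, z) in the robot
    base frame for joint configuration q (toy 2-link planar arm)."""
    l1, l2 = 0.4, 0.35
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y, 0.2])

def build_reachability_mask(joint_limits, K, T_cam_base, img_hw, n_samples=20000):
    """Monte-Carlo approximation of an image-aligned reachability mask:
    reachable workspace points are projected through the camera and the
    corresponding pixels are marked. Robot-specific details stay inside
    forward_kinematics, so the resulting mask itself is robot-agnostic."""
    h, w = img_hw
    mask = np.zeros((h, w), dtype=np.uint8)
    lows, highs = joint_limits[:, 0], joint_limits[:, 1]
    for _ in range(n_samples):
        q = np.random.uniform(lows, highs)
        p_base = np.append(forward_kinematics(q), 1.0)   # homogeneous point
        p_cam = T_cam_base @ p_base                       # base frame -> camera frame
        if p_cam[2] <= 0:                                 # behind the camera
            continue
        uv = K @ (p_cam[:3] / p_cam[2])                   # pinhole projection
        u, v = int(round(uv[0])), int(round(uv[1]))
        if 0 <= u < w and 0 <= v < h:
            mask[v, u] = 255
    return mask

# Example usage with made-up joint limits and camera parameters.
limits = np.array([[-np.pi, np.pi], [-np.pi / 2, np.pi / 2]])
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
mask = build_reachability_mask(limits, K, np.eye(4), img_hw=(480, 640))
```

Such a mask could then be encoded alongside the RGB observation, consistent with the abstract's description of an additional feature encoder that processes the S-P Map.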
Related papers
- Unreal Robotics Lab: A High-Fidelity Robotics Simulator with Advanced Physics and Rendering [4.760567755149477]
This paper presents a novel simulation framework that integrates the Unreal Engine's advanced rendering capabilities with MuJoCo's high-precision physics simulation.
Our approach enables realistic robotic perception while maintaining accurate physical interactions.
We benchmark visual navigation and SLAM methods within our framework, demonstrating its utility for testing real-world robustness in controlled yet diverse scenarios.
arXiv Detail & Related papers (2025-04-19T01:54:45Z)
- Taccel: Scaling Up Vision-based Tactile Robotics via High-performance GPU Simulation [50.34179054785646]
We present Taccel, a high-performance simulation platform that integrates IPC and ABD to model robots, tactile sensors, and objects with both accuracy and unprecedented speed.
Taccel provides precise physics simulation and realistic tactile signals while supporting flexible robot-sensor configurations through user-friendly APIs.
These capabilities position Taccel as a powerful tool for scaling up tactile robotics research and development.
arXiv Detail & Related papers (2025-04-17T12:57:11Z)
- PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding [21.91860938879665]
We show that Vision-Language Models (VLMs) excel in common-sense reasoning, but struggle with understanding the physical world.
We introduce PhysAgent, a framework that combines the generalization strengths of VLMs with the specialized expertise of vision models.
Our results show that enhancing VLMs' physical world understanding capabilities can help embodied agents such as MOKA.
arXiv Detail & Related papers (2025-01-27T18:59:58Z)
- TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies [95.30717188630432]
We introduce visual trace prompting to facilitate VLA models' spatial-temporal awareness for action prediction.
We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories.
We present a compact VLA model based on the 4B Phi-3-Vision backbone, pretrained on the Open X-Embodiment dataset and finetuned on our dataset.
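As a point of reference for "visual trace prompting": the rough sketch below overlays a past end-effector trajectory onto the observation as a visual prompt, assuming the 2D pixel positions of the trace are already available. The drawing routine and its parameters are illustrative, not the TraceVLA implementation.

```python
import cv2
import numpy as np

def overlay_visual_trace(image, trace_uv, color=(0, 0, 255)):
    """Draw a past trajectory (sequence of (u, v) pixel positions) onto a copy
    of the image so the policy can see where the end-effector has been."""
    prompted = image.copy()
    pts = np.asarray(trace_uv, dtype=np.int32)
    for i in range(1, len(pts)):
        alpha = i / max(len(pts) - 1, 1)                   # fade older segments
        seg_color = tuple(int(c * alpha) for c in color)
        cv2.line(prompted, tuple(map(int, pts[i - 1])),
                 tuple(map(int, pts[i])), seg_color, 2)
    if len(pts):
        cv2.circle(prompted, tuple(map(int, pts[-1])), 4, color, -1)  # current position
    return prompted
```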
arXiv Detail & Related papers (2024-12-13T18:40:51Z)
- Identifying Terrain Physical Parameters from Vision -- Towards Physical-Parameter-Aware Locomotion and Navigation [33.10872127224328]
We propose a cross-modal self-supervised learning framework for vision-based environmental physical parameter estimation.
We train a physical decoder in simulation to predict friction and stiffness from multi-modal input.
The trained decoder then labels real-world images with physical parameters in a self-supervised manner, providing supervision to further train a visual network during deployment.
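A minimal sketch of that self-supervised labeling loop, assuming a decoder already trained in simulation on proprioceptive input and precomputed image features for the paired real-world samples; the module sizes, feature dimensions, and loss are assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

class PhysicalDecoder(nn.Module):
    """Trained in simulation: proprioceptive features -> [friction, stiffness]."""
    def __init__(self, in_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, 2))
    def forward(self, proprio):
        return self.net(proprio)

class VisualEstimator(nn.Module):
    """Trained during deployment: image features -> [friction, stiffness]."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 2))
    def forward(self, img_feat):
        return self.head(img_feat)

def self_supervised_step(decoder, visual_net, optimizer, proprio, img_feat):
    """The frozen physical decoder labels a real-world sample; the visual
    network is regressed onto that pseudo-label."""
    with torch.no_grad():
        pseudo_label = decoder(proprio)
    loss = nn.functional.mse_loss(visual_net(img_feat), pseudo_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```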
arXiv Detail & Related papers (2024-08-29T14:35:14Z)
- PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models [58.33913881592706]
Humans can easily apply their intuitive physics to grasp skillfully and change grasps efficiently, even for objects they have never seen before.
This work delves into infusing such physical commonsense reasoning into robotic manipulation.
We introduce PhyGrasp, a multimodal large model that leverages inputs from two modalities: natural language and 3D point clouds.
arXiv Detail & Related papers (2024-02-26T18:57:52Z)
- ContPhy: Continuum Physical Concept Learning and Reasoning from Videos [86.63174804149216]
ContPhy is a novel benchmark for assessing machine physical commonsense.
We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy.
We also introduce an oracle model (ContPRO) that combines particle-based physical dynamics models with recent large language models.
arXiv Detail & Related papers (2024-02-09T01:09:21Z)
- DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative Diffusion Models [102.13968267347553]
We present DiffuseBot, a physics-augmented diffusion model that generates soft robot morphologies capable of excelling in a wide spectrum of tasks.
We showcase a range of simulated and fabricated robots along with their capabilities.
arXiv Detail & Related papers (2023-11-28T18:58:48Z)
- Physically Grounded Vision-Language Models for Robotic Manipulation [59.143640049407104]
We propose PhysObjects, an object-centric dataset of 39.6K crowd-sourced and 417K automated physical concept annotations.
We show that fine-tuning a vision-language model on PhysObjects improves its understanding of physical object concepts.
We incorporate this physically grounded VLM in an interactive framework with a large language model-based robotic planner.
arXiv Detail & Related papers (2023-09-05T20:21:03Z)