Physically Grounded Vision-Language Models for Robotic Manipulation
- URL: http://arxiv.org/abs/2309.02561v4
- Date: Sun, 3 Mar 2024 08:12:36 GMT
- Title: Physically Grounded Vision-Language Models for Robotic Manipulation
- Authors: Jensen Gao, Bidipta Sarkar, Fei Xia, Ted Xiao, Jiajun Wu, Brian
Ichter, Anirudha Majumdar, Dorsa Sadigh
- Abstract summary: We propose PhysObjects, an object-centric dataset of 39.6K crowd-sourced and 417K automated physical concept annotations.
We show that fine-tuning a vision-language model on PhysObjects improves its understanding of physical object concepts.
We incorporate this physically grounded VLM in an interactive framework with a large language model-based robotic planner.
- Score: 59.143640049407104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in vision-language models (VLMs) have led to improved
performance on tasks such as visual question answering and image captioning.
Consequently, these models are now well-positioned to reason about the physical
world, particularly within domains such as robotic manipulation. However,
current VLMs are limited in their understanding of the physical concepts (e.g.,
material, fragility) of common objects, which restricts their usefulness for
robotic manipulation tasks that involve interaction and physical reasoning
about such objects. To address this limitation, we propose PhysObjects, an
object-centric dataset of 39.6K crowd-sourced and 417K automated physical
concept annotations of common household objects. We demonstrate that
fine-tuning a VLM on PhysObjects improves its understanding of physical object
concepts, including generalization to held-out concepts, by capturing human
priors of these concepts from visual appearance. We incorporate this physically
grounded VLM in an interactive framework with a large language model-based
robotic planner, and show improved planning performance on tasks that require
reasoning about physical object concepts, compared to baselines that do not
leverage physically grounded VLMs. We additionally illustrate the benefits of
our physically grounded VLM on a real robot, where it improves task success
rates. We release our dataset and provide further details and visualizations of
our results at https://iliad.stanford.edu/pg-vlm/.
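The interactive framework described above pairs an LLM-based planner with the physically grounded VLM: the planner queries the VLM about physical concepts of scene objects and conditions its plan on the answers. The sketch below is only an illustration of that query-then-plan pattern under assumed interfaces; the function names, prompts, and concept list are placeholders, not the released PG-VLM code.

```python
# Minimal sketch (not the released PG-VLM code) of an LLM planner that
# queries a physically grounded VLM about per-object physical concepts
# before planning. All names and prompts are illustrative placeholders.

from typing import Callable, Dict, List

# Physical concepts of the kind annotated in PhysObjects.
CONCEPTS = ["material", "fragility", "mass", "deformability"]


def query_vlm(
    vlm: Callable[[bytes, str], str],
    image: bytes,
    obj: str,
    concept: str,
) -> str:
    """Ask the (fine-tuned) VLM a short concept question about one object crop."""
    prompt = f"Question: What is the {concept} of the {obj}? Short answer:"
    return vlm(image, prompt)


def grounded_plan(
    llm: Callable[[str], str],
    vlm: Callable[[bytes, str], str],
    instruction: str,
    scene: Dict[str, bytes],  # object name -> image crop
) -> str:
    """Collect physical-property answers for the scene, then plan with the LLM."""
    facts: List[str] = []
    for obj, crop in scene.items():
        for concept in CONCEPTS:
            answer = query_vlm(vlm, crop, obj, concept)
            facts.append(f"{obj}: {concept} = {answer}")
    planner_prompt = (
        "Objects and physical properties:\n"
        + "\n".join(facts)
        + f"\nInstruction: {instruction}\nPlan (numbered steps):"
    )
    return llm(planner_prompt)
```

In practice, a planner would likely query only the properties relevant to the given instruction rather than every concept for every object; the loop above is kept exhaustive only to make the query-then-plan structure explicit.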
Related papers
- Which objects help me to act effectively? Reasoning about physically-grounded affordances [0.6291443816903801]
A key aspect of this understanding lies in detecting an object's affordances.
Our approach leverages a dialogue of large language models (LLMs) and vision-language models (VLMs) to achieve open-world affordance detection.
By grounding our system in the physical world, we account for the robot's embodiment and the intrinsic properties of the objects it encounters.
arXiv Detail & Related papers (2024-07-18T11:08:57Z)
- Unsupervised Dynamics Prediction with Object-Centric Kinematics [22.119612406160073]
We propose Object-Centric Kinematics (OCK), a framework for dynamics prediction leveraging object-centric representations.
OCK consists of low-level structured states of objects' position, velocity, and acceleration.
Our model demonstrates superior performance when handling objects and backgrounds in complex scenes characterized by a wide range of object attributes and dynamic movements.
arXiv Detail & Related papers (2024-04-29T04:47:23Z)
- PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation [62.53760963292465]
PhysDreamer is a physics-based approach that endows static 3D objects with interactive dynamics.
We present our approach on diverse examples of elastic objects and evaluate the realism of the synthesized interactions through a user study.
arXiv Detail & Related papers (2024-04-19T17:41:05Z)
- PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models [58.33913881592706]
Humans can easily apply their intuitive physics to grasp skillfully and change grasps efficiently, even for objects they have never seen before.
This work delves into infusing such physical commonsense reasoning into robotic manipulation.
We introduce PhyGrasp, a multimodal large model that leverages inputs from two modalities: natural language and 3D point clouds.
arXiv Detail & Related papers (2024-02-26T18:57:52Z)
- AffordanceLLM: Grounding Affordance from Vision Language Models [36.97072698640563]
Affordance grounding refers to the task of finding the area of an object with which one can interact.
Much of this knowledge is hidden and lies beyond both the image content and the supervised labels available from a limited training set.
We attempt to improve the generalization capability of current affordance grounding by taking advantage of rich world, abstract, and human-object-interaction knowledge.
arXiv Detail & Related papers (2024-01-12T03:21:02Z)
- WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model [92.90127398282209]
This paper investigates the potential of integrating the most recent Large Language Models (LLMs) with an existing visual grounding and robotic grasping system.
We introduce WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration.
We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task.
arXiv Detail & Related papers (2023-08-30T11:35:21Z)
- Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language [92.7638697243969]
We propose a unified framework that can jointly learn visual concepts and infer physics models of objects from videos and language.
This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.
arXiv Detail & Related papers (2021-10-28T17:59:13Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction [24.72947291987545]
A key challenge for an agent learning to interact with the world is to reason about the physical properties of objects.
We propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images.
arXiv Detail & Related papers (2020-08-02T11:04:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.