PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large
Multimodal Models
- URL: http://arxiv.org/abs/2402.16836v1
- Date: Mon, 26 Feb 2024 18:57:52 GMT
- Title: PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large
Multimodal Models
- Authors: Dingkun Guo, Yuqi Xiang, Shuqi Zhao, Xinghao Zhu, Masayoshi Tomizuka,
Mingyu Ding, Wei Zhan
- Abstract summary: Humans can easily apply their intuitive physics to grasp skillfully and change grasps efficiently, even for objects they have never seen before.
This work delves into infusing such physical commonsense reasoning into robotic manipulation.
We introduce PhyGrasp, a multimodal large model that leverages inputs from two modalities: natural language and 3D point clouds.
- Score: 58.33913881592706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robotic grasping is a fundamental aspect of robot functionality, defining how
robots interact with objects. Despite substantial progress, its
generalizability to counter-intuitive or long-tailed scenarios, such as objects
with uncommon materials or shapes, remains a challenge. In contrast, humans can
easily apply their intuitive physics to grasp skillfully and change grasps
efficiently, even for objects they have never seen before.
This work delves into infusing such physical commonsense reasoning into
robotic manipulation. We introduce PhyGrasp, a multimodal large model that
leverages inputs from two modalities: natural language and 3D point clouds,
seamlessly integrated through a bridge module. The language modality exhibits
robust reasoning capabilities concerning the impacts of diverse physical
properties on grasping, while the 3D modality comprehends object shapes and
parts. With these two capabilities, PhyGrasp is able to accurately assess the
physical properties of object parts and determine optimal grasping poses.
Additionally, the model's language comprehension enables human instruction
interpretation, generating grasping poses that align with human preferences. To
train PhyGrasp, we construct a dataset PhyPartNet with 195K object instances
with varying physical properties and human preferences, alongside their
corresponding language descriptions. Extensive experiments conducted in
simulation and on real robots demonstrate that PhyGrasp achieves
state-of-the-art performance, particularly in long-tailed cases, e.g., about
10% improvement in success rate over GraspNet. Project page:
https://sites.google.com/view/phygrasp
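The abstract describes two encoders (natural language and 3D point clouds) joined by a bridge module that feeds grasp prediction, but it does not spell out the layer-level design. The PyTorch sketch below is only an illustrative guess at that two-branch structure; the module names, feature sizes, simple MLP point encoder, and per-point graspability head are assumptions made for clarity, not the authors' implementation.

```python
# Illustrative sketch only: a two-branch model (language + point cloud) fused by a
# bridge module, in the spirit of the abstract. All names and sizes are assumptions.
import torch
import torch.nn as nn


class BridgeModule(nn.Module):
    """Fuses a language embedding with per-point features (assumed design)."""

    def __init__(self, lang_dim=768, point_dim=256, hidden=256):
        super().__init__()
        self.proj = nn.Linear(lang_dim, point_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * point_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, lang_emb, point_feats):
        # lang_emb: (B, lang_dim); point_feats: (B, N, point_dim)
        lang = self.proj(lang_emb).unsqueeze(1).expand(-1, point_feats.size(1), -1)
        return self.mlp(torch.cat([point_feats, lang], dim=-1))  # (B, N, hidden)


class TwoBranchGraspModel(nn.Module):
    """Hypothetical PhyGrasp-style model producing per-point grasp scores."""

    def __init__(self, lang_dim=768, point_dim=256, hidden=256):
        super().__init__()
        # Stand-in point encoder; a real system would use a PointNet-style backbone.
        self.point_encoder = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, point_dim)
        )
        self.bridge = BridgeModule(lang_dim, point_dim, hidden)
        self.grasp_head = nn.Linear(hidden, 1)  # per-point graspability logit

    def forward(self, points, lang_emb):
        # points: (B, N, 3) point cloud; lang_emb: (B, lang_dim) language feature
        feats = self.point_encoder(points)
        fused = self.bridge(lang_emb, feats)
        return self.grasp_head(fused).squeeze(-1)  # (B, N)


if __name__ == "__main__":
    model = TwoBranchGraspModel()
    scores = model(torch.randn(2, 1024, 3), torch.randn(2, 768))
    print(scores.shape)  # torch.Size([2, 1024])
```

In the real system the language embedding would presumably come from a large language model and the output would be full grasp poses rather than per-point scores; the sketch only illustrates the bridge-style fusion of the two modalities described in the abstract.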
Related papers
- DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative
Diffusion Models [102.13968267347553]
We present DiffuseBot, a physics-augmented diffusion model that generates soft robot morphologies capable of excelling in a wide spectrum of tasks.
We showcase a range of simulated and fabricated robots along with their capabilities.
arXiv Detail & Related papers (2023-11-28T18:58:48Z)
- Anthropomorphic Grasping with Neural Object Shape Completion [20.952799332420195]
Humans exhibit extraordinary dexterity when handling objects.
Hand postures commonly reflect the influence of the specific regions of an object that need to be grasped.
In this work, we leverage human-like object understanding by reconstructing and completing objects' full geometry from partial observations.
arXiv Detail & Related papers (2023-11-04T21:05:26Z)
- Physically Grounded Vision-Language Models for Robotic Manipulation [59.143640049407104]
We propose PhysObjects, an object-centric dataset of 39.6K crowd-sourced and 417K automated physical concept annotations.
We show that fine-tuning a vision-language model on PhysObjects improves its understanding of physical object concepts.
We incorporate this physically grounded VLM in an interactive framework with a large language model-based robotic planner.
arXiv Detail & Related papers (2023-09-05T20:21:03Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
- Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding [42.04502185508723]
We propose a new large Language-guided SHape grAsPing datasEt to promote 3D part-level affordance and grasping ability learning.
From the perspective of robotic cognition, we design a two-stage fine-grained robotic grasping framework (named LangPartGPD).
Our method combines the advantages of human-robot collaboration and large language models (LLMs).
Results show our method achieves competitive performance in 3D geometry fine-grained grounding, object affordance inference, and 3D part-aware grasping tasks.
arXiv Detail & Related papers (2023-01-27T07:00:54Z)
- LaTTe: Language Trajectory TransformEr [33.7939079214046]
This work proposes a flexible language-based framework to modify generic 3D robotic trajectories.
We employ an auto-regressive transformer to map natural language inputs and contextual images into changes in 3D trajectories.
We show through simulations and real-life experiments that the model can successfully follow human intent.
arXiv Detail & Related papers (2022-08-04T22:43:21Z)
- Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language [92.7638697243969]
We propose a unified framework that can jointly learn visual concepts and infer physics models of objects from videos and language.
This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.
arXiv Detail & Related papers (2021-10-28T17:59:13Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
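The summary above only states that the learned modules are tied together in a POMDP. As a generic illustration of that idea (not INVIGORATE's actual formulation), the toy sketch below keeps a belief over which detected object the user means and updates it by Bayes' rule when a clarifying question is answered; the candidate set, likelihoods, and question model are all invented for the example.

```python
# Toy illustration of a POMDP-style belief update over candidate target objects.
# This is NOT INVIGORATE's actual model; all numbers and models here are invented.
import numpy as np


def update_belief(belief, likelihood):
    """Bayes rule: posterior over candidates given an observation likelihood."""
    posterior = belief * likelihood
    return posterior / posterior.sum()


def entropy(p):
    return float(-np.sum(p * np.log(p + 1e-12)))


def expected_entropy_after_question(belief, p_yes_given_target):
    """Crude value of a clarifying question: expected posterior entropy."""
    p_yes = float(np.dot(belief, p_yes_given_target))
    post_yes = update_belief(belief, p_yes_given_target)
    post_no = update_belief(belief, 1.0 - p_yes_given_target)
    return p_yes * entropy(post_yes) + (1.0 - p_yes) * entropy(post_no)


# Three detected candidates; visual grounding yields a noisy initial belief.
belief = np.array([0.5, 0.3, 0.2])
# Hypothetical answer model: P(user answers "yes" | candidate i is the target).
p_yes = np.array([0.9, 0.2, 0.1])

print("expected entropy if we ask:", expected_entropy_after_question(belief, p_yes))
belief = update_belief(belief, p_yes)  # the user answered "yes"
print("updated belief:", belief)
```

A real system would plan over such beliefs to decide between asking another question and executing a grasp; the expected-entropy helper only hints at that trade-off.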
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- Language Grounding with 3D Objects [60.67796160959387]
We introduce a novel reasoning task that targets both visual and non-visual language about 3D objects.
We introduce several CLIP-based models for distinguishing objects.
We find that adding view estimation to language grounding models improves accuracy both on SNARE and when identifying objects referred to in language on a robot platform.
arXiv Detail & Related papers (2021-07-26T23:35:58Z)
- Physion: Evaluating Physical Prediction from Vision in Humans and Machines [46.19008633309041]
We present a visual and physical prediction benchmark that precisely measures this capability.
We compare an array of algorithms on their ability to make diverse physical predictions.
We find that graph neural networks with access to the physical state best capture human behavior.
arXiv Detail & Related papers (2021-06-15T16:13:39Z)