Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models
- URL: http://arxiv.org/abs/2312.04533v1
- Date: Thu, 7 Dec 2023 18:51:19 GMT
- Title: Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models
- Authors: Ivan Kapelyukh, Yifei Ren, Ignacio Alzugaray, Edward Johns
- Abstract summary: We introduce Dream2Real, a robotics framework which integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline.
This is achieved by the robot autonomously constructing a 3D representation of the scene, where objects can be rearranged virtually and an image of the resulting arrangement rendered.
These renders are evaluated by a VLM, so that the arrangement which best satisfies the user instruction is selected and recreated in the real world with pick-and-place.
- Score: 14.163489368617379
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Dream2Real, a robotics framework which integrates
vision-language models (VLMs) trained on 2D data into a 3D object rearrangement
pipeline. This is achieved by the robot autonomously constructing a 3D
representation of the scene, where objects can be rearranged virtually and an
image of the resulting arrangement rendered. These renders are evaluated by a
VLM, so that the arrangement which best satisfies the user instruction is
selected and recreated in the real world with pick-and-place. This enables
language-conditioned rearrangement to be performed zero-shot, without needing
to collect a training dataset of example arrangements. Results on a series of
real-world tasks show that this framework is robust to distractors,
controllable by language, capable of understanding complex multi-object
relations, and readily applicable to both tabletop and 6-DoF rearrangement
tasks.
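
The abstract describes a concrete loop: reconstruct the scene in 3D, virtually move an object to candidate poses, render each resulting arrangement, score the renders against the user instruction with a VLM, and execute the best-scoring arrangement with pick-and-place. The sketch below illustrates that loop under stated assumptions: the scene API (`scene.sample_poses`, `scene.render_with_object_at`) and the robot call are hypothetical placeholders rather than the authors' code, and a CLIP-style image-text similarity score stands in for whichever VLM the system actually uses.

```python
# Minimal sketch of a Dream2Real-style select-by-rendering loop (hypothetical API,
# not the authors' implementation). Assumes a 3D scene representation that can be
# re-rendered after virtually moving one object, and a CLIP-like VLM for scoring.

import numpy as np
import torch
import clip  # OpenAI CLIP; any image-text scoring VLM could be substituted
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def score_render(image: Image.Image, instruction: str) -> float:
    """Return a VLM similarity score between a rendered arrangement and the instruction."""
    with torch.no_grad():
        img = preprocess(image).unsqueeze(0).to(device)
        txt = clip.tokenize([instruction]).to(device)
        img_feat = model.encode_image(img)
        txt_feat = model.encode_text(txt)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
        return (img_feat @ txt_feat.T).item()

def best_arrangement(scene, movable_object: str, instruction: str, n_candidates: int = 256):
    """Virtually place the object at sampled poses, render, and keep the best-scoring pose.

    `scene.sample_poses(n)` and `scene.render_with_object_at(obj, pose)` are hypothetical
    placeholders for the reconstructed-scene pose sampler and renderer.
    """
    best_pose, best_score = None, -np.inf
    for pose in scene.sample_poses(n_candidates):                        # candidate 3-DoF or 6-DoF poses
        render = scene.render_with_object_at(movable_object, pose)       # PIL image of the virtual arrangement
        s = score_render(render, instruction)
        if s > best_score:
            best_pose, best_score = pose, s
    return best_pose, best_score

# The selected pose would then be handed to a pick-and-place planner, e.g.:
# goal_pose, _ = best_arrangement(scene, "fork", "set the fork to the left of the plate")
# robot.pick_and_place(movable_object="fork", target_pose=goal_pose)
```

In the paper's framework the 3D representation is built autonomously from the robot's own observations and both tabletop and 6-DoF rearrangement are supported; the sketch only shows the render-score-select step that makes the approach zero-shot.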
Related papers
- SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly (a generic voxel-occupancy sketch appears after this list).
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- 3D-VLA: A 3D Vision-Language-Action Generative World Model [68.0388311799959]
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world.
We propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action.
Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments.
arXiv Detail & Related papers (2024-03-14T17:58:41Z)
- DynaVol: Unsupervised Learning for Dynamic Scenes through Object-Centric Voxelization [67.85434518679382]
We present DynaVol, a 3D scene generative model that unifies geometric structures and object-centric learning.
The key idea is to perform object-centric voxelization to capture the 3D nature of the scene.
Voxel features evolve over time through a canonical-space deformation function, forming the basis for global representation learning.
arXiv Detail & Related papers (2023-04-30T05:29:28Z)
- HM3D-ABO: A Photo-realistic Dataset for Object-centric Multi-view 3D Reconstruction [37.29140654256627]
We present a photo-realistic object-centric dataset HM3D-ABO.
It is constructed by composing realistic indoor scenes with realistic objects.
The dataset could also be useful for tasks such as camera pose estimation and novel-view synthesis.
arXiv Detail & Related papers (2022-06-24T16:02:01Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
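
Both the VER and DynaVol entries above describe voxelizing a scene into a regular 3D grid (the sketch referenced in the VER entry). Below is a minimal, generic point-cloud-to-occupancy-grid conversion, intended as an illustration of that idea rather than either paper's actual method; the function name and parameters are hypothetical.

```python
# Generic voxel-occupancy sketch (illustrative only; not the VER or DynaVol authors' code).
import numpy as np

def voxelize(points: np.ndarray, voxel_size: float, grid_min: np.ndarray, grid_shape: tuple) -> np.ndarray:
    """Map an (N, 3) point cloud to a dense boolean occupancy grid.

    points:     (N, 3) xyz coordinates in the world frame
    voxel_size: edge length of each cubic cell (metres)
    grid_min:   (3,) world coordinate of the grid's minimum corner
    grid_shape: (X, Y, Z) number of cells along each axis
    """
    occupancy = np.zeros(grid_shape, dtype=bool)
    # Convert world coordinates to integer cell indices.
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    # Keep only the points that fall inside the grid bounds.
    in_bounds = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx = idx[in_bounds]
    occupancy[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occupancy

# Example: a 4 m x 4 m x 2 m volume at 5 cm resolution.
pts = np.random.rand(10_000, 3) * np.array([4.0, 4.0, 2.0])
grid = voxelize(pts, voxel_size=0.05, grid_min=np.zeros(3), grid_shape=(80, 80, 40))
print(grid.sum(), "occupied cells")
```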
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.