DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics
- URL: http://arxiv.org/abs/2210.02438v3
- Date: Thu, 4 May 2023 14:11:50 GMT
- Title: DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics
- Authors: Ivan Kapelyukh, Vitalis Vosylius, Edward Johns
- Abstract summary: We introduce the first work to explore web-scale diffusion models for robotics.
DALL-E-Bot enables a robot to rearrange objects in a scene.
We show that this is possible zero-shot using DALL-E.
- Score: 13.87953637017351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the first work to explore web-scale diffusion models for
robotics. DALL-E-Bot enables a robot to rearrange objects in a scene, by first
inferring a text description of those objects, then generating an image
representing a natural, human-like arrangement of those objects, and finally
physically arranging the objects according to that goal image. We show that
this is possible zero-shot using DALL-E, without needing any further example
arrangements, data collection, or training. DALL-E-Bot is fully autonomous and
is not restricted to a pre-defined set of objects or scenes, thanks to DALL-E's
web-scale pre-training. Encouraging real-world results, with both human studies
and objective metrics, show that integrating web-scale diffusion models into
robotics pipelines is a promising direction for scalable, unsupervised robot
learning.
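The three stages named in the abstract (describe the objects in text, generate a goal arrangement, then move the objects to match it) can be sketched as a small pipeline. This is only an illustrative outline, not the authors' implementation: `DetectedObject`, `build_prompt`, `plan_rearrangement`, and `fake_generator` are hypothetical names, poses are reduced to 2D table coordinates, and the image generator is replaced by a hard-coded stub so that the example runs without any model or robot.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Pose = Tuple[float, float]  # hypothetical (x, y) position on the table


@dataclass
class DetectedObject:
    label: str  # text caption inferred for the object, e.g. "fork"
    pose: Pose  # where the object currently sits in the scene


def build_prompt(objects: List[DetectedObject]) -> str:
    """Stage 1: turn the per-object captions into one text description."""
    return ", ".join(sorted(o.label for o in objects))


def plan_rearrangement(
    objects: List[DetectedObject],
    generate_goal: Callable[[str], Dict[str, Pose]],
) -> Tuple[str, List[Tuple[str, Pose, Pose]]]:
    """Stages 2 and 3: get goal poses from the generator, then list the moves."""
    prompt = build_prompt(objects)
    # Assumption: the generator returns one goal pose per object label.
    # In a real system this would involve generating a goal image from the
    # prompt and extracting object poses from that image.
    goal_poses = generate_goal(prompt)
    moves = []
    for obj in objects:
        target = goal_poses.get(obj.label)
        # Skip objects with no goal pose or that are already in place.
        if target is not None and target != obj.pose:
            moves.append((obj.label, obj.pose, target))
    return prompt, moves


def fake_generator(prompt: str) -> Dict[str, Pose]:
    """Stand-in for a text-to-image model: a fixed place-setting layout."""
    return {"fork": (-0.1, 0.0), "plate": (0.0, 0.0), "knife": (0.1, 0.0)}


scene = [
    DetectedObject("plate", (0.3, 0.2)),
    DetectedObject("fork", (0.0, 0.4)),
    DetectedObject("knife", (0.1, 0.0)),  # already at its goal pose
]
prompt, moves = plan_rearrangement(scene, fake_generator)
for label, start, goal in moves:
    print(f"move {label} from {start} to {goal}")
```

Passing the generator in as a function keeps the planning logic independent of any particular model, which mirrors the zero-shot framing in the abstract: the pre-trained generator is used as-is and only the surrounding pipeline is specific to the robot.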
Related papers
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
- SG-Bot: Object Rearrangement via Coarse-to-Fine Robotic Imagination on Scene Graphs [81.15889805560333]
We present SG-Bot, a novel rearrangement framework.
SG-Bot exemplifies lightweight, real-time, and user-controllable characteristics.
Experimental results demonstrate that SG-Bot outperforms competitors by a large margin.
arXiv Detail & Related papers (2023-09-21T15:54:33Z)
- WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model [92.90127398282209]
This paper investigates the potential of integrating the most recent Large Language Models (LLMs) with existing visual grounding and robotic grasping systems.
We introduce WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration.
We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task.
arXiv Detail & Related papers (2023-08-30T11:35:21Z)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- Open-World Object Manipulation using Pre-trained Vision-Language Models [72.87306011500084]
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary to their sensory observations and actions.
We develop a simple approach, which leverages a pre-trained vision-language model to extract object-identifying information.
In a variety of experiments on a real mobile manipulator, we find that the approach, called MOO, generalizes zero-shot to a wide range of novel object categories and environments.
arXiv Detail & Related papers (2023-03-02T01:55:10Z)
- Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation? [54.442692221567796]
Task specification is critical for engagement of non-expert end-users and adoption of personalized robots.
A widely studied approach to task specification is through goals, using either compact state vectors or goal images from the same robot scene.
In this work, we explore alternate and more general forms of goal specification that are expected to be easier for humans to specify and use.
arXiv Detail & Related papers (2022-04-23T19:39:49Z)
- Few-Shot Visual Grounding for Natural Human-Robot Interaction [0.0]
We propose a software architecture that segments a target object from a crowded scene, indicated verbally by a human user.
At the core of our system, we employ a multi-modal deep neural network for visual grounding.
We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets.
arXiv Detail & Related papers (2021-03-17T15:24:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.