Robotic Skill Acquisition via Instruction Augmentation with
Vision-Language Models
- URL: http://arxiv.org/abs/2211.11736v3
- Date: Sat, 1 Jul 2023 05:38:52 GMT
- Title: Robotic Skill Acquisition via Instruction Augmentation with
Vision-Language Models
- Authors: Ted Xiao and Harris Chan and Pierre Sermanet and Ayzaan Wahid and
Anthony Brohan and Karol Hausman and Sergey Levine and Jonathan Tompson
- Abstract summary: We introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL).
We utilize semi-supervised language labels, leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data.
DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
- Score: 70.82705830137708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, much progress has been made in learning robotic manipulation
policies that follow natural language instructions. Such methods typically
learn from corpora of robot-language data that was either collected with
specific tasks in mind or expensively re-labelled by humans with rich language
descriptions in hindsight. Recently, large-scale pretrained vision-language
models (VLMs) like CLIP or ViLD have been applied to robotics for learning
representations and scene descriptors. Can these pretrained models serve as
automatic labelers for robot data, effectively importing Internet-scale
knowledge into existing datasets to make them useful even for tasks that are
not reflected in their ground truth annotations? To accomplish this, we
introduce Data-driven Instruction Augmentation for Language-conditioned control
(DIAL): we utilize semi-supervised language labels leveraging the semantic
understanding of CLIP to propagate knowledge onto large datasets of unlabelled
demonstration data and then train language-conditioned policies on the
augmented datasets. This method enables cheaper acquisition of useful language
descriptions compared to expensive human labels, allowing for more efficient
label coverage of large-scale datasets. We apply DIAL to a challenging
real-world robotic manipulation domain where 96.5% of the 80,000 demonstrations
do not contain crowd-sourced language annotations. DIAL enables imitation
learning policies to acquire new capabilities and generalize to 60 novel
instructions unseen in the original dataset.
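As a rough illustration of the relabeling step described above, the sketch below scores a pool of candidate instructions against frames from an unlabeled episode using off-the-shelf CLIP and keeps the highest-scoring candidates as pseudo-labels. This is a minimal sketch under stated assumptions, not the authors' implementation: DIAL additionally fine-tunes the VLM on the small crowd-sourced labeled subset before relabeling, and the candidate pool, frame selection, and top-k cutoff here are illustrative placeholders.

```python
# Minimal sketch of CLIP-based instruction relabeling in the spirit of DIAL.
# Not the authors' code: the candidate instruction pool, the choice of frames,
# and the top-k cutoff below are illustrative assumptions.
import torch
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def score_instructions(frame_paths, candidate_instructions):
    """Return one CLIP similarity score per candidate instruction,
    averaged over the sampled frames of a single demonstration episode."""
    images = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    tokens = clip.tokenize(candidate_instructions).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(images)
        txt_emb = model.encode_text(tokens)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        sims = img_emb @ txt_emb.T  # shape: (num_frames, num_candidates)
    return sims.mean(dim=0)

# Relabel one unlabeled episode with its best-matching candidate instructions;
# the resulting (episode, instruction) pairs feed language-conditioned imitation learning.
frames = ["episode_0000/frame_final.png"]  # hypothetical path to the episode's last frame
candidates = ["pick up the sponge", "open the middle drawer", "move the can near the apple"]
scores = score_instructions(frames, candidates)
pseudo_labels = [candidates[i] for i in scores.topk(k=2).indices.tolist()]
```

Averaging similarities over several frames, rather than relying on a single image, is one simple way to make such pseudo-labels less sensitive to an ambiguous frame.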
Related papers
- Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models [4.117963644248323]
We introduce NILS: Natural language Instruction Labeling for Scalability.
NILS automatically labels uncurated, long-horizon robot data at scale.
We use NILS to label over 115k trajectories obtained from over 430 hours of robot data.
arXiv Detail & Related papers (2024-10-23T11:19:48Z)
- KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data [45.25288643161976]
We propose Keypoint Affordance Learning from Imagined Environments (KALIE) for robotic control in a scalable manner.
Instead of directly producing motor commands, KALIE controls the robot by predicting point-based affordance representations.
We demonstrate that KALIE can learn to robustly solve new manipulation tasks with unseen objects given only 50 example data points.
arXiv Detail & Related papers (2024-09-21T08:45:16Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
Vision Language Models (VLMs) can process state information as visual-textual prompts and respond with policy decisions in text.
We propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- Object-Centric Instruction Augmentation for Robotic Manipulation [29.491990994901666]
We introduce the Object-Centric Instruction Augmentation (OCI) framework to augment highly semantic and information-dense language instructions with position cues.
We utilize a Multi-modal Large Language Model (MLLM) to weave knowledge of object locations into natural language instruction.
We demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.
arXiv Detail & Related papers (2024-01-05T13:54:45Z)
- Learning to Model the World with Language [100.76069091703505]
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
arXiv Detail & Related papers (2023-07-31T17:57:49Z)
- Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control [58.06223121654735]
We show a method that taps into joint image- and goal- conditioned policies with language using only a small amount of language data.
Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image itself, but to the desired change between the start and goal images that the instruction describes.
We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data.
arXiv Detail & Related papers (2023-06-30T20:09:39Z)
- Grounding Language with Visual Affordances over Unstructured Data [26.92329260907805]
We propose a novel approach to efficiently learn language-conditioned robot skills from unstructured, offline and reset-free data.
We exploit a self-supervised visuo-lingual affordance model, which requires language annotations for as little as 1% of the total data.
We find that our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches.
arXiv Detail & Related papers (2022-10-04T21:16:48Z)
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [119.29555551279155]
Large language models can encode a wealth of semantic knowledge about the world.
Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language.
We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions (a minimal sketch of this skill-selection idea appears after this list).
arXiv Detail & Related papers (2022-04-04T17:57:11Z)
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)
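The "Do As I Can, Not As I Say" (SayCan) entry above pairs a language model's ranking of candidate skills with learned affordance (value) estimates of whether each skill can currently succeed, and selects the skill that maximizes the product of the two. The sketch below is a minimal, hypothetical rendering of that selection rule; llm_log_prob and affordance_value are assumed placeholder interfaces, not the paper's actual API.

```python
# Hypothetical sketch of combining LLM usefulness with affordance feasibility,
# in the spirit of SayCan. The scorer interfaces are placeholders, not real APIs.
import math
from typing import Callable, Sequence

def select_next_skill(
    instruction: str,
    skills: Sequence[str],
    llm_log_prob: Callable[[str, str], float],   # log p(skill text | instruction)
    affordance_value: Callable[[str], float],    # estimated success probability now
) -> str:
    """Pick the skill maximizing usefulness (LLM score of the skill given the
    instruction) times feasibility (value-function estimate from the current state)."""
    def combined(skill: str) -> float:
        usefulness = math.exp(llm_log_prob(instruction, skill))
        feasibility = affordance_value(skill)
        return usefulness * feasibility
    return max(skills, key=combined)

# Toy stand-ins for the two scorers, just to show the call pattern.
toy_llm = lambda instr, skill: {"pick up the sponge": -0.3, "go to the sink": -1.2}.get(skill, -5.0)
toy_value = lambda skill: 0.9 if skill == "pick up the sponge" else 0.4
print(select_next_skill("clean the spill", ["pick up the sponge", "go to the sink"], toy_llm, toy_value))
```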