Goal Representations for Instruction Following: A Semi-Supervised
Language Interface to Control
- URL: http://arxiv.org/abs/2307.00117v2
- Date: Fri, 18 Aug 2023 00:47:00 GMT
- Title: Goal Representations for Instruction Following: A Semi-Supervised
Language Interface to Control
- Authors: Vivek Myers, Andre He, Kuan Fang, Homer Walke, Philippe
Hansen-Estruch, Ching-An Cheng, Mihai Jalobeanu, Andrey Kolobov, Anca Dragan,
Sergey Levine
- Abstract summary: We show a method that taps into joint image- and goal-conditioned policies with language using only a small amount of language data.
Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image, but to the desired change between the start and goal images.
We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data.
- Score: 58.06223121654735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Our goal is for robots to follow natural language instructions like "put the
towel next to the microwave." But getting large amounts of labeled data, i.e.
data that contains demonstrations of tasks labeled with the language
instruction, is prohibitive. In contrast, obtaining policies that respond to
image goals is much easier, because any autonomous trial or demonstration can
be labeled in hindsight with its final state as the goal. In this work, we
contribute a method that taps into joint image- and goal-conditioned policies
with language using only a small amount of language data. Prior work has made
progress on this using vision-language models or by jointly training
language-goal-conditioned policies, but so far neither method has scaled
effectively to real-world robot tasks without significant human annotation. Our
method achieves robust performance in the real world by learning an embedding
from the labeled data that aligns language not to the goal image, but rather to
the desired change between the start and goal images that the instruction
corresponds to. We then train a policy on this embedding: the policy benefits
from all the unlabeled data, but the aligned embedding provides an interface
for language to steer the policy. We show instruction following across a
variety of manipulation tasks in different scenes, with generalization to
language instructions outside of the labeled data. Videos and code for our
approach can be found on our website: https://rail-berkeley.github.io/grif/ .
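As a hedged illustration of the alignment described above, here is a minimal PyTorch sketch (not the released GRIF code): it contrastively aligns an instruction embedding with the embedding of the change between a start and a goal image, so that a policy trained on the transition embedding, which is available for all hindsight-relabeled data, can later be steered through the language side of the same interface. The encoder architectures, embedding size, and temperature are placeholder assumptions.

```python
# Minimal sketch (not the released GRIF code): contrastively align an
# instruction embedding with the embedding of the change between a start
# and a goal image. Architectures, embedding size, and temperature are
# placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 128  # assumed shared embedding size


class TransitionEncoder(nn.Module):
    """Embeds the desired change between a start image and a goal image."""

    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(2 * 64, EMB)

    def forward(self, start_img, goal_img):
        z = torch.cat([self.cnn(start_img), self.cnn(goal_img)], dim=-1)
        return F.normalize(self.head(z), dim=-1)


class LanguageEncoder(nn.Module):
    """Toy instruction encoder (a stand-in for a pretrained text model)."""

    def __init__(self, vocab_size=10000):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, EMB)
        self.head = nn.Linear(EMB, EMB)

    def forward(self, token_ids):
        return F.normalize(self.head(self.emb(token_ids)), dim=-1)


def alignment_loss(lang_z, trans_z, temperature=0.1):
    """CLIP-style symmetric InfoNCE over a batch of language-labeled examples."""
    logits = lang_z @ trans_z.t() / temperature
    targets = torch.arange(len(lang_z), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# The policy consumes the shared task embedding: with unlabeled data it is
# conditioned on TransitionEncoder(start, hindsight_goal); at test time the
# LanguageEncoder output plugs into the same interface to steer it.
```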
Related papers
- RT-H: Action Hierarchies Using Language [36.873648277512864]
Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language.
We show that RT-H builds an action hierarchy using language motions and, conditioned on these and the high-level task, predicts actions using visual context at all stages.
We show that these policies not only allow for responding to language interventions, but can also learn from such interventions and outperform methods that learn from teleoperated interventions.
arXiv Detail & Related papers (2024-03-04T08:16:11Z)
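A toy rendering of the action hierarchy summarized in the RT-H entry above, offered only as a hedged sketch: a first stage predicts an intermediate language motion from the image and the high-level task, and a second stage predicts the low-level action conditioned on that motion as well; supplying a corrected motion stands in for a language intervention. The motion vocabulary, encoders, and sizes are assumptions, not the RT-H model.

```python
# Toy sketch of an action hierarchy with intermediate "language motions"
# (e.g. "move arm forward"). Not RT-H itself; the discrete motion vocabulary,
# encoders, and sizes below are illustrative assumptions.
import torch
import torch.nn as nn

N_MOTIONS, N_TASKS, ACT_DIM, FEAT = 32, 100, 7, 256


class HierarchicalPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = nn.Sequential(  # placeholder visual encoder
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, FEAT))
        self.task_emb = nn.Embedding(N_TASKS, FEAT)         # high-level task
        self.motion_emb = nn.Embedding(N_MOTIONS, FEAT)     # language motion
        self.motion_head = nn.Linear(2 * FEAT, N_MOTIONS)   # image+task -> motion
        self.action_head = nn.Linear(3 * FEAT, ACT_DIM)     # +motion -> action

    def forward(self, image, task_id, motion_id=None):
        # Passing an externally supplied motion_id stands in for a language
        # intervention; otherwise the policy uses its own predicted motion.
        v, t = self.vision(image), self.task_emb(task_id)
        motion_logits = self.motion_head(torch.cat([v, t], dim=-1))
        if motion_id is None:
            motion_id = motion_logits.argmax(dim=-1)
        m = self.motion_emb(motion_id)
        action = self.action_head(torch.cat([v, t, m], dim=-1))
        return motion_logits, action
```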
- Learning to Model the World with Language [100.76069091703505]
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
arXiv Detail & Related papers (2023-07-31T17:57:49Z)
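The prediction objective named in the Dynalang entry above can be pictured with the small sketch below: a shared recurrent state is rolled forward with the current image representation, text token, and action, and is trained to predict the next image representation and the next text token. This is a simplifying sketch under assumed dimensions and losses, not the actual agent.

```python
# Toy sketch of a multimodal world model that predicts future text and image
# representations from the current state and action. Not the Dynalang agent;
# the recurrent cell, losses, and dimensions are simplifying assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

IMG_REP, VOCAB, HID, ACT_DIM = 256, 10000, 512, 8


class MultimodalWorldModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(VOCAB, 64)
        self.rnn = nn.GRUCell(IMG_REP + 64 + ACT_DIM, HID)
        self.next_image = nn.Linear(HID, IMG_REP)  # future image representation
        self.next_token = nn.Linear(HID, VOCAB)    # future text token logits

    def step(self, h, img_rep, token, action):
        x = torch.cat([img_rep, self.text_emb(token), action], dim=-1)
        h = self.rnn(x, h)
        return h, self.next_image(h), self.next_token(h)


def prediction_loss(pred_img, true_img, token_logits, true_token):
    return (F.mse_loss(pred_img, true_img)
            + F.cross_entropy(token_logits, true_token))
```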
- Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models [70.82705830137708]
We introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL).
We utilize semi-supervised language labels, leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data.
DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
arXiv Detail & Related papers (2022-11-21T18:56:00Z)
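The CLIP-based propagation step summarized in the DIAL entry above can be pictured roughly as below. The sketch uses the open-source clip package, a hypothetical candidate-instruction list, and the simplifying assumption that each unlabeled trajectory is scored only by its final frame; it is not the DIAL pipeline.

```python
# Rough sketch of CLIP-based instruction propagation: score a small set of
# candidate instructions against an unlabeled trajectory and keep the best
# match as a pseudo-label. Not the DIAL pipeline; the candidate list, the
# use of only the final frame, and the threshold are illustrative assumptions.
import torch
import clip  # OpenAI CLIP package: pip install git+https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

candidate_instructions = [          # hypothetical label set drawn from the
    "pick up the towel",            # small human-annotated portion of the data
    "put the towel next to the microwave",
    "open the drawer",
]
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(candidate_instructions).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)


def pseudo_label(final_frame_pil, threshold=0.25):
    """Assign the best-matching instruction to an unlabeled trajectory,
    scored here against only its final frame (a simplifying assumption)."""
    image = preprocess(final_frame_pil).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ text_feat.T).squeeze(0)
    score, idx = sims.max(dim=0)
    return candidate_instructions[idx.item()] if score.item() > threshold else None

# A language-conditioned policy can then be trained on trajectories whose
# pseudo-labels clear the threshold, alongside the originally labeled data.
```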
- Grounding Language with Visual Affordances over Unstructured Data [26.92329260907805]
We propose a novel approach to efficiently learn language-conditioned robot skills from unstructured, offline and reset-free data.
We exploit a self-supervised visuo-lingual affordance model, which requires language annotations for as little as 1% of the total data.
We find that our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches.
arXiv Detail & Related papers (2022-10-04T21:16:48Z)
- Modular Framework for Visuomotor Language Grounding [57.93906820466519]
Natural language instruction following tasks serve as a valuable test-bed for grounded language and robotics research.
We propose the structuring of language, acting, and visual tasks into separate modules that can be trained independently.
arXiv Detail & Related papers (2021-09-05T20:11:53Z)
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language-conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
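The token-to-image mapping described in the vokenization entry above can be illustrated with a small retrieval sketch: each contextualized token is assigned its nearest image, its voken, in an assumed shared embedding space, and the resulting indices provide an extra visual prediction target during language model training. The embeddings below are random stand-ins, not the released vokenization model.

```python
# Illustrative sketch of voken retrieval: assign each contextualized token its
# nearest image ("voken") in an assumed shared embedding space; the resulting
# voken indices can serve as an auxiliary, visually-grounded prediction target
# during language model training. The embeddings here are random stand-ins.
import torch
import torch.nn.functional as F


def retrieve_vokens(token_embs, image_embs):
    """token_embs: (seq_len, d) contextual token embeddings (shared space).
    image_embs: (num_images, d) image embeddings in the same space.
    Returns one image index (voken) per token via cosine similarity."""
    t = F.normalize(token_embs, dim=-1)
    v = F.normalize(image_embs, dim=-1)
    return (t @ v.t()).argmax(dim=-1)  # (seq_len,) voken ids


# Example with random placeholder embeddings:
vokens = retrieve_vokens(torch.randn(12, 256), torch.randn(5000, 256))
```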
- Language Conditioned Imitation Learning over Unstructured Data [9.69886122332044]
We present a method for incorporating free-form natural language conditioning into imitation learning.
Our approach learns perception from pixels, natural language understanding, and multitask continuous control end-to-end as a single neural network.
We show this dramatically improves language conditioned performance, while reducing the cost of language annotation to less than 1% of total data.
arXiv Detail & Related papers (2020-05-15T17:08:50Z)