CALVIN: A Benchmark for Language-conditioned Policy Learning for
Long-horizon Robot Manipulation Tasks
- URL: http://arxiv.org/abs/2112.03227v2
- Date: Wed, 8 Dec 2021 10:04:13 GMT
- Title: CALVIN: A Benchmark for Language-conditioned Policy Learning for
Long-horizon Robot Manipulation Tasks
- Authors: Oier Mees, Lukas Hermann, Erick Rosete-Beas, Wolfram Burgard
- Abstract summary: General-purpose robots must learn to relate human language to their perceptions and actions.
We present CALVIN, an open-source simulated benchmark to learn long-horizon language-conditioned tasks.
- Score: 30.936692970187416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: General-purpose robots coexisting with humans in their environment must learn
to relate human language to their perceptions and actions to be useful in a
range of daily tasks. Moreover, they need to acquire a diverse repertoire of
general-purpose skills that allow composing long-horizon tasks by following
unconstrained language instructions. In this paper, we present CALVIN
(Composing Actions from Language and Vision), an open-source simulated
benchmark to learn long-horizon language-conditioned tasks. Our aim is to make
it possible to develop agents that can solve many robotic manipulation tasks
over a long horizon, from onboard sensors, and specified only via human
language. CALVIN tasks are more complex in terms of sequence length, action
space, and language than existing vision-and-language task datasets and
supports flexible specification of sensor suites. We evaluate the agents in
zero-shot to novel language instructions and to novel environments and objects.
We show that a baseline model based on multi-context imitation learning
performs poorly on CALVIN, suggesting that there is significant room for
developing innovative agents that learn to relate human language to their world
models with this benchmark.
Related papers
- Rethinking Mutual Information for Language Conditioned Skill Discovery
on Imitation Learning [36.624923972563415]
We propose an end-to-end imitation learning approach known as Language Conditioned Skill Discovery (LCSD)
We utilize vector quantization to learn discrete latent skills and leverage skill sequences of trajectories to reconstruct high-level semantic instructions.
Our approach exhibits enhanced generalization capabilities towards unseen tasks, improved skill interpretability, and notably higher rates of task completion success.
arXiv Detail & Related papers (2024-02-27T13:53:52Z) - Simple Embodied Language Learning as a Byproduct of Meta-Reinforcement
Learning [56.07190845063208]
We ask: can embodied reinforcement learning (RL) agents indirectly learn language from non-language tasks?
We design an office navigation environment, where the agent's goal is to find a particular office, and office locations differ in different buildings (i.e., tasks)
We find RL agents indeed are able to indirectly learn language. Agents trained with current meta-RL algorithms successfully generalize to reading floor plans with held-out layouts and language phrases.
arXiv Detail & Related papers (2023-06-14T09:48:48Z) - Accessible Instruction-Following Agent [0.0]
We introduce UVLN, a novel machine-translation instructional augmented framework for cross-lingual vision-language navigation.
We extend the standard VLN training objectives to a multilingual setting via a cross-lingual language encoder.
Experiments over Room Across Room dataset prove the effectiveness of our approach.
arXiv Detail & Related papers (2023-05-08T23:57:26Z) - ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous
States in Realistic 3D Scenes [72.83187997344406]
ARNOLD is a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes.
ARNOLD is comprised of 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals.
arXiv Detail & Related papers (2023-04-09T21:42:57Z) - Language-Driven Representation Learning for Robotics [115.93273609767145]
Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks.
We introduce a framework for language-driven representation learning from human videos and captions.
We find that Voltron's language-driven learning outperform the prior-of-the-art, especially on targeted problems requiring higher-level control.
arXiv Detail & Related papers (2023-02-24T17:29:31Z) - Robotic Skill Acquisition via Instruction Augmentation with
Vision-Language Models [70.82705830137708]
We introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL)
We utilize semi-language labels leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data.
DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
arXiv Detail & Related papers (2022-11-21T18:56:00Z) - Grounding Language with Visual Affordances over Unstructured Data [26.92329260907805]
We propose a novel approach to efficiently learn language-conditioned robot skills from unstructured, offline and reset-free data.
We exploit a self-supervised visuo-lingual affordance model, which requires as little as 1% of the total data with language.
We find that our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches.
arXiv Detail & Related papers (2022-10-04T21:16:48Z) - What Matters in Language Conditioned Robotic Imitation Learning [26.92329260907805]
We study the most critical challenges in learning language conditioned policies from offline free-form imitation datasets.
We present a novel approach that significantly outperforms the state of the art on the challenging language conditioned long-horizon robot manipulation CALVIN benchmark.
arXiv Detail & Related papers (2022-04-13T08:45:32Z) - Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [119.29555551279155]
Large language models can encode a wealth of semantic knowledge about the world.
Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language.
We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions.
arXiv Detail & Related papers (2022-04-04T17:57:11Z) - Learning Language-Conditioned Robot Behavior from Offline Data and
Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.