LUMOS: Language-Conditioned Imitation Learning with World Models
- URL: http://arxiv.org/abs/2503.10370v1
- Date: Thu, 13 Mar 2025 13:48:24 GMT
- Title: LUMOS: Language-Conditioned Imitation Learning with World Models
- Authors: Iman Nematollahi, Branton DeMoss, Akshay L Chandra, Nick Hawes, Wolfram Burgard, Ingmar Posner,
- Abstract summary: We introduce LUMOS, a language-conditioned multi-task imitation learning framework for robotics.<n>LUMOS learns skills by practicing them over many long-horizon rollouts in the latent space of a learned world model.<n>We are the first to learn a language-conditioned continuous visuomotor control for a real-world robot within an offline world model.
- Score: 31.827127896338336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce LUMOS, a language-conditioned multi-task imitation learning framework for robotics. LUMOS learns skills by practicing them over many long-horizon rollouts in the latent space of a learned world model and transfers these skills zero-shot to a real robot. By learning on-policy in the latent space of the learned world model, our algorithm mitigates policy-induced distribution shift which most offline imitation learning methods suffer from. LUMOS learns from unstructured play data with fewer than 1% hindsight language annotations but is steerable with language commands at test time. We achieve this coherent long-horizon performance by combining latent planning with both image- and language-based hindsight goal relabeling during training, and by optimizing an intrinsic reward defined in the latent space of the world model over multiple time steps, effectively reducing covariate shift. In experiments on the difficult long-horizon CALVIN benchmark, LUMOS outperforms prior learning-based methods with comparable approaches on chained multi-task evaluations. To the best of our knowledge, we are the first to learn a language-conditioned continuous visuomotor control for a real-world robot within an offline world model. Videos, dataset and code are available at http://lumos.cs.uni-freiburg.de.
Related papers
- Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics [25.2461925479135]
Video-Language Critic is a reward model that can be trained on readily available cross-embodiment data.
Our model enables 2x more sample-efficient policy training on Meta-World tasks than a sparse reward only.
arXiv Detail & Related papers (2024-05-30T12:18:06Z) - From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control [58.72492647570062]
We introduce our method -- Learnable Latent Codes as Bridges (LCB) -- as an alternate architecture to overcome limitations.
We find that methodoutperforms baselines that leverage pure language as the interface layer on tasks that require reasoning and multi-step behaviors.
arXiv Detail & Related papers (2024-05-08T04:14:06Z) - Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks [50.27313829438866]
Plan-Seq-Learn (PSL) is a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control.
PSL achieves success rates of over 85%, out-performing language-based, classical, and end-to-end approaches.
arXiv Detail & Related papers (2024-05-02T17:59:31Z) - Large Language Models for Orchestrating Bimanual Robots [19.60907949776435]
We present LAnguage-model-based Bimanual ORchestration (LABOR) to analyze task configurations and devise coordination control policies.
We evaluate our method through simulated experiments involving two classes of long-horizon tasks using the NICOL humanoid robot.
arXiv Detail & Related papers (2024-04-02T15:08:35Z) - ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous
States in Realistic 3D Scenes [72.83187997344406]
ARNOLD is a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes.
ARNOLD is comprised of 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals.
arXiv Detail & Related papers (2023-04-09T21:42:57Z) - DITTO: Offline Imitation Learning with World Models [21.419536711242962]
DITTO is an offline imitation learning algorithm which addresses all three of these problems.
We optimize this multi-step latent divergence using standard reinforcement learning algorithms.
Our results show how creative use of world models can lead to a simple, robust, and highly-performant policy-learning framework.
arXiv Detail & Related papers (2023-02-06T19:41:18Z) - Grounding Language with Visual Affordances over Unstructured Data [26.92329260907805]
We propose a novel approach to efficiently learn language-conditioned robot skills from unstructured, offline and reset-free data.
We exploit a self-supervised visuo-lingual affordance model, which requires as little as 1% of the total data with language.
We find that our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches.
arXiv Detail & Related papers (2022-10-04T21:16:48Z) - Multi-level Contrastive Learning for Cross-lingual Spoken Language
Understanding [90.87454350016121]
We develop novel code-switching schemes to generate hard negative examples for contrastive learning at all levels.
We develop a label-aware joint model to leverage label semantics for cross-lingual knowledge transfer.
arXiv Detail & Related papers (2022-05-07T13:44:28Z) - What Matters in Language Conditioned Robotic Imitation Learning [26.92329260907805]
We study the most critical challenges in learning language conditioned policies from offline free-form imitation datasets.
We present a novel approach that significantly outperforms the state of the art on the challenging language conditioned long-horizon robot manipulation CALVIN benchmark.
arXiv Detail & Related papers (2022-04-13T08:45:32Z) - CALVIN: A Benchmark for Language-conditioned Policy Learning for
Long-horizon Robot Manipulation Tasks [30.936692970187416]
General-purpose robots must learn to relate human language to their perceptions and actions.
We present CALVIN, an open-source simulated benchmark to learn long-horizon language-conditioned tasks.
arXiv Detail & Related papers (2021-12-06T18:37:33Z) - LILA: Language-Informed Latent Actions [72.033770901278]
We introduce Language-Informed Latent Actions (LILA), a framework for learning natural language interfaces in the context of human-robot collaboration.
LILA learns to use language to modulate a low-dimensional controller, providing users with a language-informed control space.
We show that LILA models are not only more sample efficient and performant than imitation learning and end-effector control baselines, but that they are also qualitatively preferred by users.
arXiv Detail & Related papers (2021-11-05T00:56:00Z) - Multi-View Learning for Vision-and-Language Navigation [163.20410080001324]
Learn from EveryOne (LEO) is a training paradigm for learning to navigate in a visual environment.
By sharing parameters across instructions, our approach learns more effectively from limited training data.
On the recent Room-to-Room (R2R) benchmark dataset, LEO achieves 16% improvement (absolute) over a greedy agent.
arXiv Detail & Related papers (2020-03-02T13:07:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.