LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language,
Vision, and Action
- URL: http://arxiv.org/abs/2207.04429v1
- Date: Sun, 10 Jul 2022 10:41:50 GMT
- Title: LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language,
Vision, and Action
- Authors: Dhruv Shah, Blazej Osinski, Brian Ichter, Sergey Levine
- Abstract summary: We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
- Score: 76.71101507291473
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Goal-conditioned policies for robotic navigation can be trained on large,
unannotated datasets, providing for good generalization to real-world settings.
However, particularly in vision-based settings where specifying goals requires
an image, this makes for an unnatural interface. Language provides a more
convenient modality for communication with robots, but contemporary methods
typically require expensive supervision, in the form of trajectories annotated
with language descriptions. We present a system, LM-Nav, for robotic navigation
that enjoys the benefits of training on unannotated large datasets of
trajectories, while still providing a high-level interface to the user. Instead
of utilizing a labeled instruction following dataset, we show that such a
system can be constructed entirely out of pre-trained models for navigation
(ViNG), image-language association (CLIP), and language modeling (GPT-3),
without requiring any fine-tuning or language-annotated robot data. We
instantiate LM-Nav on a real-world mobile robot and demonstrate long-horizon
navigation through complex, outdoor environments from natural language
instructions. For videos of our experiments, code release, and an interactive
Colab notebook that runs in your browser, please check out our project page
https://sites.google.com/view/lmnav
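
The abstract describes LM-Nav as a composition of three pre-trained models rather than a single fine-tuned policy: a language model parses the instruction into landmarks, an image-language model grounds those landmarks in the robot's prior observations, and a navigation model supplies a topological graph with traversal costs. The sketch below illustrates one way such a composition could be wired together; the function names, dummy scores, and the simple dynamic-programming search are illustrative assumptions for exposition, not the authors' released code.

```python
"""Illustrative sketch of composing pre-trained models for language-guided
navigation, in the spirit of the LM-Nav abstract. The model interfaces and
the scoring/search details are assumptions, not the authors' implementation."""

import networkx as nx


def extract_landmarks(instruction: str) -> list[str]:
    """Stand-in for a large language model (e.g. GPT-3) prompted to parse a
    free-form instruction into an ordered list of landmark phrases."""
    # A real system would query the LLM here; we return a fixed example.
    return ["a stop sign", "a picnic table", "a white building"]


def landmark_node_scores(landmarks, nodes):
    """Stand-in for an image-language model (e.g. CLIP) scoring how well the
    image stored at each graph node matches each landmark phrase.
    Returns scores[(i, n)] = log-score for landmark i at node n."""
    # Dummy scores; a real system would compare text and image embeddings.
    return {(i, n): -1.0 for i in range(len(landmarks)) for n in nodes}


def plan_walk(graph, start, landmarks, scores, alpha=1.0):
    """Dynamic program over (landmark index, node): choose nodes that ground
    the landmarks in order while keeping traversal cost low, using edge costs
    that a navigation model (e.g. ViNG) would estimate."""
    dist = dict(nx.all_pairs_dijkstra_path_length(graph, weight="cost"))
    best, parent = {}, {}
    for n in graph.nodes:
        best[(0, n)] = scores[(0, n)] - alpha * dist[start][n]
    for i in range(1, len(landmarks)):
        for n in graph.nodes:
            cand = [(best[(i - 1, m)] + scores[(i, n)] - alpha * dist[m][n], m)
                    for m in graph.nodes]
            best[(i, n)], parent[(i, n)] = max(cand)
    # Backtrack to recover the node assigned to each landmark, in order.
    i = len(landmarks) - 1
    n = max(graph.nodes, key=lambda m: best[(i, m)])
    walk = [n]
    while i > 0:
        n = parent[(i, n)]
        walk.append(n)
        i -= 1
    return list(reversed(walk))


if __name__ == "__main__":
    # Toy topological graph; in LM-Nav the nodes and edge costs would come
    # from the robot's prior experience and the navigation model.
    g = nx.Graph()
    g.add_weighted_edges_from(
        [("A", "B", 1.0), ("B", "C", 2.0), ("C", "D", 1.5)], weight="cost")
    landmarks = extract_landmarks(
        "Go past the stop sign, then the picnic table, and stop at the white building.")
    scores = landmark_node_scores(landmarks, list(g.nodes))
    print(plan_walk(g, start="A", landmarks=landmarks, scores=scores))
```

The trade-off in this style of composition, as the abstract emphasizes, is that no component is fine-tuned and no language-annotated robot data is needed; only the interfaces between the pre-trained models have to be designed.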
Related papers
- FASTNav: Fine-tuned Adaptive Small-language-models Trained for Multi-point Robot Navigation [10.3997505825422]
We propose FASTNav, a method for boosting small language models (SLMs) for robot navigation.
We train and evaluate models with FASTNav in both simulation and on real robots, showing that they can be deployed with low cost, high accuracy, and low response time.
arXiv Detail & Related papers (2024-11-20T12:28:13Z)
- Vision and Language Navigation in the Real World via Online Visual Language Mapping [18.769171505280127]
Vision-and-language navigation (VLN) methods are mainly evaluated in simulation.
We propose a novel framework to address the VLN task in the real world.
We evaluate the proposed pipeline on an Interbotix LoCoBot WX250 in an unseen lab environment.
arXiv Detail & Related papers (2023-10-16T20:44:09Z)
- Interactive Navigation in Environments with Traversable Obstacles Using Large Language and Vision-Language Models [14.871309526022516]
This paper proposes an interactive navigation framework by using large language and vision-language models.
We create an action-aware costmap to perform effective path planning without fine-tuning.
All experimental results demonstrated the proposed framework's effectiveness and adaptability to diverse environments.
arXiv Detail & Related papers (2023-10-13T05:59:03Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models [89.64729024399634]
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions.
Normally, the instructions have complex grammatical structures and often contain various action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
arXiv Detail & Related papers (2023-08-15T19:01:19Z)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- Open-World Object Manipulation using Pre-trained Vision-Language Models [72.87306011500084]
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary to their sensory observations and actions.
We develop a simple approach, which leverages a pre-trained vision-language model to extract object-identifying information.
In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments.
arXiv Detail & Related papers (2023-03-02T01:55:10Z)
- GNM: A General Navigation Model to Drive Any Robot [67.40225397212717]
A general goal-conditioned model for vision-based navigation can be trained on data obtained from many distinct but structurally similar robots.
We analyze the necessary design decisions for effective data sharing across robots.
We deploy the trained GNM on a range of new robots, including an underactuated quadrotor.
arXiv Detail & Related papers (2022-10-07T07:26:41Z)
- Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers [33.7939079214046]
We provide a flexible language-based interface for human-robot collaboration.
We take advantage of recent advancements in the field of large language models to encode the user command.
We train the model using imitation learning over a dataset containing robot trajectories modified by language commands.
arXiv Detail & Related papers (2022-03-25T01:36:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.