NaVILA: Legged Robot Vision-Language-Action Model for Navigation
- URL: http://arxiv.org/abs/2412.04453v2
- Date: Mon, 17 Feb 2025 18:27:27 GMT
- Title: NaVILA: Legged Robot Vision-Language-Action Model for Navigation
- Authors: An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, Xiaolong Wang,
- Abstract summary: It is non-trivial to translate human language instructions all the way to low-level leg joint actions.
We propose NaVILA, a 2-level framework that unifies a Vision-Language-Action model (VLA) with locomotion skills.
NaVILA substantially improves previous approaches on existing benchmarks.
- Score: 60.00462044102051
- License:
- Abstract: This paper proposes to solve the problem of Vision-and-Language Navigation with legged robots, which not only provides a flexible way for humans to command but also allows the robot to navigate through more challenging and cluttered scenes. However, it is non-trivial to translate human language instructions all the way to low-level leg joint actions. We propose NaVILA, a 2-level framework that unifies a Vision-Language-Action model (VLA) with locomotion skills. Instead of directly predicting low-level actions from VLA, NaVILA first generates mid-level actions with spatial information in the form of language, (e.g., "moving forward 75cm"), which serves as an input for a visual locomotion RL policy for execution. NaVILA substantially improves previous approaches on existing benchmarks. The same advantages are demonstrated in our newly developed benchmarks with IsaacLab, featuring more realistic scenes, low-level controls, and real-world robot experiments. We show more results at https://navila-bot.github.io/
Related papers
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z) - RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics [46.63773228934993]
We introduce an automatic synthetic data generation pipeline that instruction-tunes vision language models (VLMs) to robotic domains and needs.
Using the pipeline, we train RoboPoint, a VLM that predicts image keypoint affordances given language instructions.
Our experiments demonstrate that RoboPoint outperforms state-of-the-art VLMs by 21.8% in the accuracy of predicting spatial affordance and by 30.5% in the success rate of downstream tasks.
arXiv Detail & Related papers (2024-06-15T19:22:51Z) - Yell At Your Robot: Improving On-the-Fly from Language Corrections [84.09578841663195]
We show that high-level policies can be readily supervised with human feedback in the form of language corrections.
This framework enables robots not only to rapidly adapt to real-time language feedback, but also incorporate this feedback into an iterative training scheme.
arXiv Detail & Related papers (2024-03-19T17:08:24Z) - NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation [23.72290930234063]
NaVid is a video-based large vision language model (VLM) for vision-and-language navigation.
NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer.
arXiv Detail & Related papers (2024-02-24T16:39:16Z) - LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN)
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z) - RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic
Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z) - ViNL: Visual Navigation and Locomotion Over Obstacles [36.46953494419389]
We present Visual Navigation and Locomotion over obstacles (ViNL)
It enables a quadrupedal robot to navigate unseen apartments while stepping over small obstacles that lie in its path.
ViNL consists of: (1) a visual navigation policy that outputs linear and angular velocity commands that guides the robot to a goal coordinate in unfamiliar indoor environments; and (2) a visual locomotion policy that controls the robot's joints to avoid stepping on obstacles while following provided velocity commands.
arXiv Detail & Related papers (2022-10-26T15:38:28Z) - LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language,
Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.