ViNT: A Foundation Model for Visual Navigation
- URL: http://arxiv.org/abs/2306.14846v2
- Date: Tue, 24 Oct 2023 06:16:43 GMT
- Title: ViNT: A Foundation Model for Visual Navigation
- Authors: Dhruv Shah, Ajay Sridhar, Nitish Dashora, Kyle Stachowicz, Kevin
Black, Noriaki Hirose, Sergey Levine
- Abstract summary: Visual Navigation Transformer (ViNT) is a foundation model for vision-based robotic navigation.
ViNT is trained with a general goal-reaching objective that can be used with any navigation dataset.
It exhibits positive transfer, outperforming specialist models trained on singular datasets.
- Score: 52.2571739391896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: General-purpose pre-trained models ("foundation models") have enabled
practitioners to produce generalizable solutions for individual machine
learning problems with datasets that are significantly smaller than those
required for learning from scratch. Such models are typically trained on large
and diverse datasets with weak supervision, consuming much more training data
than is available for any individual downstream application. In this paper, we
describe the Visual Navigation Transformer (ViNT), a foundation model that aims
to bring the success of general-purpose pre-trained models to vision-based
robotic navigation. ViNT is trained with a general goal-reaching objective that
can be used with any navigation dataset, and employs a flexible
Transformer-based architecture to learn navigational affordances and enable
efficient adaptation to a variety of downstream navigational tasks. ViNT is
trained on a number of existing navigation datasets, comprising hundreds of
hours of robotic navigation from a variety of different robotic platforms, and
exhibits positive transfer, outperforming specialist models trained on singular
datasets. ViNT can be augmented with diffusion-based subgoal proposals to
explore novel environments, and can solve kilometer-scale navigation problems
when equipped with long-range heuristics. ViNT can also be adapted to novel
task specifications with a technique inspired by prompt-tuning, where the goal
encoder is replaced by an encoding of another task modality (e.g., GPS
waypoints or routing commands) embedded into the same space of goal tokens.
This flexibility and ability to accommodate a variety of downstream problem
domains establishes ViNT as an effective foundation model for mobile robotics.
For videos, code, and model checkpoints, see our project page at
https://visualnav-transformer.github.io.
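The abstract outlines the model at a high level: encoded observations and a goal encoding are treated as tokens by a Transformer policy trained with a goal-reaching objective, and the goal encoder can be swapped for an encoder of another modality. The sketch below is a minimal illustration of that token layout in PyTorch, not the released implementation: the class name, the simple CNN encoders, and all layer sizes are assumptions made for brevity (the actual ViNT uses EfficientNet backbones and predicts temporal distance to the goal plus future waypoints; see the project page for the real code).

```python
# Minimal sketch of a ViNT-style goal-conditioned Transformer policy.
# All names and sizes here are illustrative assumptions, not the released model.
import torch
import torch.nn as nn


class GoalConditionedTransformerPolicy(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=4,
                 context_len=5, horizon=5):
        super().__init__()
        # Placeholder CNN observation encoder (ViNT itself uses EfficientNet).
        self.obs_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d_model))
        # Goal fusion encoder: embeds the (current observation, goal image)
        # pair into the same token space as the observation tokens.
        self.goal_encoder = nn.Sequential(
            nn.Conv2d(6, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d_model))
        self.pos_emb = nn.Parameter(torch.zeros(context_len + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        # Simplified stand-in heads: temporal distance to goal and a short
        # sequence of future (x, y) waypoints.
        self.dist_head = nn.Linear(d_model, 1)
        self.action_head = nn.Linear(d_model, horizon * 2)

    def forward(self, obs_seq, goal_img):
        # obs_seq: (B, T, 3, H, W) past+current frames; goal_img: (B, 3, H, W)
        B, T = obs_seq.shape[:2]
        obs_tokens = self.obs_encoder(
            obs_seq.flatten(0, 1)).view(B, T, -1)            # (B, T, d_model)
        goal_token = self.goal_encoder(
            torch.cat([obs_seq[:, -1], goal_img], dim=1)).unsqueeze(1)
        tokens = torch.cat([obs_tokens, goal_token], dim=1) + self.pos_emb
        ctx = self.transformer(tokens).mean(dim=1)           # pooled context
        return self.dist_head(ctx), self.action_head(ctx).view(B, -1, 2)
```

To mirror the prompt-tuning-style adaptation described above, one would replace goal_encoder with a small encoder over another task modality (e.g., GPS waypoints or routing commands) that emits a token in the same d_model space, while keeping the pretrained Transformer largely frozen.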
Related papers
- Vision-and-Language Navigation Generative Pretrained Transformer [0.0]
Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT) adopts a transformer decoder model (GPT-2) to model trajectory sequence dependencies, bypassing the need for historical encoding modules.
Performance assessments on the VLN dataset reveal that VLN-GPT surpasses complex state-of-the-art encoder-based models.
arXiv Detail & Related papers (2024-05-27T09:42:04Z)
- VANP: Learning Where to See for Navigation with Self-Supervised Vision-Action Pre-Training [8.479135285935113]
Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation.
Most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects.
We propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP).
arXiv Detail & Related papers (2024-03-12T22:33:08Z)
- NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration [57.15811390835294]
This paper describes how we can train a single unified diffusion policy to handle both goal-directed navigation and goal-agnostic exploration.
We show that this unified policy results in better overall performance when navigating to visually indicated goals in novel environments.
Our experiments, conducted on a real-world mobile robot platform, show effective navigation in unseen environments in comparison with five alternative methods.
arXiv Detail & Related papers (2023-10-11T21:07:14Z)
- GNM: A General Navigation Model to Drive Any Robot [67.40225397212717]
A general goal-conditioned model for vision-based navigation can be trained on data obtained from many distinct but structurally similar robots.
We analyze the necessary design decisions for effective data sharing across robots.
We deploy the trained GNM on a range of new robots, including an underactuated quadrotor.
arXiv Detail & Related papers (2022-10-07T07:26:41Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
- Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation [97.17517060585875]
We present a unified approach to visual navigation using a novel modular transfer learning model.
Our model can effectively leverage its experience from one source task and apply it to multiple target tasks.
Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
arXiv Detail & Related papers (2022-02-05T00:07:21Z)
- Polyline Based Generative Navigable Space Segmentation for Autonomous Visual Navigation [57.3062528453841]
We propose a representation-learning-based framework to enable robots to learn the navigable space segmentation in an unsupervised manner.
We show that the proposed PSV-Nets can learn the visual navigable space with high accuracy, even without a single label.
arXiv Detail & Related papers (2021-10-29T19:50:48Z)