OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav
- URL: http://arxiv.org/abs/2303.07798v1
- Date: Tue, 14 Mar 2023 11:15:37 GMT
- Title: OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav
- Authors: Karmesh Yadav, Arjun Majumdar, Ram Ramrakhya, Naoki Yokoyama, Alexei
Baevski, Zsolt Kira, Oleksandr Maksymets, Dhruv Batra
- Abstract summary: We present a single neural network architecture that achieves state-of-art results on both the ImageNav and ObjectNav tasks.
Such general-purpose methods offer advantages of simplicity in design, positive scaling with available compute, and versatile applicability to multiple tasks.
- Score: 62.32806118504701
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a single neural network architecture composed of task-agnostic
components (ViTs, convolutions, and LSTMs) that achieves state-of-art results
on both the ImageNav ("go to location in <this picture>") and ObjectNav ("find
a chair") tasks without any task-specific modules like object detection,
segmentation, mapping, or planning modules. Such general-purpose methods offer
advantages of simplicity in design, positive scaling with available compute,
and versatile applicability to multiple tasks. Our work builds upon the recent
success of self-supervised learning (SSL) for pre-training vision transformers
(ViT). However, while the training recipes for convolutional networks are
mature and robust, the recipes for ViTs are contingent and brittle, and in the
case of ViTs for visual navigation, yet to be fully discovered. Specifically,
we find that vanilla ViTs do not outperform ResNets on visual navigation. We
propose the use of a compression layer operating over ViT patch representations
to preserve spatial information along with policy training improvements. These
improvements allow us to demonstrate positive scaling laws for the first time
in visual navigation tasks. Consequently, our model advances state-of-the-art
performance on ImageNav from 54.2% to 82.0% success and performs competitively
against concurrent state-of-art on ObjectNav with a success rate of 64.0% vs.
65.0%. Overall, this work does not present a fundamentally new approach, but
rather recommendations for training a general-purpose architecture that
achieves state-of-art performance today and could serve as a strong baseline
for future methods.
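As a concrete illustration, below is a minimal PyTorch sketch of the architecture the abstract describes: a ViT backbone whose patch tokens pass through a compression layer that preserves their 2D spatial layout before being fed, together with a goal embedding, to an LSTM policy. The module names (CompressionHead, NavPolicy), layer sizes, patch-grid shape, and goal encoding are illustrative assumptions rather than the authors' exact configuration.

# Minimal sketch of a compression layer over ViT patch tokens plus an LSTM policy.
# Layer sizes and module names are assumptions for illustration, not the paper's exact setup.
import torch
import torch.nn as nn

class CompressionHead(nn.Module):
    """Compress ViT patch tokens with a small conv over the 2D patch grid,
    keeping spatial information instead of using only a pooled/[CLS] token."""
    def __init__(self, embed_dim=768, grid=14, out_channels=128, out_dim=512):
        super().__init__()
        self.grid = grid
        self.conv = nn.Sequential(
            nn.Conv2d(embed_dim, out_channels, kernel_size=3, padding=1),
            nn.GroupNorm(8, out_channels),
            nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(out_channels * grid * grid, out_dim)

    def forward(self, patch_tokens):  # (B, N, D) with N = grid * grid
        B, N, D = patch_tokens.shape
        x = patch_tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        x = self.conv(x)              # (B, C, grid, grid), spatial layout preserved
        return self.fc(x.flatten(1))  # (B, out_dim)

class NavPolicy(nn.Module):
    """Task-agnostic policy: visual features + goal embedding -> LSTM -> action logits."""
    def __init__(self, visual_dim=512, goal_dim=512, hidden_dim=512, num_actions=6):
        super().__init__()
        self.lstm = nn.LSTM(visual_dim + goal_dim, hidden_dim, batch_first=True)
        self.actor = nn.Linear(hidden_dim, num_actions)

    def forward(self, visual_feats, goal_feats, hidden=None):
        x = torch.cat([visual_feats, goal_feats], dim=-1).unsqueeze(1)  # (B, 1, F)
        out, hidden = self.lstm(x, hidden)
        return self.actor(out.squeeze(1)), hidden

# Example forward pass with dummy patch tokens from a ViT-B/16 at 224x224
# (14x14 patches, 768-dim embeddings) and a dummy 512-dim goal embedding.
tokens = torch.randn(2, 14 * 14, 768)
goal = torch.randn(2, 512)
head, policy = CompressionHead(), NavPolicy()
logits, _ = policy(head(tokens), goal)
print(logits.shape)  # torch.Size([2, 6])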
Related papers
- An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features.
In this paper, we question whether the fine-tuning performance of extremely simple lightweight ViTs can also benefit from this pre-training paradigm.
Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M parameters) achieves 79.4%/78.9% top-1 accuracy on ImageNet-1k.
arXiv Detail & Related papers (2024-04-18T14:14:44Z) - Learning Navigational Visual Representations with Semantic Map
Supervision [85.91625020847358]
We propose a navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps.
Ego²-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z) - Offline Visual Representation Learning for Embodied Navigation [50.442660137987275]
The method consists of offline pretraining of visual representations with self-supervised learning (SSL), followed by online finetuning of visuomotor representations on specific tasks with image augmentations under long learning schedules.
arXiv Detail & Related papers (2022-04-27T23:22:43Z) - DeiT III: Revenge of the ViT [56.46810490275699]
A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BeiT.
arXiv Detail & Related papers (2022-04-14T17:13:44Z) - A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performances in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method prunes all components of a ViT while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z) - Simple but Effective: CLIP Embeddings for Embodied AI [38.02562593292301]
Contrastive language image pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks.
We build incredibly simple baselines, named EmbCLIP, with no task-specific architectures.
We find that our improved baselines perform very well across a range of tasks and simulators.
arXiv Detail & Related papers (2021-11-18T18:59:59Z) - Auxiliary Tasks and Exploration Enable ObjectNav [48.314102158070874]
We re-enable a generic learned agent by adding auxiliary learning tasks and an exploration reward.
Our agents achieve 24.5% success and 8.1% SPL, a 37% and 8% relative improvement over prior state-of-the-art, respectively.
arXiv Detail & Related papers (2021-04-08T23:03:21Z) - Towards Learning a Generic Agent for Vision-and-Language Navigation via
Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better in a previously unseen environment.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)