Offline Visual Representation Learning for Embodied Navigation
- URL: http://arxiv.org/abs/2204.13226v1
- Date: Wed, 27 Apr 2022 23:22:43 GMT
- Title: Offline Visual Representation Learning for Embodied Navigation
- Authors: Karmesh Yadav, Ram Ramrakhya, Arjun Majumdar, Vincent-Pierre Berges,
Sachit Kuhar, Dhruv Batra, Alexei Baevski, Oleksandr Maksymets
- Abstract summary: a 2-stage strategy combining (1) offline pretraining of visual representations with self-supervised learning (SSL) and (2) online finetuning of visuomotor representations on specific tasks with image augmentations under long learning schedules.
- Score: 50.442660137987275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How should we learn visual representations for embodied agents that must see
and move? The status quo is tabula rasa in vivo, i.e. learning visual
representations from scratch while also learning to move, potentially augmented
with auxiliary tasks (e.g. predicting the action taken between two successive
observations). In this paper, we show that an alternative 2-stage strategy is
far more effective: (1) offline pretraining of visual representations with
self-supervised learning (SSL) using large-scale pre-rendered images of indoor
environments (Omnidata), and (2) online finetuning of visuomotor
representations on specific tasks with image augmentations under long learning
schedules. We call this method Offline Visual Representation Learning (OVRL).
We conduct large-scale experiments - on 3 different 3D datasets (Gibson, HM3D,
MP3D), 2 tasks (ImageNav, ObjectNav), and 2 policy learning algorithms (RL, IL)
- and find that the OVRL representations lead to significant across-the-board
improvements in the state of the art: on ImageNav from 29.2% to 54.2% (+25% absolute,
86% relative) and on ObjectNav from 18.1% to 23.2% (+5.1% absolute, 28%
relative). Importantly, both results were achieved by the same visual encoder
generalizing to datasets that were not seen during pretraining. While the
benefits of pretraining sometimes diminish (or entirely disappear) with long
finetuning schedules, we find that OVRL's performance gains continue to
increase (not decrease) as the agent is trained for 2 billion frames of
experience.
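As a rough illustration of this 2-stage recipe, the hedged PyTorch-style sketch below pretrains a visual encoder with a SimSiam-style SSL objective (a stand-in for the SSL actually used on the Omnidata renders) and then reuses that encoder, unfrozen, inside a toy navigation policy. All module sizes, augmentations, and the policy head are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a 2-stage pipeline in the spirit of OVRL (not the authors' code).
# Stage 1: self-supervised pretraining of a visual encoder on unlabeled indoor images.
# Stage 2: reuse of the pretrained encoder inside a navigation policy, finetuned with
# image augmentations. Dataset, encoder size, and SSL objective here are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models, transforms

# --- Stage 1: offline SSL pretraining (SimSiam-style stand-in for the paper's SSL) ---
encoder = models.resnet50(weights=None)
encoder.fc = nn.Identity()                      # expose 2048-d features
projector = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 2048))
predictor = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 2048))
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
])
opt = torch.optim.AdamW(list(encoder.parameters()) + list(projector.parameters())
                        + list(predictor.parameters()), lr=1e-4)

def ssl_step(images):                           # images: (B, 3, H, W) float frames in [0, 1]
    v1, v2 = augment(images), augment(images)   # two randomly augmented views
    z1, z2 = projector(encoder(v1)), projector(encoder(v2))
    p1, p2 = predictor(z1), predictor(z2)
    # negative cosine similarity with stop-gradient on the target branch
    loss = -0.5 * (F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
                   + F.cosine_similarity(p2, z1.detach(), dim=-1).mean())
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# --- Stage 2: online finetuning of the visuomotor policy on a navigation task ---
class NavPolicy(nn.Module):
    def __init__(self, visual_encoder, num_actions=4):
        super().__init__()
        self.visual_encoder = visual_encoder    # initialized from stage-1 weights
        self.rnn = nn.GRU(2048, 512, batch_first=True)
        self.actor = nn.Linear(512, num_actions)

    def forward(self, obs, hidden=None):        # obs: (B, T, 3, H, W)
        B, T = obs.shape[:2]
        feats = self.visual_encoder(obs.flatten(0, 1)).view(B, T, -1)
        out, hidden = self.rnn(feats, hidden)
        return self.actor(out), hidden

policy = NavPolicy(encoder)                     # encoder keeps training, i.e. it is not frozen
# During RL/IL finetuning, observations would again pass through `augment`
# (e.g. random crops / color jitter) before reaching the encoder.
```

The design point the abstract stresses is that the stage-1 encoder keeps training in stage 2, with image augmentations, over long schedules, rather than being frozen.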
Related papers
- Pretrained Visual Representations in Reinforcement Learning [0.0]
This paper compares the performance of visual reinforcement learning algorithms that train a convolutional neural network (CNN) from scratch with those that utilize pre-trained visual representations (PVRs).
We evaluate the Dormant Ratio Minimization (DRM) algorithm, a state-of-the-art visual RL method, against three PVRs: ResNet18, DINOv2, and Visual Cortex (VC).
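For context, using a frozen PVR in place of a from-scratch CNN typically looks like the hedged sketch below; ResNet18 mirrors one of the PVRs named above, while the discrete policy head and action count are illustrative assumptions rather than the DRM setup.

```python
# Hedged sketch: a frozen pretrained visual representation (PVR) feeding an RL policy head.
# ResNet18 is one of the PVRs named in the summary; the policy head is illustrative only.
import torch
import torch.nn as nn
from torchvision import models

pvr = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
pvr.fc = nn.Identity()                     # expose 512-d features
pvr.eval()
for p in pvr.parameters():                 # frozen: only the policy head is trained
    p.requires_grad_(False)

policy_head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 6))

def act(obs):                              # obs: (B, 3, 224, 224) normalized frames
    with torch.no_grad():
        feats = pvr(obs)                   # (B, 512), no gradient through the PVR
    logits = policy_head(feats)            # gradients flow only into the head
    return torch.distributions.Categorical(logits=logits).sample()
```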
arXiv Detail & Related papers (2024-07-24T12:53:26Z)
- Towards Better Data Exploitation in Self-Supervised Monocular Depth Estimation [14.262669370264994]
In this paper, we take two data augmentation techniques, namely Resizing-Cropping and Splitting-Permuting, to fully exploit the potential of training datasets.
Specifically, the original image and the generated two augmented images are fed into the training pipeline simultaneously and we leverage them to conduct self-distillation.
Experimental results demonstrate our method can achieve state-of-the-art performance on the KITTI benchmark, with both raw ground truth and improved ground truth.
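A minimal sketch of the self-distillation idea is given below, under the assumption that a prediction on the original image (detached) supervises the prediction on an augmented view; a single crop-and-resize stands in for the paper's Resizing-Cropping and Splitting-Permuting operators, which are not reproduced here.

```python
# Hedged sketch of self-distillation with augmented views for monocular depth
# (illustrative only; the paper's exact Resizing-Cropping and Splitting-Permuting
# operators are not reproduced here -- a single crop-and-resize stands in for them).
import torch
import torch.nn as nn
import torch.nn.functional as F

depth_net = nn.Sequential(                 # toy depth network standing in for the real model
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1), nn.Softplus(),
)

def crop_resize(x, top, left, size, out_hw):
    patch = x[..., top:top + size, left:left + size]
    return F.interpolate(patch, size=out_hw, mode="bilinear", align_corners=False)

def self_distill_step(image):              # image: (B, 3, H, W) float frames
    H, W = image.shape[-2:]
    with torch.no_grad():                  # teacher: prediction on the original image
        teacher_depth = depth_net(image)
    top, left, size = 8, 8, min(H, W) - 16 # fixed crop for clarity; random in practice
    student_view = crop_resize(image, top, left, size, (H, W))
    student_depth = depth_net(student_view)
    # distill: the cropped/resized teacher prediction supervises the student prediction
    target = crop_resize(teacher_depth, top, left, size, (H, W))
    return F.l1_loss(student_depth, target)
```

In the actual method the original image and both augmented images are fed through the training pipeline simultaneously; the sketch shows only the distillation direction for one view, and ignores depth-scale bookkeeping.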
arXiv Detail & Related papers (2023-09-11T06:18:05Z)
- OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav [62.32806118504701]
We present a single neural network architecture that achieves state-of-the-art results on both the ImageNav and ObjectNav tasks.
Such general-purpose methods offer advantages of simplicity in design, positive scaling with available compute, and versatile applicability to multiple tasks.
arXiv Detail & Related papers (2023-03-14T11:15:37Z)
- Leveraging the Third Dimension in Contrastive Learning [88.17394309208925]
Self-Supervised Learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks.
The data augmentations these methods rely on, such as random cropping and color jittering, ignore the fact that biological vision takes place in an immersive three-dimensional, temporally contiguous environment.
We explore two distinct approaches to incorporating depth signals into the SSL framework.
arXiv Detail & Related papers (2023-01-27T15:45:03Z)
- VRL3: A Data-Driven Framework for Visual Deep Reinforcement Learning [14.869611817084015]
We propose VRL3, a data-driven framework for solving visual deep reinforcement learning (DRL) tasks.
Our framework has three stages: in stage 1, we leverage non-RL datasets to learn task-agnostic visual representations; in stage 2, we use offline RL data (e.g., a limited number of expert demonstrations) to convert these into task-specific representations; in stage 3, we fine-tune the agent with online RL.
On a set of challenging hand manipulation tasks, VRL3 achieves an average of 780% better sample efficiency than the previous state of the art.
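The three-stage structure reads naturally as a hedged skeleton like the one below; the function bodies, network shapes, and argument names (non_rl_images, offline_transitions, env) are placeholders, not VRL3's implementation.

```python
# Hedged skeleton of a three-stage pipeline in the spirit of VRL3 (not the authors' code).
# Each stage body is a placeholder; only the hand-off of the encoder/agent is shown.
import torch.nn as nn

def stage1_pretrain_encoder(non_rl_images) -> nn.Module:
    """Learn task-agnostic visual representations from non-RL data (e.g. image datasets)."""
    encoder = nn.Sequential(nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(), nn.Flatten())
    # ... self-supervised or supervised pretraining on `non_rl_images` would go here ...
    return encoder

def stage2_offline_rl(encoder: nn.Module, offline_transitions) -> nn.Module:
    """Use offline RL data (e.g. a small set of demonstrations) to make the
    representations task-specific and to warm-start the agent."""
    agent = nn.Sequential(encoder, nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(4))
    # ... offline RL / behavior cloning on `offline_transitions` would go here ...
    return agent

def stage3_online_finetune(agent: nn.Module, env):
    """Continue training the full agent with online RL interaction."""
    # ... online RL loop collecting experience from `env` would go here ...
    return agent
```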
arXiv Detail & Related papers (2022-02-17T09:51:32Z)
- VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition [61.75391989107558]
We present a visual-linguistic long-tailed recognition framework, termed VL-LTR.
Our method can learn visual representation from images and corresponding linguistic representation from noisy class-level text descriptions.
Notably, our method achieves 77.2% overall accuracy on ImageNet-LT, which significantly outperforms the previous best method by over 17 points.
arXiv Detail & Related papers (2021-11-26T16:24:03Z)
- Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity.
We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies.
Combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation.
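As a rough, simplified illustration of a region-level matching objective (not EsViT's exact pre-training loss), the sketch below matches each patch feature of one augmented view to its most similar patch in the other view and pulls the matched pairs together.

```python
# Hedged sketch of a generic region-matching objective between two augmented views.
# This is a simplification for illustration, not EsViT's exact pre-training loss.
import torch
import torch.nn.functional as F

def region_matching_loss(tokens_a, tokens_b):
    """tokens_a, tokens_b: (B, N, D) patch/region features from two views of the same image."""
    a = F.normalize(tokens_a, dim=-1)
    b = F.normalize(tokens_b, dim=-1)
    sim = torch.bmm(a, b.transpose(1, 2))          # (B, N, N) cosine similarities
    # for each region in view A, take its best-matching region in view B ...
    best = sim.max(dim=-1).values                  # (B, N)
    # ... and encourage those matched pairs to be similar; in real training this
    # would be paired with a non-collapse mechanism (e.g. negatives or a teacher).
    return (1.0 - best).mean()

# toy usage with random features standing in for ViT patch embeddings
loss = region_matching_loss(torch.randn(2, 49, 128), torch.randn(2, 49, 128))
```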
arXiv Detail & Related papers (2021-06-17T19:57:33Z)
- Spatiotemporal Contrastive Video Representation Learning [87.56145031149869]
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn visual representations from unlabeled videos.
Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space.
We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial.
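The clip-level contrastive objective described here can be sketched as a standard InfoNCE loss over a batch of video clips; the embedding dimension, temperature, and the random tensors in the usage line are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a clip-level contrastive (InfoNCE) loss of the kind the summary
# describes: two augmented clips of the same video are positives, clips from other
# videos in the batch are negatives. Illustrative, not the paper's exact implementation.
import torch
import torch.nn.functional as F

def clip_infonce(z1, z2, temperature=0.1):
    """z1, z2: (B, D) embeddings of two augmented clips per video."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))          # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# toy usage with random clip embeddings (a 3D CNN or video ViT would produce these)
loss = clip_infonce(torch.randn(8, 128), torch.randn(8, 128))
```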
arXiv Detail & Related papers (2020-08-09T19:58:45Z)