Memory-Augmented Reinforcement Learning for Image-Goal Navigation
- URL: http://arxiv.org/abs/2101.05181v1
- Date: Wed, 13 Jan 2021 16:30:20 GMT
- Title: Memory-Augmented Reinforcement Learning for Image-Goal Navigation
- Authors: Lina Mezghani, Sainbayar Sukhbaatar, Thibaut Lavril, Oleksandr
Maksymets, Dhruv Batra, Piotr Bojanowski, Karteek Alahari
- Abstract summary: We present a novel method that leverages a cross-episode memory to learn to navigate.
In order to avoid overfitting, we propose to use data augmentation on the RGB input during training.
We obtain this competitive performance from RGB input only, without access to additional sensors such as position or depth.
- Score: 67.3963444878746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we address the problem of image-goal navigation in the context
of visually-realistic 3D environments. This task involves navigating to a
location indicated by a target image in a previously unseen environment.
Earlier attempts, including RL-based and SLAM-based approaches, have either
shown poor generalization performance or rely heavily on pose/depth
sensors. We present a novel method that leverages a cross-episode memory to
learn to navigate. We first train a state-embedding network in a
self-supervised fashion, and then use it to embed previously-visited states
into a memory. In order to avoid overfitting, we propose to use data
augmentation on the RGB input during training. We validate our approach through
extensive evaluations, showing that our data-augmented memory-based model
establishes a new state of the art on the image-goal navigation task in the
challenging Gibson dataset. We obtain this competitive performance from RGB
input only, without access to additional sensors such as position or depth.
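The abstract describes the pipeline only at a high level. As one plausible reading, here is a minimal PyTorch sketch of its three ingredients: a state-embedding network, an attention read over a memory of embedded visited states, and data augmentation on the RGB input. The module sizes, the attention-based memory read, and the specific transforms are all assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Hypothetical RGB augmentation; the paper augments the RGB input during
# training, but the exact transforms here are assumptions.
rgb_augment = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    T.RandomResizedCrop(128, scale=(0.8, 1.0)),
])

class StateEmbedding(nn.Module):
    """Embeds an RGB observation into a compact state vector (the paper
    trains such a network self-supervised; this is a small CNN stand-in)."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, rgb):             # rgb: (B, 3, H, W)
        return self.fc(self.conv(rgb))  # (B, dim)

class MemoryPolicy(nn.Module):
    """Policy head that attends over a memory of visited-state embeddings."""
    def __init__(self, dim=128, n_actions=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(2 * dim, n_actions)

    def forward(self, current, goal, memory):
        # current, goal: (B, dim); memory: (B, M, dim) of visited states.
        query = (current + goal).unsqueeze(1)       # (B, 1, dim)
        read, _ = self.attn(query, memory, memory)  # attention read of memory
        return self.head(torch.cat([current, read.squeeze(1)], dim=-1))

# Toy usage: embed augmented observation and goal, act given the memory.
embed, policy = StateEmbedding(), MemoryPolicy()
memory = embed(torch.rand(8, 3, 128, 128)).unsqueeze(0)  # 8 visited states
obs, goal = torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)
logits = policy(embed(rgb_augment(obs)), embed(goal), memory)  # (1, n_actions)
```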
Related papers
- GaussNav: Gaussian Splatting for Visual Navigation [92.13664084464514]
Instance ImageGoal Navigation (IIN) requires an agent to locate a specific object depicted in a goal image within an unexplored environment.
Our framework constructs a novel map representation based on 3D Gaussian Splatting (3DGS).
It demonstrates a significant leap in performance, evidenced by an increase in Success weighted by Path Length (SPL) from 0.252 to 0.578 on the challenging Habitat-Matterport 3D (HM3D) dataset; the SPL metric is sketched after this entry.
arXiv Detail & Related papers (2024-03-18T09:56:48Z)
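The SPL numbers quoted above follow the standard definition of Success weighted by Path Length (Anderson et al., 2018): each episode contributes its binary success weighted by the ratio of the shortest-path length to the longer of the agent's path and the shortest path. A minimal helper:

```python
import numpy as np

def spl(successes, shortest_lengths, agent_lengths):
    """Success weighted by Path Length (Anderson et al., 2018).

    successes:        binary success indicator per episode
    shortest_lengths: geodesic shortest-path length per episode
    agent_lengths:    length of the path the agent actually took
    """
    s = np.asarray(successes, dtype=float)
    l = np.asarray(shortest_lengths, dtype=float)
    p = np.asarray(agent_lengths, dtype=float)
    return float(np.mean(s * l / np.maximum(p, l)))

# Two episodes: one success along a near-optimal path, one failure.
print(spl([1, 0], [5.0, 4.0], [6.0, 9.0]))  # ~0.417
```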
- How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers [94.46825166907831]
We present a training-free solution to tackle the object goal navigation problem in Embodied AI.
Our method builds a structured scene representation based on the classic visual simultaneous localization and mapping (V-SLAM) framework.
Our method propagates semantics over the scene graph, based on language priors and scene statistics, to attach semantic knowledge to the geometric frontiers; a frontier-scoring sketch follows this entry.
arXiv Detail & Related papers (2023-05-26T13:38:33Z)
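To make the frontier idea concrete, here is a hedged sketch of scoring exploration frontiers by semantic affinity to the goal category, discounted by travel cost. The scoring function, data layout, and `semantic_prior` table are illustrative assumptions, not the paper's formulation.

```python
def best_frontier(frontiers, goal_category, semantic_prior, decay=0.9):
    """Pick the frontier whose nearby semantics best match the goal.

    frontiers:      list of dicts with 'nearby_labels' and 'distance'
    semantic_prior: (label, goal) -> co-occurrence score, standing in for
                    language priors / scene statistics
    """
    def score(f):
        affinity = max((semantic_prior.get((lbl, goal_category), 0.0)
                        for lbl in f['nearby_labels']), default=0.0)
        return affinity * decay ** f['distance']  # discount distant frontiers
    return max(range(len(frontiers)), key=lambda i: score(frontiers[i]))

prior = {('table', 'chair'): 0.8, ('sink', 'chair'): 0.1}
frontiers = [
    {'nearby_labels': ['table'], 'distance': 4.0},
    {'nearby_labels': ['sink'], 'distance': 1.0},
]
print(best_frontier(frontiers, 'chair', prior))  # 0: the table frontier
```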
- Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z)
- Depth Monocular Estimation with Attention-based Encoder-Decoder Network from Single Image [7.753378095194288]
Vision-based approaches have recently received much attention and can overcome the drawbacks of dedicated depth sensors.
In this work, we explore an extreme vision-based scenario: estimating a depth map from a single monocular image, a setting where predictions are often severely plagued by grid artifacts and blurry edges.
Our approach finds the focus of the current image with minimal overhead and avoids losing depth features; a minimal attention encoder-decoder sketch follows this entry.
arXiv Detail & Related papers (2022-10-24T23:01:25Z)
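As an illustration of an attention-based encoder-decoder for single-image depth, here is a generic sketch with a channel-attention bottleneck; the layer sizes and the squeeze-and-excitation-style attention are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AttnDepthNet(nn.Module):
    """Minimal encoder-decoder with a channel-attention bottleneck."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Squeeze-and-excitation style gate that re-weights feature channels.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(64, 64, 1), nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, rgb):                 # (B, 3, H, W)
        feats = self.encoder(rgb)
        feats = feats * self.attn(feats)    # attention-weighted bottleneck
        return self.decoder(feats)          # (B, 1, H, W) depth map

print(AttnDepthNet()(torch.rand(1, 3, 64, 64)).shape)  # (1, 1, 64, 64)
```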
- GoToNet: Fast Monocular Scene Exposure and Exploration [0.6204265638103346]
We present a novel method for real-time environment exploration.
Our method requires only one look (image) to make a good tactical decision.
Two direction predictions, characterized by pixels dubbed the Goto and Lookat pixels, comprise the core of our method; a two-head sketch follows this entry.
arXiv Detail & Related papers (2022-06-13T08:28:31Z)
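A minimal sketch of the two-pixel idea: one backbone, two heatmap heads, with the argmax locations standing in for the Goto and Lookat pixels. The backbone and head design are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TwoPixelHead(nn.Module):
    """Predicts two heatmaps from one image; their argmax locations play
    the role of the Goto (move toward) and Lookat (look toward) pixels."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.heads = nn.Conv2d(32, 2, 1)  # channel 0: Goto, channel 1: Lookat

    def forward(self, img):
        maps = self.heads(self.backbone(img))          # (B, 2, H, W)
        w = maps.shape[-1]
        flat = maps.flatten(2).argmax(-1)              # (B, 2) flat indices
        return torch.stack([flat // w, flat % w], -1)  # (B, 2, [row, col])

print(TwoPixelHead()(torch.rand(1, 3, 96, 96)))  # Goto and Lookat coordinates
```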
- Simple and Effective Synthesis of Indoor 3D Scenes [78.95697556834536]
We study the problem of synthesizing immersive 3D indoor scenes from one or more images.
Our aim is to generate high-resolution images and videos from novel viewpoints.
We propose an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images; a generator sketch follows this entry.
arXiv Detail & Related papers (2022-04-06T17:54:46Z)
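A compact sketch of the image-to-image mapping: a generator taking an incomplete RGB-D reprojection to a completed RGB-D image. The architecture is an assumption, and the adversarial loss and discriminator are omitted for brevity.

```python
import torch
import torch.nn as nn

class RGBDCompletionGenerator(nn.Module):
    """Maps an incomplete 4-channel RGB-D reprojection (holes as zeros)
    to a full RGB-D image, in the spirit of the GAN described above."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 4, 4, stride=2, padding=1),
        )

    def forward(self, sparse_rgbd):   # (B, 4, H, W)
        return self.net(sparse_rgbd)  # (B, 4, H, W) completed RGB-D

print(RGBDCompletionGenerator()(torch.rand(1, 4, 128, 128)).shape)
```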
- Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images [69.5662419067878]
Grounding referring expressions in RGBD images is an emerging field.
We present a novel task of 3D visual grounding in a single-view RGBD image, where the referred objects are often only partially scanned due to occlusion.
Our approach first fuses the language and the visual features at the bottom level to generate a heatmap that localizes the relevant regions in the RGBD image.
Then it conducts adaptive feature learning based on the heatmap and performs object-level matching, with a further visio-linguistic fusion, to finally ground the referred object; a fusion sketch follows this entry.
arXiv Detail & Related papers (2021-03-14T11:18:50Z)
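A rough sketch of the bottom-level fusion step: a sentence embedding gates RGBD features to produce a relevance heatmap. Fusion by broadcast multiplication and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LangVisHeatmap(nn.Module):
    """Fuses a sentence embedding with RGBD features into a heatmap that
    localizes regions relevant to the referring expression."""
    def __init__(self, lang_dim=256, vis_ch=64):
        super().__init__()
        self.vis = nn.Conv2d(4, vis_ch, 3, padding=1)  # RGB + depth channels
        self.lang = nn.Linear(lang_dim, vis_ch)
        self.out = nn.Conv2d(vis_ch, 1, 1)

    def forward(self, rgbd, sentence_emb):
        v = torch.relu(self.vis(rgbd))                 # (B, C, H, W)
        l = self.lang(sentence_emb)[:, :, None, None]  # (B, C, 1, 1) gate
        return torch.sigmoid(self.out(v * l))          # (B, 1, H, W) heatmap

heat = LangVisHeatmap()(torch.rand(1, 4, 64, 64), torch.rand(1, 256))
print(heat.shape)  # torch.Size([1, 1, 64, 64])
```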
- City-Scale Visual Place Recognition with Deep Local Features Based on Multi-Scale Ordered VLAD Pooling [5.274399407597545]
We present a fully-automated system for city-scale place recognition based on content-based image retrieval.
First, we provide a comprehensive analysis of visual place recognition and sketch out the unique challenges of the task.
Next, we propose a simple pooling approach on top of convolutional neural network activations to embed spatial information into the image representation vector; a pooling sketch follows this entry.
arXiv Detail & Related papers (2020-09-19T15:21:59Z)
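To illustrate how pooling can retain spatial layout, here is a simple grid-ordered pooling over CNN activations: cells of a fixed grid are pooled and concatenated in order. This shows the general idea only; the paper's multi-scale ordered VLAD pooling is more involved.

```python
import torch
import torch.nn.functional as F

def grid_ordered_pool(feats, grid=2):
    """Average-pool activations within each cell of a fixed grid and
    concatenate cells in order, keeping coarse spatial layout.

    feats: (B, C, H, W) activation map -> (B, C * grid * grid) descriptor
    """
    pooled = F.adaptive_avg_pool2d(feats, grid)  # (B, C, grid, grid)
    return pooled.flatten(1)                     # cells in fixed order

desc = grid_ordered_pool(torch.rand(1, 256, 14, 14), grid=2)
print(desc.shape)  # torch.Size([1, 1024])
```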
- Learning to Visually Navigate in Photorealistic Environments Without any Supervision [37.22924101745505]
We introduce a novel approach for learning to navigate from image inputs without external supervision or reward.
Our approach consists of three stages: learning a good representation of first-person views, then learning to explore using memory, and finally learning to navigate by setting its own goals; the stages are outlined after this entry.
We show the benefits of our approach by training an agent to navigate challenging photo-realistic environments from the Gibson dataset with RGB inputs only.
arXiv Detail & Related papers (2020-04-10T08:59:32Z)
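The three stages could be organized along these lines; this is only a schematic outline with placeholder bodies, not the authors' training code.

```python
# Schematic outline of the three stages summarized above.

def stage1_learn_representation(first_person_views):
    """Self-supervised representation learning on first-person views."""
    raise NotImplementedError

def stage2_learn_exploration(embed, env):
    """Learn to explore, keeping a memory of embedded visited states."""
    raise NotImplementedError

def stage3_learn_navigation(embed, explorer, env):
    """Learn goal-reaching by setting goals for itself."""
    raise NotImplementedError
```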
This list is automatically generated from the titles and abstracts of the papers on this site.