Learning Navigational Visual Representations with Semantic Map
Supervision
- URL: http://arxiv.org/abs/2307.12335v1
- Date: Sun, 23 Jul 2023 14:01:05 GMT
- Title: Learning Navigational Visual Representations with Semantic Map
Supervision
- Authors: Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernoncourt, Trung Bui,
Stephen Gould, Hao Tan
- Abstract summary: We propose a navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps.
Ego$^2$-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation.
- Score: 85.91625020847358
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Being able to perceive the semantics and the spatial structure of the
environment is essential for visual navigation of a household robot. However,
most existing works only employ visual backbones pre-trained either with
independent images for classification or with self-supervised learning methods
to adapt to the indoor navigation domain, neglecting the spatial relationships
that are essential to the learning of navigation. Inspired by the behavior that
humans naturally build semantically and spatially meaningful cognitive maps in
their brains during navigation, in this paper, we propose a novel
navigational-specific visual representation learning method by contrasting the
agent's egocentric views and semantic maps (Ego$^2$-Map). We apply the visual
transformer as the backbone encoder and train the model with data collected
from the large-scale Habitat-Matterport3D environments. Ego$^2$-Map learning
transfers the compact and rich information from a map, such as objects,
structure and transition, to the agent's egocentric representations for
navigation. Experiments show that agents using our learned representations on
object-goal navigation outperform recent visual pre-training methods. Moreover,
our representations significantly improve vision-and-language navigation in
continuous environments for both high-level and low-level action spaces,
achieving new state-of-the-art results of 47% SR and 41% SPL on the test
server.
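To make the view-to-map contrastive idea above concrete, the following is a minimal sketch of an InfoNCE-style objective between egocentric-view embeddings and semantic-map embeddings. It illustrates the general technique rather than the paper's released implementation; the encoder outputs, batch pairing, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def ego2map_contrastive_loss(view_emb, map_emb, temperature=0.07):
    """InfoNCE-style loss pairing each egocentric-view embedding with the
    embedding of its corresponding semantic map (illustrative sketch only).

    view_emb: (B, D) features from an egocentric-view encoder (e.g. a ViT).
    map_emb:  (B, D) features from a semantic-map encoder; row i is the
              positive for view i, and all other rows act as negatives.
    """
    view_emb = F.normalize(view_emb, dim=-1)
    map_emb = F.normalize(map_emb, dim=-1)
    # (B, B) cosine-similarity matrix scaled by temperature.
    logits = view_emb @ map_emb.t() / temperature
    targets = torch.arange(view_emb.size(0), device=view_emb.device)
    # Symmetric cross-entropy: match views to maps and maps to views.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In this sketch the map branch is used only as a training signal; at navigation time, only the egocentric-view encoder would be kept, which matches the transfer setting described in the abstract.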
Related papers
- Visuospatial navigation without distance, prediction, or maps [1.3812010983144802]
We show the sufficiency of a minimal feedforward framework in a classic visual navigation task.
While visual distance enables direct trajectories to the goal, two distinct algorithms develop to robustly navigate using visual angles alone.
Each of the three strategies confers unique contextual tradeoffs and aligns with movement behavior observed in rodents, insects, fish, and sperm cells.
arXiv Detail & Related papers (2024-07-18T14:07:44Z) - NavHint: Vision and Language Navigation Agent with a Hint Generator [31.322331792911598]
We provide indirect supervision to the navigation agent through a hint generator that provides detailed visual descriptions.
The hint generator assists the navigation agent in developing a global understanding of the visual environment.
We evaluate our method on the R2R and R4R datasets and achieve state-of-the-art on several metrics.
arXiv Detail & Related papers (2024-02-04T16:23:16Z) - Interactive Semantic Map Representation for Skill-based Visual Object
Navigation [43.71312386938849]
This paper introduces a new representation of a scene semantic map formed during the embodied agent's interaction with the indoor environment.
We have implemented this representation into a full-fledged navigation approach called SkillTron.
The proposed approach makes it possible to form both intermediate goals for robot exploration and the final goal for object navigation.
arXiv Detail & Related papers (2023-11-07T16:30:12Z) - Object Goal Navigation with Recursive Implicit Maps [92.6347010295396]
We propose an implicit spatial map for object goal navigation.
Our method significantly outperforms the state of the art on the challenging MP3D dataset.
We deploy our model on a real robot and achieve encouraging object goal navigation results in real scenes.
arXiv Detail & Related papers (2023-08-10T14:21:33Z) - Learning with a Mole: Transferable latent spatial representations for
navigation without reconstruction [12.845774297648736]
In most end-to-end learning approaches, the representation is latent and usually has no clearly defined interpretation.
In this work we propose to learn an actionable representation of the scene independently of the targeted downstream task.
The learned representation is optimized by a blind auxiliary agent trained to navigate with it on multiple short sub-episodes branching out from a waypoint.
arXiv Detail & Related papers (2023-06-06T16:51:43Z) - Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language
Navigation [87.52136927091712]
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by natural-language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both the spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both fine-grained object details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
arXiv Detail & Related papers (2022-10-14T04:23:27Z) - PONI: Potential Functions for ObjectGoal Navigation with
Interaction-free Learning [125.22462763376993]
We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI).
PONI disentangles the skills of 'where to look?' for an object and 'how to navigate to (x, y)?'
arXiv Detail & Related papers (2022-01-25T01:07:32Z) - Deep Learning for Embodied Vision Navigation: A Survey [108.13766213265069]
"Embodied visual navigation" problem requires an agent to navigate in a 3D environment mainly rely on its first-person observation.
This paper attempts to establish an outline of the current works in the field of embodied visual navigation by providing a comprehensive literature survey.
arXiv Detail & Related papers (2021-07-07T12:09:04Z) - MaAST: Map Attention with Semantic Transformersfor Efficient Visual
Navigation [4.127128889779478]
This work aims to perform better than, or comparably to, existing learning-based solutions for visual navigation by autonomous agents.
We propose a method to encode vital scene semantics into a semantically informed, top-down egocentric map representation.
We conduct experiments on 3-D reconstructed indoor PointGoal visual navigation and demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-03-21T12:01:23Z) - Occupancy Anticipation for Efficient Exploration and Navigation [97.17517060585875]
We propose occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions.
By exploiting context in both the egocentric views and top-down maps, our model successfully anticipates a broader map of the environment.
Our approach is the winning entry in the 2020 Habitat PointNav Challenge.
arXiv Detail & Related papers (2020-08-21T03:16:51Z)