VTNet: Visual Transformer Network for Object Goal Navigation
- URL: http://arxiv.org/abs/2105.09447v1
- Date: Thu, 20 May 2021 01:23:15 GMT
- Title: VTNet: Visual Transformer Network for Object Goal Navigation
- Authors: Heming Du, Xin Yu, Liang Zheng
- Abstract summary: We introduce a Visual Transformer Network (VTNet) for learning informative visual representations in navigation.
In a nutshell, VTNet embeds object and region features with their location cues as spatial-aware descriptors.
Experiments in the artificial environment AI2-Thor demonstrate that VTNet significantly outperforms state-of-the-art methods in unseen testing environments.
- Score: 36.15625223586484
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object goal navigation aims to steer an agent towards a target object based
on observations of the agent. It is of pivotal importance to design effective
visual representations of the observed scene in determining navigation actions.
In this paper, we introduce a Visual Transformer Network (VTNet) for learning
informative visual representations in navigation. VTNet is a highly effective
structure that embodies two key properties for visual representations: First,
the relationships among all the object instances in a scene are exploited;
Second, the spatial locations of objects and image regions are emphasized so
that directional navigation signals can be learned. Furthermore, we also
develop a pre-training scheme to associate the visual representations with
navigation signals, and thus facilitate navigation policy learning. In a
nutshell, VTNet embeds object and region features with their location cues as
spatial-aware descriptors and then incorporates all the encoded descriptors
through attention operations to achieve informative representation for
navigation. Given such visual representations, agents are able to explore the
correlations between visual observations and navigation actions. For example,
an agent would prioritize "turning right" over "turning left" when the visual
representation emphasizes the right side of the activation map. Experiments in
the artificial environment AI2-Thor demonstrate that VTNet significantly
outperforms state-of-the-art methods in unseen testing environments.
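A minimal sketch of the architecture described in the abstract, assuming a PyTorch setting: object and region features are concatenated with their normalized location cues to form spatial-aware descriptors, which are then fused through multi-head attention and mapped to action logits. All dimensions, module names, and the detector interface below are illustrative assumptions, not the authors' released implementation.
```python
# Illustrative sketch only (assumed dimensions and interfaces), not the
# authors' official VTNet implementation.
import torch
import torch.nn as nn


class SpatialAwareDescriptors(nn.Module):
    """Embed per-instance features together with their bounding-box location cues."""

    def __init__(self, feat_dim: int, box_dim: int = 4, embed_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(feat_dim + box_dim, embed_dim)

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, feat_dim), boxes: (B, N, 4) normalized [x1, y1, x2, y2]
        return self.proj(torch.cat([feats, boxes], dim=-1))


class VTNetSketch(nn.Module):
    """Fuse object and region descriptors with attention, then predict actions."""

    def __init__(self, embed_dim: int = 128, num_heads: int = 4, num_actions: int = 6):
        super().__init__()
        self.obj_enc = SpatialAwareDescriptors(feat_dim=256, embed_dim=embed_dim)
        self.region_enc = SpatialAwareDescriptors(feat_dim=512, embed_dim=embed_dim)
        # Region descriptors attend over object descriptors; one plausible reading
        # of "incorporates all the encoded descriptors through attention operations".
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.policy = nn.Linear(embed_dim, num_actions)

    def forward(self, obj_feats, obj_boxes, region_feats, region_boxes):
        obj_desc = self.obj_enc(obj_feats, obj_boxes)               # (B, N, D)
        region_desc = self.region_enc(region_feats, region_boxes)   # (B, M, D)
        fused, _ = self.attn(region_desc, obj_desc, obj_desc)       # (B, M, D)
        visual_rep = fused.mean(dim=1)                              # (B, D)
        return self.policy(visual_rep)                              # action logits


if __name__ == "__main__":
    B, N, M = 2, 10, 49  # batch size, detected objects, 7x7 image regions (assumed)
    logits = VTNetSketch()(
        torch.randn(B, N, 256), torch.rand(B, N, 4),
        torch.randn(B, M, 512), torch.rand(B, M, 4),
    )
    print(logits.shape)  # torch.Size([2, 6])
```
In the full method, this representation would also be combined with the target object indicator and a learned navigation policy, and the paper's pre-training scheme would align it with navigation signals; the sketch stops at action logits to keep the attention-over-descriptors idea visible.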
Related papers
- Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [16.32780793344835]
We propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation.
Our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language description with visual perception.
The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability.
arXiv Detail & Related papers (2024-02-29T06:31:18Z)
- NavHint: Vision and Language Navigation Agent with a Hint Generator [31.322331792911598]
We provide indirect supervision to the navigation agent through a hint generator that provides detailed visual descriptions.
The hint generator assists the navigation agent in developing a global understanding of the visual environment.
We evaluate our method on the R2R and R4R datasets and achieve state-of-the-art on several metrics.
arXiv Detail & Related papers (2024-02-04T16:23:16Z)
- Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigation-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps.
Ego²-Map learning transfers the compact and rich information from a map, such as objects, structure, and transitions, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z)
- KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation [61.08389704326803]
Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location in real scenes by following natural language instructions.
Most previous approaches use whole-image features or object-centric features to represent navigable candidates.
We propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability.
arXiv Detail & Related papers (2023-03-28T08:00:46Z)
- Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation [87.52136927091712]
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both the spatial locations and the semantic information of objects in the environment.
We propose a multi-granularity map, which contains both object fine-grained details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
arXiv Detail & Related papers (2022-10-14T04:23:27Z)
- SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation [57.12508968239015]
This work presents a transformer-based vision-and-language navigation (VLN) agent.
It uses two different visual encoders: a scene classification network and an object detector.
Scene features contribute high-level contextual information that supports object-level processing.
arXiv Detail & Related papers (2021-10-27T03:29:34Z)
- Pushing it out of the Way: Interactive Visual Navigation [62.296686176988125]
We study the problem of interactive navigation where agents learn to change the environment to navigate more efficiently to their goals.
We introduce the Neural Interaction Engine (NIE) to explicitly predict the change in the environment caused by the agent's actions.
By modeling the changes while planning, we find that agents exhibit significant improvements in their navigational capabilities.
arXiv Detail & Related papers (2021-04-28T22:46:41Z)