KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2303.15796v1
- Date: Tue, 28 Mar 2023 08:00:46 GMT
- Title: KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation
- Authors: Xiangyang Li, Zihan Wang, Jiahao Yang, Yaowei Wang, Shuqiang Jiang
- Abstract summary: Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location in real scenes by following a natural language instruction.
Most previous approaches utilize entire-view features or object-centric features to represent navigable candidates.
We propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability.
- Score: 61.08389704326803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-language navigation (VLN) is the task of enabling an embodied agent
to navigate to a remote location in real scenes by following a natural language instruction.
Most previous approaches utilize entire-view features or object-centric features to
represent navigable candidates. However, these representations are not sufficient for an
agent to perform the actions needed to arrive at the target location. As knowledge provides
crucial information that is complementary to the visible content, in this paper we propose
a Knowledge Enhanced Reasoning Model (KERM) that leverages knowledge to improve the agent's
navigation ability. Specifically, we first retrieve facts (i.e., knowledge described by
language descriptions) for the navigation views from a constructed knowledge base, based on
their local regions. The retrieved facts range from properties of a single object (e.g.,
color, shape) to relationships between objects (e.g., action, spatial position), providing
crucial information for VLN. We further present KERM, which contains purification,
fact-aware interaction, and instruction-guided aggregation modules to integrate visual,
history, instruction, and fact features. The proposed KERM can automatically select and
gather crucial and relevant cues, obtaining more accurate action prediction. Experimental
results on the REVERIE, R2R, and SOON datasets demonstrate the effectiveness of the
proposed method.
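The abstract describes the pipeline only at a high level. The PyTorch sketch below illustrates one plausible reading of it, assuming cosine-similarity retrieval of facts for each view region and instruction-guided cross-attention over visual and fact cues; the function and module names, feature dimensions, and retrieval scheme are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def retrieve_facts(region_feats, fact_index, fact_feats, k=5):
    """Top-k facts per region by cosine similarity against a fact embedding index.

    region_feats: (R, D) visual features of the local regions in one view
    fact_index:   (N, D) embeddings of the fact sentences in the knowledge base
    fact_feats:   (N, D) fact features handed to the reasoning model
    """
    sim = F.normalize(region_feats, dim=-1) @ F.normalize(fact_index, dim=-1).T
    topk = sim.topk(k, dim=-1).indices          # (R, k) fact ids per region
    return fact_feats[topk].flatten(0, 1)       # (R*k, D) retrieved fact features

class InstructionGuidedAggregation(nn.Module):
    """Instruction tokens attend over visual and fact cues of one candidate view."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, instr_tokens, view_feats, fact_feats):
        # instr_tokens: (B, L, D), view_feats: (B, V, D), fact_feats: (B, F, D)
        cues = torch.cat([view_feats, fact_feats], dim=1)   # (B, V+F, D)
        fused, _ = self.attn(instr_tokens, cues, cues)       # (B, L, D)
        return self.score(fused.mean(dim=1)).squeeze(-1)     # (B,) candidate score
```

Under this reading, each navigable candidate receives a score from the aggregation module, and the highest-scoring candidate is selected as the next action.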
Related papers
- Augmented Commonsense Knowledge for Remote Object Grounding [67.30864498454805]
We propose an augmented commonsense knowledge model (ACK) to leverage commonsense information as a temporal knowledge graph for improving agent navigation.
ACK consists of knowledge graph-aware cross-modal and concept aggregation modules to enhance visual representation and visual-textual data alignment.
We add a new pipeline for the commonsense-based decision-making process which leads to more accurate local action prediction.
arXiv Detail & Related papers (2024-06-03T12:12:33Z)
- TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation [11.591176410027224]
This paper presents a Vision-Language Navigation (VLN) agent based on Large Language Models (LLMs).
We propose the Thinking, Interacting, and Action framework to compensate for the shortcomings of LLMs in environmental perception.
Our approach also outperformed some supervised learning-based methods, highlighting its efficacy in zero-shot navigation.
arXiv Detail & Related papers (2024-03-13T05:22:39Z)
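For the TINA entry above, a minimal think-interact-act loop can make the idea concrete. Everything in this sketch is an assumption for illustration: the prompt wording, the `observe`, `llm`, and `vqa` callables, and the discrete action set are not taken from the paper.

```python
from typing import Callable, List

ACTIONS = ["move_forward", "turn_left", "turn_right", "stop"]

def navigate(instruction: str,
             observe: Callable[[], str],   # returns a textual description of the current scene
             llm: Callable[[str], str],    # returns the LLM's reply to a prompt
             vqa: Callable[[str], str],    # answers a question about the current view
             max_steps: int = 20) -> List[str]:
    trajectory = []
    for _ in range(max_steps):
        scene = observe()
        # Think: ask the LLM what visual detail still needs to be verified.
        question = llm(f"Instruction: {instruction}\nScene: {scene}\n"
                       "What should we check in the current view before acting?")
        # Interact: query a perception module to compensate for the LLM's lack of vision.
        answer = vqa(question)
        # Act: let the LLM choose the next action from a fixed set.
        action = llm(f"Instruction: {instruction}\nScene: {scene}\n"
                     f"Checked: {question} -> {answer}\n"
                     f"Choose one action from {ACTIONS}.").strip()
        if action not in ACTIONS:
            action = "stop"
        trajectory.append(action)
        if action == "stop":
            break
    return trajectory
```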
- Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [16.32780793344835]
We propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation.
Our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language description with visual perception.
The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability.
arXiv Detail & Related papers (2024-02-29T06:31:18Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigation-specific visual representation learning method that contrasts the agent's egocentric views and semantic maps.
Ego$^2$-Map learning transfers the compact and rich information from a map, such as objects, structure, and transitions, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z)
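For the Ego$^2$-Map entry above, a standard symmetric InfoNCE loss is one common way to realize "contrasting egocentric views and semantic maps." The encoders are omitted, and the temperature and loss form are conventional choices assumed here for illustration rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def ego_map_contrastive_loss(ego_emb, map_emb, temperature=0.07):
    """ego_emb, map_emb: (B, D) paired embeddings of an egocentric view and its semantic map."""
    ego = F.normalize(ego_emb, dim=-1)
    maps = F.normalize(map_emb, dim=-1)
    logits = ego @ maps.T / temperature          # (B, B) view-to-map similarity matrix
    targets = torch.arange(ego.size(0), device=ego.device)
    # Symmetric cross-entropy: each view should match its own map, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```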
- Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation [87.52136927091712]
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both object fine-grained details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
arXiv Detail & Related papers (2022-10-14T04:23:27Z)
- PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning [125.22462763376993]
We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI).
PONI disentangles the skills of 'where to look?' for an object and 'how to navigate to (x, y)?'
arXiv Detail & Related papers (2022-01-25T01:07:32Z)
- Embodied Learning for Lifelong Visual Perception [33.02424587900808]
We study lifelong visual perception in an embodied setup, where we develop new models and compare various agents that navigate in buildings.
The purpose of the agents is to recognize objects and other semantic classes in the whole building at the end of a process that combines exploration and active visual learning.
arXiv Detail & Related papers (2021-12-28T10:47:13Z)
- Visual Navigation with Spatial Attention [26.888916048408895]
This work focuses on object goal visual navigation, aiming at finding the location of an object from a given class.
We propose to learn the agent's policy using a reinforcement learning algorithm.
Our key contribution is a novel attention probability model for visual navigation tasks.
arXiv Detail & Related papers (2021-04-20T07:39:52Z)
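For the spatial-attention entry above, the sketch below shows one common form such an attention probability model can take for object-goal navigation: a target-class embedding queries a CNN feature map to produce a probability map over spatial locations, and the attended feature feeds actor and critic heads for reinforcement-learning training. The conditioning scheme, dimensions, and actor-critic layout are assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionPolicy(nn.Module):
    def __init__(self, feat_dim=256, class_dim=64, n_actions=6):
        super().__init__()
        self.query = nn.Linear(class_dim, feat_dim)   # target class -> attention query
        self.actor = nn.Linear(feat_dim, n_actions)   # action logits
        self.critic = nn.Linear(feat_dim, 1)          # state value (e.g., for A2C-style RL)

    def forward(self, feat_map, class_emb):
        # feat_map: (B, C, H, W) CNN features; class_emb: (B, class_dim) target object class
        B, C, H, W = feat_map.shape
        feats = feat_map.flatten(2).transpose(1, 2)           # (B, H*W, C)
        q = self.query(class_emb).unsqueeze(-1)               # (B, C, 1)
        attn = F.softmax(feats @ q / C ** 0.5, dim=1)          # (B, H*W, 1) spatial probability map
        pooled = (attn * feats).sum(dim=1)                     # (B, C) attention-weighted feature
        return self.actor(pooled), self.critic(pooled)
```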
This list is automatically generated from the titles and abstracts of the papers in this site.