Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation
- URL: http://arxiv.org/abs/2403.11541v1
- Date: Mon, 18 Mar 2024 07:51:22 GMT
- Title: Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation
- Authors: Ming Xu, Zilong Xie
- Abstract summary: Most Vision-and-Language Navigation (VLN) algorithms tend to make decision errors, primarily due to a lack of visual common sense and insufficient reasoning capabilities.
This paper proposes a Hierarchical Spatial Proximity Reasoning (HSPR) model to address this issue.
We conduct experiments on publicly available datasets including REVERIE, SOON, R2R, and R4R to validate the effectiveness of the proposed approach.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most Vision-and-Language Navigation (VLN) algorithms tend to make decision errors, primarily due to a lack of visual common sense and insufficient reasoning capabilities. To address this issue, this paper proposes a Hierarchical Spatial Proximity Reasoning (HSPR) model. Firstly, we design a Scene Understanding Auxiliary Task (SUAT) to assist the agent in constructing a knowledge base of hierarchical spatial proximity for reasoning navigation. Specifically, this task utilizes panoramic views and object features to identify regions in the navigation environment and uncover the adjacency relationships between regions, objects, and region-object pairs. Secondly, we dynamically construct a semantic topological map through agent-environment interactions and propose a Multi-step Reasoning Navigation Algorithm (MRNA) based on the map. This algorithm continuously plans various feasible paths from one region to another, utilizing the constructed proximity knowledge base, enabling more efficient exploration. Additionally, we introduce a Proximity Adaptive Attention Module (PAAM) and Residual Fusion Method (RFM) to enable the model to obtain more accurate navigation decision confidence. Finally, we conduct experiments on publicly available datasets including REVERIE, SOON, R2R, and R4R to validate the effectiveness of the proposed approach.
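The abstract describes the pipeline at a high level only. The following is a minimal sketch, under loose assumptions, of how a proximity knowledge base and a multi-step planner over a semantic topological map could interact; `ProximityKB`, `TopoMap`, and `plan_paths` are illustrative names, not the paper's implementation.

```python
from collections import defaultdict
from heapq import heappop, heappush

class ProximityKB:
    """Toy proximity knowledge base: counts how often two entities
    (regions or objects) have been observed adjacent to each other."""
    def __init__(self):
        self.counts = defaultdict(float)

    def observe(self, a, b):
        self.counts[tuple(sorted((a, b)))] += 1.0

    def proximity(self, a, b):
        c = self.counts[tuple(sorted((a, b)))]
        return c / (1.0 + c)  # bounded score in [0, 1)

class TopoMap:
    """Minimal semantic topological map built during exploration."""
    def __init__(self):
        self.neighbors = defaultdict(list)  # node -> adjacent nodes
        self.region = {}                    # node -> region label

    def add_edge(self, a, b):
        self.neighbors[a].append(b)
        self.neighbors[b].append(a)

def plan_paths(tmap, start, goal_region, kb, k=3):
    """Best-first search for up to k feasible paths toward a goal region.
    Crossing a region pair the KB believes is adjacent costs less."""
    frontier = [(0.0, [start])]
    found = []
    while frontier and len(found) < k:
        cost, path = heappop(frontier)
        node = path[-1]
        if tmap.region.get(node) == goal_region:
            found.append((cost, path))
            continue
        for nxt in tmap.neighbors[node]:
            if nxt not in path:  # avoid revisiting within one path
                step = 1.0 - kb.proximity(tmap.region.get(node),
                                          tmap.region.get(nxt))
                heappush(frontier, (cost + step, path + [nxt]))
    return found

kb = ProximityKB()
kb.observe("hallway", "kitchen")
tmap = TopoMap()
tmap.add_edge(0, 1)
tmap.region = {0: "hallway", 1: "kitchen"}
print(plan_paths(tmap, 0, "kitchen", kb, k=1))  # [(0.5, [0, 1])]
```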
Related papers
- SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation [83.4599149936183]
Existing zero-shot object navigation methods prompt the LLM with text listing spatially close objects.
We instead propose to represent the observed scene with a 3D scene graph.
We conduct extensive experiments on MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks.
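As a rough illustration of the scene-graph-to-LLM idea (not SG-Nav's actual prompting code), one might serialize graph edges into natural-language relations before querying the model:

```python
def scene_graph_prompt(nodes, edges, goal):
    """Serialize a (hypothetical) 3D scene graph into an LLM prompt.
    nodes: {id: category}; edges: [(id_a, relation, id_b)]."""
    relations = ". ".join(f"the {nodes[a]} is {rel} the {nodes[b]}"
                          for a, rel, b in edges)
    return (f"Observed scene: {relations}. "
            f"To find a {goal}, which object should I move toward? "
            f"Answer with one object category.")

print(scene_graph_prompt(
    {0: "sofa", 1: "tv", 2: "lamp"},
    [(0, "next to", 1), (2, "behind", 0)],
    "remote control"))
```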
arXiv Detail & Related papers (2024-10-10T17:57:19Z) - PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation [30.710806048991923]
Vision and language navigation is a task that requires an agent to navigate according to a natural language instruction.
Recent methods predict sub-goals on a constructed topology map at each step to enable long-term action planning.
We propose an alternative method that facilitates navigation planning by considering the alignment between instructions and directed fidelity trajectories.
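A minimal sketch of instruction-trajectory alignment framed as a scoring problem; the embedding vectors and names below are assumptions, not PRET's learned alignment:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)

def best_trajectory(instruction_vec, candidates):
    """candidates: list of (path, direction_feature_vec) pairs. Picks the
    directed trajectory whose features best align with the instruction
    embedding -- a stand-in for a learned alignment score."""
    return max(candidates, key=lambda c: cosine(instruction_vec, c[1]))

print(best_trajectory([1.0, 0.0],
                      [(["A", "B"], [0.9, 0.1]), (["A", "C"], [0.2, 0.8])]))
```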
arXiv Detail & Related papers (2024-07-16T08:22:18Z) - Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [16.32780793344835]
We propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation.
Our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language descriptions with visual perception.
The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability.
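A toy rendering of the alignment idea, assuming embeddings come from visual-language pre-trained encoders (CLIP-style); the paper's actual training objective is not reproduced here:

```python
import math

def align_scores(obs_vec, node_vecs):
    """Softmax over dot products between one visual observation embedding
    and language embeddings of knowledge-graph nodes; the encoders are
    assumed to be visual-language pre-trained."""
    logits = [sum(o * n for o, n in zip(obs_vec, nv)) for nv in node_vecs]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

print(align_scores([0.2, 0.8], [[0.1, 0.9], [0.9, 0.1]]))
```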
arXiv Detail & Related papers (2024-02-29T06:31:18Z) - Probable Object Location (POLo) Score Estimation for Efficient Object Goal Navigation [15.623723522165731]
We introduce a novel framework centered around the Probable Object Location (POLo) score.
We further enhance the framework's practicality by introducing POLoNet, a neural network trained to approximate the computationally intensive POLo score.
Our experiments, involving the first phase of the OVMM 2023 challenge, demonstrate that an agent equipped with POLoNet significantly outperforms a range of baseline methods.
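The sketch below captures the spirit of a probability-of-object-location score used as a greedy view-selection criterion; the exact POLo definition in the paper is more involved, and all names here are illustrative:

```python
def polo_score(view, belief, visible):
    """Probability mass over candidate object locations that a viewpoint
    would observe. belief: {cell: P(object at cell)};
    visible(view) returns the set of cells seen from the view."""
    return sum(belief.get(c, 0.0) for c in visible(view))

def next_view(views, belief, visible):
    """Greedy view selection: go where the expected mass is largest."""
    return max(views, key=lambda v: polo_score(v, belief, visible))

belief = {(0, 0): 0.1, (0, 1): 0.6, (2, 2): 0.3}
visible = lambda v: {v, (v[0], v[1] + 1)}  # toy 2-cell field of view
print(next_view([(0, 0), (2, 2)], belief, visible))  # (0, 0) sees 0.7
```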
arXiv Detail & Related papers (2023-11-14T08:45:32Z) - KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation [61.08389704326803]
Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location by following natural language instructions in real scenes.
Most previous approaches utilize entire-view features or object-centric features to represent navigable candidates.
We propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability.
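As a toy illustration of knowledge-enhanced candidates (KERM's retrieval and fusion are learned, not a dictionary lookup, and this fact table is invented):

```python
FACTS = {  # toy external knowledge; real systems retrieve from large KBs
    "kettle": ["found in kitchens", "sits on counters"],
    "towel": ["found in bathrooms", "hangs near sinks"],
}

def enrich_candidate(detected_objects):
    """Attach retrieved facts to the objects seen at a navigable
    candidate, a rough analogue of knowledge-enhanced representations."""
    return {obj: FACTS.get(obj, []) for obj in detected_objects}

print(enrich_candidate(["kettle", "plant"]))
# {'kettle': ['found in kitchens', 'sits on counters'], 'plant': []}
```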
arXiv Detail & Related papers (2023-03-28T08:00:46Z) - Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding [16.784045122994506]
We propose a hierarchical navigation method deploying an exploitation policy to correct misled recent actions.
We show that an exploitation policy, which moves the agent toward a well-chosen local goal, outperforms a method which moves the agent to a previously visited state.
We present a novel visual representation, called scene object spectrum (SOS), which performs a category-wise 2D Fourier transform of detected objects.
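Since SOS is described concretely as a category-wise 2D Fourier transform, a minimal NumPy sketch is possible; the binary mask input and magnitude-only output are assumptions:

```python
import numpy as np

def scene_object_spectrum(masks):
    """Category-wise 2D Fourier transform of binary detection masks.
    masks: {category: HxW array of 0/1 detections}; returns the
    magnitude spectrum per category (phase is discarded here)."""
    return {cat: np.abs(np.fft.fft2(m)) for cat, m in masks.items()}

mask = np.zeros((8, 8))
mask[2:5, 3:6] = 1.0  # toy "chair" detection mask
spectrum = scene_object_spectrum({"chair": mask})
print(spectrum["chair"].shape)  # (8, 8)
```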
arXiv Detail & Related papers (2023-03-07T17:39:53Z) - Explore before Moving: A Feasible Path Estimation and Memory Recalling Framework for Embodied Navigation [117.26891277593205]
We focus on navigation and address the problem that existing navigation algorithms lack experience and common sense.
Inspired by the human ability to think twice before moving and to conceive several feasible paths when seeking a goal in unfamiliar scenes, we present a route planning method named the Path Estimation and Memory Recalling (PEMR) framework.
We show strong experimental results of PEMR on the EmbodiedQA navigation task.
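A minimal sketch of the memory-recall half of this idea, with hypothetical names; PEMR itself learns these estimates rather than counting outcomes:

```python
class PathMemory:
    """Toy memory over traversed (state, action) pairs, recalled to bias
    the agent toward moves that previously led to the goal."""
    def __init__(self):
        self.successes = {}  # (state, action) -> success count

    def record(self, state, action, reached_goal):
        key = (state, action)
        self.successes[key] = self.successes.get(key, 0) + int(reached_goal)

    def recall(self, state, actions):
        """Among feasible actions, prefer the one that succeeded most."""
        return max(actions, key=lambda a: self.successes.get((state, a), 0))

memory = PathMemory()
memory.record("hallway", "turn_left", True)
print(memory.recall("hallway", ["turn_left", "go_forward"]))  # turn_left
```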
arXiv Detail & Related papers (2021-10-16T13:30:55Z) - SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of instructions that depict the route step by step.
This setting deviates from real-world problems in which a human only describes what the object and its surroundings look like and asks the robot to start navigating from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z) - Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a structured scene memory architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
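A toy sketch of a per-location memory that keeps visual and geometric cues in separate slots, loosely following the "disentangled" description above (names are illustrative):

```python
class SceneMemory:
    """Per-location memory keeping visual and geometric cues in
    separate, independently readable slots."""
    def __init__(self):
        self.visual = {}    # node -> appearance feature
        self.geometry = {}  # node -> (x, y, heading)

    def write(self, node, feature, pose):
        self.visual[node] = feature
        self.geometry[node] = pose

    def read(self, node):
        return self.visual.get(node), self.geometry.get(node)

mem = SceneMemory()
mem.write("n0", [0.3, 0.7], (1.0, 2.0, 90.0))
print(mem.read("n0"))
```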
arXiv Detail & Related papers (2021-03-05T03:41:00Z) - Neural Topological SLAM for Visual Navigation [112.73876869904]
We design topological representations for space that leverage semantics and afford approximate geometric reasoning.
We describe supervised learning-based algorithms that can build, maintain and use such representations under noisy actuation.
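A minimal sketch of the topological flavor of this approach: nodes carry rough pose estimates, and re-localization is an approximate geometric check that tolerates noisy actuation (the threshold and names are assumptions):

```python
import math

class TopoSLAM:
    """Topological map whose nodes carry rough pose estimates; adding an
    observation near an existing node re-localizes instead of growing
    the graph."""
    def __init__(self, merge_radius=1.0):
        self.poses = {}  # node id -> estimated (x, y)
        self.merge_radius = merge_radius

    def add(self, pose):
        for node, known in self.poses.items():
            if math.dist(known, pose) < self.merge_radius:
                return node  # approximate geometric re-localization
        node = len(self.poses)
        self.poses[node] = pose
        return node

slam = TopoSLAM()
print(slam.add((0.0, 0.0)), slam.add((0.3, 0.2)), slam.add((5.0, 5.0)))
# 0 0 1  -- the second noisy reading merges into node 0
```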
arXiv Detail & Related papers (2020-05-25T17:56:29Z) - Learning hierarchical relationships for object-goal navigation [7.074818959144171]
We present Memory-utilized Joint hierarchical Object Learning for Navigation in Indoor Rooms (MJOLNIR).
MJOLNIR is a target-driven navigation algorithm that considers the inherent relationship between target objects and the more salient contextual objects occurring in their surroundings.
Our model learns to converge much faster than other algorithms, without suffering from the well-known overfitting problem.
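A toy stand-in for scoring views by target-related context objects; MJOLNIR learns these relationships with a graph network rather than using a fixed table like the one below:

```python
PARENTS = {  # toy hierarchy: small target -> salient contextual object
    "remote": "sofa",
    "pillow": "sofa",
    "mug": "table",
}

def context_score(target, visible_objects):
    """Score a view by whether it shows the target itself or a context
    object the hierarchy relates to the target."""
    if target in visible_objects:
        return 2.0
    return 1.0 if PARENTS.get(target) in visible_objects else 0.0

print(context_score("remote", {"sofa", "lamp"}))  # 1.0: sofa hints at remote
```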
arXiv Detail & Related papers (2020-03-15T04:01:09Z)