Object Navigation with Structure-Semantic Reasoning-Based Multi-level Map and Multimodal Decision-Making LLM
- URL: http://arxiv.org/abs/2506.05896v1
- Date: Fri, 06 Jun 2025 09:08:40 GMT
- Title: Object Navigation with Structure-Semantic Reasoning-Based Multi-level Map and Multimodal Decision-Making LLM
- Authors: Chongshang Yan, Jiaxuan He, Delun Li, Yi Yang, Wenjie Song
- Abstract summary: We propose an active object navigation framework with Environmental Attributes Map (EAM) and MLLM Hierarchical Reasoning module (MHR). EAM is constructed by reasoning observed environments with SBERT and predicting unobserved ones with Diffusion. MHR is inspired by EAM to perform frontier exploration decision-making, avoiding the circuitous trajectories in long-range scenarios to improve path efficiency.
- Score: 18.406869393228813
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Zero-shot object navigation (ZSON) in unknown, open-ended environments with semantically novel targets often suffers a significant decline in performance because high-dimensional implicit scene information is neglected and targets must be searched over long ranges. To address this, we propose an active object navigation framework with an Environmental Attributes Map (EAM) and an MLLM Hierarchical Reasoning module (MHR) to improve success rate and efficiency. The EAM is constructed by reasoning about observed environments with SBERT and predicting unobserved ones with Diffusion, exploiting regularities of human spaces that underlie object-room correlations and area adjacencies. The MHR draws on the EAM to perform frontier exploration decision-making, avoiding circuitous trajectories in long-range scenarios and improving path efficiency. Experimental results demonstrate that the EAM module achieves 64.5% scene mapping accuracy on the MP3D dataset, while the navigation task attains SPLs of 28.4% and 26.3% on the HM3D and MP3D benchmarks respectively, representing absolute improvements of 21.4% and 46.0% over baseline methods.
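As a rough illustration of the object-room correlation idea behind the EAM (the checkpoint name, room labels, and scoring below are our assumptions, not the authors' implementation), SBERT embeddings can rank candidate frontier regions by how semantically related their room label is to the target object:

```python
# Hedged sketch, not the paper's code: rank frontier regions by SBERT
# similarity between the target object and each region's room label.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT checkpoint works

target = "television"
frontier_rooms = ["living room", "bathroom", "hallway", "kitchen"]  # labels a mapper might assign

target_emb = model.encode(target, convert_to_tensor=True)
room_embs = model.encode(frontier_rooms, convert_to_tensor=True)

# Cosine similarity as a crude object-room correlation prior.
scores = util.cos_sim(target_emb, room_embs).squeeze(0)
best = int(scores.argmax())
print(f"Explore frontier in: {frontier_rooms[best]} (score={scores[best].item():.3f})")
```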
Related papers
- MSSDF: Modality-Shared Self-supervised Distillation for High-Resolution Multi-modal Remote Sensing Image Learning [25.381211868583826]
We propose a multi-modal self-supervised learning framework that leverages high-resolution RGB images, multi-spectral data, and digital surface models (DSM) for pre-training. We evaluate the proposed method on multiple downstream tasks, covering typical remote sensing applications such as scene classification, semantic segmentation, change detection, object detection, and depth estimation.
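To make the distillation idea concrete, here is a minimal sketch under our own assumptions about the architecture and loss (the paper's actual design may differ): a DSM encoder's features are pulled toward detached RGB features with a cosine objective.

```python
# Illustrative sketch only, not MSSDF's implementation: align two modality
# encoders with a simple modality-shared distillation objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    def __init__(self, in_ch: int, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, x):
        return self.net(x)

rgb_enc, dsm_enc = TinyEncoder(3), TinyEncoder(1)
rgb = torch.randn(4, 3, 64, 64)   # high-resolution RGB patch
dsm = torch.randn(4, 1, 64, 64)   # digital surface model of the same patch

# Distillation: pull the DSM features toward the (detached) RGB features.
z_rgb = F.normalize(rgb_enc(rgb), dim=1).detach()
z_dsm = F.normalize(dsm_enc(dsm), dim=1)
loss = 1.0 - (z_rgb * z_dsm).sum(dim=1).mean()  # cosine distillation loss
loss.backward()
print(float(loss))
```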
arXiv Detail & Related papers (2025-06-11T02:01:36Z) - Diffusion as Reasoning: Enhancing Object Goal Navigation with LLM-Biased Diffusion Model [9.939998139837426]
We propose a new approach to solving the ObjectNav task, by training a diffusion model to learn the statistical distribution patterns of objects in semantic maps.
We also propose the global target bias and local LLM bias methods, where the former can constrain the diffusion model to generate the target object more effectively.
Based on the generated map in the unknown region, the agent sets the predicted location of the target as the goal and moves towards it.
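A minimal sketch of that final step, with a random array standing in for the diffusion model's generated semantic map (class indices, map size, and the unknown-region mask are hypothetical):

```python
# Sketch under stated assumptions, not the paper's code: pick the most likely
# target cell in the unknown region of a generated semantic map as the goal.
import numpy as np

num_classes, H, W = 5, 64, 64
target_class = 3  # e.g. the "bed" channel; indices are hypothetical

# Stand-in for the diffusion model's output: per-class probabilities per cell.
rng = np.random.default_rng(0)
generated_map = rng.random((num_classes, H, W))
generated_map /= generated_map.sum(axis=0, keepdims=True)

# Cells the agent has not yet observed (here: the right half of the map).
unknown = np.zeros((H, W), dtype=bool)
unknown[:, W // 2:] = True

# Goal = unknown cell with the highest predicted target probability.
target_prob = np.where(unknown, generated_map[target_class], -np.inf)
goal = np.unravel_index(np.argmax(target_prob), target_prob.shape)
print("Predicted target location (row, col):", goal)
```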
arXiv Detail & Related papers (2024-10-29T08:10:06Z) - Semantic Environment Atlas for Object-Goal Navigation [12.057544558656035]
We introduce the Semantic Environment Atlas (SEA), a novel mapping approach designed to enhance visual navigation capabilities of embodied agents.
The SEA integrates multiple semantic maps from various environments, retaining a memory of place-object relationships.
Our method not only achieves a success rate of 39.0%, an improvement of 12.4% over the current state-of-the-art, but also maintains robustness under noisy odometry and actuation conditions.
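The place-object memory can be illustrated with a toy co-occurrence table (our construction, not the SEA implementation), which ranks candidate places for a queried target object:

```python
# Hedged sketch of a place-object memory: accumulate which objects were seen
# in which place category, then rank places for a target object.
from collections import defaultdict

class PlaceObjectMemory:
    def __init__(self):
        # counts[place][object] = how often the object was observed in that place
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, place: str, objects: list[str]) -> None:
        for obj in objects:
            self.counts[place][obj] += 1

    def rank_places(self, target: str) -> list[tuple[str, float]]:
        scores = []
        for place, objs in self.counts.items():
            total = sum(objs.values())
            scores.append((place, objs.get(target, 0) / total))
        return sorted(scores, key=lambda s: s[1], reverse=True)

memory = PlaceObjectMemory()
memory.observe("kitchen", ["sink", "refrigerator", "cup"])
memory.observe("living room", ["sofa", "tv", "cup"])
memory.observe("kitchen", ["refrigerator", "oven"])
print(memory.rank_places("refrigerator"))  # the kitchen should rank first
```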
arXiv Detail & Related papers (2024-10-05T00:37:15Z) - Learning Spatial-Semantic Features for Robust Video Object Segmentation [108.045326229865]
We propose a robust video object segmentation framework that learns spatial-semantic features and discriminative object queries. The proposed method achieves state-of-the-art performance on benchmark datasets, including the DAVIS 2017 test set (87.8%), YouTube-VOS 2019 (88.1%), MOSE val (74.0%), and LVOS test (73.0%).
arXiv Detail & Related papers (2024-07-10T15:36:00Z) - Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [64.84996994779443]
We propose a novel Affordances-Oriented Planner for continuous vision-language navigation (VLN) task.
Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making.
Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-07-08T12:52:46Z) - SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection [59.868772767818975]
We propose a simple yet effective Semi-supervised Oriented Object Detection method termed SOOD++.
Specifically, we observe that objects in aerial images usually have arbitrary orientations and small scales, and tend to aggregate.
Extensive experiments conducted on various multi-oriented object datasets under various labeled settings demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2024-07-01T07:03:51Z) - Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation [1.2473780585666772]
Most Vision-and-Language Navigation (VLN) algorithms are prone to making inaccurate decisions due to their lack of visual common sense and limited reasoning capabilities.
We propose a Hierarchical Spatial Proximity Reasoning (HSPR) method to help the agent build a knowledge base of hierarchical spatial proximity.
We validate our approach with experiments on publicly available datasets including REVERIE, SOON, R2R, and R4R.
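A toy sketch of how such a hierarchical proximity knowledge base might be queried (the hierarchy and probabilities below are made up for illustration, not taken from HSPR):

```python
# Hypothetical example: combine region-to-room and room-to-object proximity
# scores to rank which candidate room to head for when seeking a target.
room_given_region = {
    "sleeping area": {"bedroom": 0.7, "bathroom": 0.3},
    "common area":   {"living room": 0.6, "kitchen": 0.4},
}
object_given_room = {
    "bedroom":     {"pillow": 0.8, "sink": 0.0},
    "bathroom":    {"pillow": 0.0, "sink": 0.9},
    "living room": {"pillow": 0.2, "sink": 0.0},
    "kitchen":     {"pillow": 0.0, "sink": 0.7},
}

def rank_rooms(target: str) -> list[tuple[str, float]]:
    scores = {}
    for region, rooms in room_given_region.items():
        for room, p_room in rooms.items():
            p_obj = object_given_room[room].get(target, 0.0)
            scores[room] = max(scores.get(room, 0.0), p_room * p_obj)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_rooms("sink"))  # the bathroom should rank first
```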
arXiv Detail & Related papers (2024-03-18T07:51:22Z) - Right Place, Right Time! Dynamizing Topological Graphs for Embodied Navigation [55.581423861790945]
Embodied Navigation tasks often involve constructing topological graphs of a scene during exploration. We introduce structured object transitions to dynamize static topological graphs, called Object Transition Graphs (OTGs). OTGs simulate portable targets following structured routes inspired by human habits.
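A toy illustration of the dynamizing idea, under our own simplifications (not the paper's OTG construction): a portable target follows a fixed, habit-like route over a static topological graph, so the node containing the target changes each step.

```python
# Toy sketch: a portable target moving along a structured route on a graph.
import itertools

# Static topological graph of places (nodes) and connections (edges).
graph = {"kitchen": ["hallway"], "hallway": ["kitchen", "office"], "office": ["hallway"]}
# A habit-like structured route for a portable target (e.g. a mug).
route = itertools.cycle(["kitchen", "hallway", "office", "hallway"])

location = next(route)
for step in range(5):
    nxt = next(route)
    assert nxt in graph[location], "route must follow graph edges"
    location = nxt
    print(f"step {step}: mug is in {location}")
```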
arXiv Detail & Related papers (2024-03-14T22:33:22Z) - FIT-SLAM -- Fisher Information and Traversability estimation-based Active SLAM for exploration in 3D environments [1.4474137122906163]
Active visual SLAM finds a wide array of applications in GPS-denied sub-terrain environments and outdoor environments for ground robots. It is imperative to incorporate perception considerations into goal selection and path planning towards the goal during an exploration mission.
We propose FIT-SLAM, a new exploration method tailored for unmanned ground vehicles (UGVs) to explore 3D environments.
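One way to picture the goal-selection trade-off (a hypothetical utility, not FIT-SLAM's actual formulation) is to weight an information-gain term by traversability and penalize travel cost when choosing the next exploration goal:

```python
# Hedged sketch: score candidate exploration goals by an information-gain term
# weighted by estimated traversability, minus a travel-cost penalty.
# All numbers below are made up for illustration.
candidates = {
    "frontier_A": {"info_gain": 3.2, "traversability": 0.9, "distance": 4.0},
    "frontier_B": {"info_gain": 5.1, "traversability": 0.3, "distance": 6.0},
    "frontier_C": {"info_gain": 2.4, "traversability": 0.8, "distance": 2.5},
}

def utility(c: dict, dist_weight: float = 0.1) -> float:
    # Reward expected localization information, penalize poor terrain and travel cost.
    return c["info_gain"] * c["traversability"] - dist_weight * c["distance"]

best = max(candidates, key=lambda name: utility(candidates[name]))
print("Next exploration goal:", best)
```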
arXiv Detail & Related papers (2024-01-17T16:46:38Z) - Comparison of Model-Free and Model-Based Learning-Informed Planning for PointGoal Navigation [10.797100163772482]
We compare state-of-the-art Deep Reinforcement Learning based approaches with Partially Observable Markov Decision Process (POMDP) formulation of the point goal navigation problem.
We show comparable, though slightly worse, performance than the SOTA DD-PPO approach, yet with far less data.
arXiv Detail & Related papers (2022-12-17T05:23:54Z) - Learning Space Partitions for Path Planning [54.475949279050596]
PlaLaM outperforms existing path planning methods in 2D navigation tasks, especially in the presence of difficult-to-escape local optima.
These gains transfer to highly multimodal real-world tasks, where we outperform strong baselines in compiler phase ordering by up to 245% and in molecular design by up to 0.4 on properties on a 0-1 scale.
arXiv Detail & Related papers (2021-06-19T18:06:11Z) - Occupancy Anticipation for Efficient Exploration and Navigation [97.17517060585875]
We propose occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions.
By exploiting context in both the egocentric views and top-down maps our model successfully anticipates a broader map of the environment.
Our approach is the winning entry in the 2020 Habitat PointNav Challenge.
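As a stand-in for the anticipation model (the architecture below is assumed, not the paper's), a small encoder-decoder can map a partial egocentric occupancy map to occupancy logits over the full map extent:

```python
# Illustrative stand-in: predict occupancy beyond the visible region from a
# partial egocentric occupancy map. Architecture and sizes are assumptions.
import torch
import torch.nn as nn

class OccupancyAnticipator(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 2, 4, stride=2, padding=1),
        )

    def forward(self, partial_map):
        # Channels: [occupied, explored]; output logits over the full extent.
        return self.decoder(self.encoder(partial_map))

model = OccupancyAnticipator()
partial = torch.zeros(1, 2, 64, 64)   # unexplored cells are all zeros
partial[:, :, 24:40, 24:40] = 1.0     # a small visible, occupied patch
anticipated = torch.sigmoid(model(partial))
print(anticipated.shape)              # torch.Size([1, 2, 64, 64])
```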
arXiv Detail & Related papers (2020-08-21T03:16:51Z) - Object Goal Navigation using Goal-Oriented Semantic Exploration [98.14078233526476]
This work studies the problem of object goal navigation which involves navigating to an instance of the given object category in unseen environments.
We propose a modular system called 'Goal-Oriented Semantic Exploration' which builds an episodic semantic map and uses it to explore the environment efficiently.
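A minimal sketch of the decision rule this summary suggests (our assumptions, not the released code): navigate to the target if it already appears in the episodic semantic map, otherwise pick a long-term exploration goal in unexplored space.

```python
# Hedged sketch of a goal-oriented exploration decision over a semantic map.
import numpy as np

rng = np.random.default_rng(1)
H = W = 48
target_channel = rng.random((H, W)) > 0.999   # cells where the target was mapped
explored = np.zeros((H, W), dtype=bool)
explored[:, : W // 2] = True                  # left half already explored

def select_goal():
    ys, xs = np.nonzero(target_channel & explored)
    if len(ys) > 0:
        return ("go_to_target", (int(ys[0]), int(xs[0])))
    # No target seen yet: pick a random unexplored cell as the long-term goal.
    uys, uxs = np.nonzero(~explored)
    i = rng.integers(len(uys))
    return ("explore", (int(uys[i]), int(uxs[i])))

print(select_goal())
```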
arXiv Detail & Related papers (2020-07-01T17:52:32Z)