ReasonNavi: Human-Inspired Global Map Reasoning for Zero-Shot Embodied Navigation
- URL: http://arxiv.org/abs/2602.15864v1
- Date: Mon, 26 Jan 2026 19:09:20 GMT
- Title: ReasonNavi: Human-Inspired Global Map Reasoning for Zero-Shot Embodied Navigation
- Authors: Yuzhuo Ao, Anbang Wang, Yu-Wing Tai, Chi-Keung Tang
- Abstract summary: Embodied agents often struggle with efficient navigation because they rely primarily on partial egocentric observations. We introduce ReasonNavi, a human-inspired framework that operationalizes this reason-then-act paradigm by coupling Multimodal Large Language Models (MLLMs) with deterministic planners.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embodied agents often struggle with efficient navigation because they rely primarily on partial egocentric observations, which restrict global foresight and lead to inefficient exploration. In contrast, humans plan using maps: we reason globally first, then act locally. We introduce ReasonNavi, a human-inspired framework that operationalizes this reason-then-act paradigm by coupling Multimodal Large Language Models (MLLMs) with deterministic planners. ReasonNavi converts a top-down map into a discrete reasoning space via room segmentation and candidate target-node sampling. An MLLM is then queried in a multi-stage process to identify the candidate most consistent with the instruction (object, image, or text goal), effectively leveraging the model's semantic reasoning ability while sidestepping its weakness in continuous coordinate prediction. The selected waypoint is grounded into executable trajectories using a deterministic action planner over an online-built occupancy map, while pretrained object detectors and segmenters ensure robust recognition at the goal. This yields a unified zero-shot navigation framework that requires no MLLM fine-tuning, circumvents the brittleness of RL-based policies, and scales naturally with foundation model improvements. Across three navigation tasks, ReasonNavi consistently outperforms prior methods that demand extensive training or heavy scene modeling, offering a scalable, interpretable, and globally grounded solution to embodied navigation. Project page: https://reasonnavi.github.io/
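The reason-then-act pipeline the abstract describes can be sketched in a few steps: sample candidate target nodes from a top-down occupancy map, pick the candidate most consistent with the instruction, then ground that waypoint with a deterministic planner. The following is a minimal, illustrative Python sketch; the function names, the toy grid, and the dummy scoring function (which stands in for the actual multi-stage MLLM query) are assumptions for illustration, not the paper's implementation.

```python
from collections import deque

def sample_candidates(occupancy, stride=2):
    """Sample free cells on a coarse grid as candidate target nodes."""
    h, w = len(occupancy), len(occupancy[0])
    return [(r, c) for r in range(0, h, stride)
                   for c in range(0, w, stride)
                   if occupancy[r][c] == 0]

def select_waypoint(candidates, score_fn):
    """Stand-in for the MLLM query: pick the candidate the scorer
    deems most consistent with the instruction."""
    return max(candidates, key=score_fn)

def plan_path(occupancy, start, goal):
    """Deterministic BFS planner over the occupancy map (0 = free)."""
    h, w = len(occupancy), len(occupancy[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:  # reconstruct path by walking parents back to start
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        r, c = cur
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < h and 0 <= nc < w
                    and occupancy[nr][nc] == 0 and (nr, nc) not in parent):
                parent[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None  # goal unreachable

# Toy top-down map: 0 = free, 1 = obstacle.
grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0]]
cands = sample_candidates(grid)
goal = select_waypoint(cands, score_fn=lambda rc: rc[0] + rc[1])  # dummy scorer
path = plan_path(grid, (0, 0), goal)
```

The key design point the paper argues for survives even in this sketch: the language model only chooses among discrete candidates, while all continuous geometry is handled by the deterministic planner.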
Related papers
- FeudalNav: A Simple Framework for Visual Navigation [7.136542835931238]
We develop a hierarchical framework that decomposes the navigation decision-making process into multiple levels. Our method learns to select subgoals through a simple, transferable waypoint selection network. We show competitive results with a suite of SOTA methods in Habitat AI environments without using any odometry in training or inference.
arXiv Detail & Related papers (2026-01-15T22:10:29Z) - FOM-Nav: Frontier-Object Maps for Object Goal Navigation [65.76906445210112]
FOM-Nav is a framework that enhances exploration efficiency through Frontier-Object Maps and vision-language models. To train FOM-Nav, we automatically construct large-scale navigation datasets from real-world scanned environments. FOM-Nav achieves state-of-the-art performance on the MP3D and HM3D benchmarks, particularly in the navigation efficiency metric SPL.
arXiv Detail & Related papers (2025-11-30T18:16:09Z) - DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation [73.80968452950854]
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework. We propose DAgger Diffusion Navigation (DifNav) as an end-to-end optimized VLN-CE policy.
arXiv Detail & Related papers (2025-08-13T02:51:43Z) - TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation [52.422619828854984]
We introduce TopV-Nav, an MLLM-based method that directly reasons on the top-view map with sufficient spatial information. To fully unlock the MLLM's spatial reasoning potential in the top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method.
arXiv Detail & Related papers (2024-11-25T14:27:55Z) - NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [97.88246428240872]
Vision-and-Language Navigation (VLN), a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we perform parameter-efficient in-domain training to enable self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z) - NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models [17.495162643127003]
We introduce the NavGPT to reveal the reasoning capability of GPT models in complex embodied scenes.
NavGPT takes the textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status.
We show that NavGPT is capable of generating high-quality navigational instructions from observations and actions along a path.
arXiv Detail & Related papers (2023-05-26T14:41:06Z) - How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers [94.46825166907831]
We present a training-free solution to tackle the object goal navigation problem in Embodied AI.
Our method builds a structured scene representation based on the classic visual simultaneous localization and mapping (V-SLAM) framework.
Our method propagates semantics on the scene graphs based on language priors and scene statistics to introduce semantic knowledge to the geometric frontiers.
arXiv Detail & Related papers (2023-05-26T13:38:33Z) - Learning Synthetic to Real Transfer for Localization and Navigational Tasks [7.019683407682642]
Navigation sits at the crossroads of multiple disciplines, combining notions from computer vision, robotics, and control.
This work aims to create, in simulation, a navigation pipeline that can be transferred to the real world with as little effort as possible.
Designing the navigation pipeline raises four main challenges: environment, localization, navigation, and planning.
arXiv Detail & Related papers (2020-11-20T08:37:03Z)