Nav-$R^2$: Dual-Relation Reasoning for Generalizable Open-Vocabulary Object-Goal Navigation
- URL: http://arxiv.org/abs/2512.02400v1
- Date: Tue, 02 Dec 2025 04:21:02 GMT
- Title: Nav-$R^2$: Dual-Relation Reasoning for Generalizable Open-Vocabulary Object-Goal Navigation
- Authors: Wentao Xiang, Haokang Zhang, Tianhang Yang, Zedong Chu, Ruihang Chu, Shichao Xie, Yujian Yuan, Jian Sun, Zhining Gu, Junjie Wang, Xiaolong Wu, Mu Xu, Yujiu Yang
- Abstract summary: Nav-$R^2$ is a framework that explicitly models two types of relationships: target-environment modeling and environment-action planning. Our SA-Mem preserves the most target-relevant and current-observation-relevant features from both temporal and semantic perspectives. Nav-$R^2$ achieves state-of-the-art performance in localizing unseen objects through a streamlined and efficient pipeline.
- Score: 67.68165784193556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object-goal navigation in open-vocabulary settings requires agents to locate novel objects in unseen environments, yet existing approaches suffer from opaque decision-making processes and low success rates on unseen objects. To address these challenges, we propose Nav-$R^2$, a framework that explicitly models two critical types of relationships, target-environment modeling and environment-action planning, through structured Chain-of-Thought (CoT) reasoning coupled with a Similarity-Aware Memory (SA-Mem). We construct a Nav-$R^2$-CoT dataset that teaches the model to perceive the environment, focus on target-related objects in the surrounding context, and finally make future action plans. Our SA-Mem preserves the most target-relevant and current-observation-relevant features from both temporal and semantic perspectives by compressing video frames and fusing historical observations, while introducing no additional parameters. Compared to previous methods, Nav-$R^2$ achieves state-of-the-art performance in localizing unseen objects through a streamlined and efficient pipeline, avoiding overfitting to seen object categories while maintaining real-time inference at 2 Hz. Resources will be made publicly available at https://github.com/AMAP-EAI/Nav-R2.
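The abstract describes SA-Mem only at a high level. As an illustration of the general idea, and not the authors' implementation, the sketch below keeps the k historical frame features most similar to the target embedding and to the current observation, using parameter-free cosine similarity; the function name, `k`, and the `alpha` blend are all hypothetical.

```python
import numpy as np

def sa_mem_select(history, target_emb, current_emb, k=8, alpha=0.5):
    """Parameter-free memory compression sketch (hypothetical, not the
    authors' code): score each cached frame feature by its cosine
    similarity to the target embedding (semantic relevance) and to the
    current observation (temporal relevance), then keep the top-k.

    history:     (T, D) array of per-frame features
    target_emb:  (D,) embedding of the goal description
    current_emb: (D,) embedding of the current observation
    """
    def cos(a, b):
        a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b) + 1e-8)
        return a @ b

    # Blend target relevance and observation relevance; alpha is a guess.
    score = alpha * cos(history, target_emb) + (1 - alpha) * cos(history, current_emb)
    keep = np.argsort(score)[-k:]  # indices of the top-k frames
    keep.sort()                    # preserve temporal order
    return history[keep], keep

# Usage: compress a 64-frame history to 8 frames before feeding the model.
hist = np.random.randn(64, 512).astype(np.float32)
mem, idx = sa_mem_select(hist, np.random.randn(512), np.random.randn(512))
print(mem.shape, idx)  # (8, 512), sorted frame indices
```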
Related papers
- History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation [64.51891404034164]
Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments. Existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. This work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline.
arXiv Detail & Related papers (2025-12-16T09:16:07Z)
- RAVEN: Resilient Aerial Navigation via Open-Set Semantic Memory and Behavior Adaptation [20.730528223747967]
RAVEN is a 3D memory-based, behavior-tree framework for aerial semantic navigation in unstructured outdoor environments. It uses a spatially consistent semantic voxel-ray map as persistent memory, enabling long-horizon planning and avoiding purely reactive behaviors. RAVEN outperforms baselines by 85.25% in simulation and demonstrates its real-world applicability through deployment on an aerial robot in outdoor field tests.
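A persistent semantic voxel memory can be sketched very compactly. The class below is hypothetical and loosely inspired by the abstract, not RAVEN's actual data structure: it fuses per-frame labeled 3D points into voxel-level label confidences that a planner can later query for long-horizon goal candidates.

```python
import numpy as np
from collections import defaultdict

class SemanticVoxelMap:
    """Minimal sketch of a persistent open-set semantic voxel memory
    (hypothetical names and details)."""

    def __init__(self, voxel_size=0.5):
        self.voxel_size = voxel_size
        # voxel index -> {label: accumulated confidence}
        self.voxels = defaultdict(lambda: defaultdict(float))

    def key(self, xyz):
        return tuple(np.floor(np.asarray(xyz) / self.voxel_size).astype(int))

    def update(self, points, labels, scores):
        """Fuse one frame of labeled 3D points into the map."""
        for p, lab, s in zip(points, labels, scores):
            self.voxels[self.key(p)][lab] += s

    def query(self, label, min_conf=1.0):
        """Return voxel centers whose accumulated confidence for `label`
        exceeds a threshold -- usable as long-horizon goal candidates."""
        out = []
        for k, labmap in self.voxels.items():
            if labmap.get(label, 0.0) >= min_conf:
                out.append((np.array(k) + 0.5) * self.voxel_size)
        return out
```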
arXiv Detail & Related papers (2025-09-28T01:43:25Z)
- DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation [73.80968452950854]
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework. We propose DAgger Diffusion Navigation (DifNav) as an end-to-end optimized VLN-CE policy.
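The abstract gives no implementation details; as a reminder of what "DAgger-boosted" generally means, here is a generic DAgger training loop. The `policy`, `expert_action`, and `env` interfaces are stand-ins, not DifNav's actual API.

```python
# Hypothetical sketch of a DAgger-style training loop for a navigation
# policy. Assumed interface: env.reset() -> obs, env.step(a) -> (obs, done),
# policy.act(obs) -> action, policy.fit(dataset) retrains the policy.
def dagger_train(policy, expert_action, env, rounds=10, episodes=20):
    dataset = []
    for _ in range(rounds):
        # 1) Roll out the *current* policy, but label every visited
        #    state with the expert's action (the core DAgger idea).
        for _ in range(episodes):
            obs, done = env.reset(), False
            while not done:
                dataset.append((obs, expert_action(obs)))
                obs, done = env.step(policy.act(obs))
        # 2) Retrain on the aggregated dataset of all rounds so far.
        policy.fit(dataset)
    return policy
```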
arXiv Detail & Related papers (2025-08-13T02:51:43Z)
- RSRNav: Reasoning Spatial Relationship for Image-Goal Navigation [57.197881161006904]
Recent image-goal navigation (ImageNav) methods learn a perception-action policy by separately capturing semantic features of the goal and egocentric images. We propose RSRNav, a simple yet effective method that reasons about the spatial relationships between the goal and current observations and uses them as navigation guidance.
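One common way to relate a goal view to the current view, rather than embedding them independently, is dense cross-correlation. The sketch below is illustrative only and not RSRNav's actual architecture; all names are hypothetical.

```python
import numpy as np

def correlation_guidance(goal_feat, cur_feat):
    """Hypothetical correlation-based guidance sketch: cross-correlate
    goal and current feature maps so the policy sees *where* the goal
    view matches the current view.

    goal_feat, cur_feat: (H, W, D) dense feature maps.
    """
    h, w, d = cur_feat.shape
    g = goal_feat.reshape(-1, d)   # (HW, D) goal descriptors
    c = cur_feat.reshape(-1, d)    # (HW, D) current descriptors
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-8)
    c = c / (np.linalg.norm(c, axis=1, keepdims=True) + 1e-8)
    corr = c @ g.T                 # (HW, HW) all-pairs cosine similarity
    # Max over goal locations: per-pixel evidence that the goal is visible there.
    return corr.max(axis=1).reshape(h, w)
```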
arXiv Detail & Related papers (2025-04-25T00:22:17Z)
- Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation [1.2473780585666772]
Most Vision-and-Language Navigation (VLN) algorithms are prone to making inaccurate decisions due to their lack of visual common sense and limited reasoning capabilities.
We propose a Hierarchical Spatial Proximity Reasoning (HSPR) method to help the agent build a knowledge base of hierarchical spatial proximity.
We validate our approach with experiments on publicly available datasets including REVERIE, SOON, R2R, and R4R.
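The abstract names a "knowledge base of hierarchical spatial proximity" without describing its structure. A minimal, purely hypothetical sketch is a co-occurrence counter nested under a room type: objects seen together in a region accumulate proximity evidence that can later rank where to look.

```python
from collections import defaultdict

class ProximityKB:
    """Hypothetical hierarchical spatial-proximity knowledge base
    (not the paper's actual data structure)."""

    def __init__(self):
        # room type -> (object_a, object_b) -> co-occurrence count
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, room, objects_in_region):
        objs = sorted(set(objects_in_region))
        for i, a in enumerate(objs):
            for b in objs[i + 1:]:
                self.counts[room][(a, b)] += 1

    def proximity(self, room, a, b):
        """Higher means a and b are more often near each other in `room`."""
        return self.counts[room][tuple(sorted((a, b)))]

kb = ProximityKB()
kb.observe("kitchen", ["sink", "mug", "coffee_machine"])
kb.observe("kitchen", ["mug", "coffee_machine"])
print(kb.proximity("kitchen", "coffee_machine", "mug"))  # 2
```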
arXiv Detail & Related papers (2024-03-18T07:51:22Z)
- TAS: A Transit-Aware Strategy for Embodied Navigation with Non-Stationary Targets [55.09248760290918]
We present a novel algorithm for navigation in dynamic scenarios with non-stationary targets. Our Transit-Aware Strategy (TAS) enriches embodied navigation policies with object path information. TAS improves performance in non-stationary environments by rewarding agents for synchronizing their routes with target routes.
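The route-synchronization idea can be illustrated with a simple shaped reward; the sketch below is hypothetical and not the paper's actual reward: with a moving target, the agent is paid for the per-step *change* in agent-target distance, i.e. for tracking the target's route rather than a stale goal location.

```python
import numpy as np

def transit_aware_reward(agent_pos, target_pos, prev_dist):
    """Hypothetical route-synchronization reward sketch."""
    dist = float(np.linalg.norm(np.asarray(agent_pos) - np.asarray(target_pos)))
    reward = prev_dist - dist  # positive when closing the gap to the moving target
    return reward, dist        # carry `dist` forward as next step's prev_dist
```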
arXiv Detail & Related papers (2024-03-14T22:33:22Z)
- Interactive Semantic Map Representation for Skill-based Visual Object Navigation [43.71312386938849]
This paper introduces a new representation of a scene semantic map formed during the embodied agent's interaction with the indoor environment.
We have implemented this representation into a full-fledged navigation approach called SkillTron.
The proposed approach makes it possible to form both intermediate goals for robot exploration and the final goal for object navigation.
arXiv Detail & Related papers (2023-11-07T16:30:12Z)
- Zero-Shot Object Goal Visual Navigation With Class-Independent Relationship Network [3.0820097046465285]
"Zero-shot" means that the target the agent needs to find is not trained during the training phase.
We propose the Class-Independent Relationship Network (CIRN) to address the issue of coupling navigation ability with target features during training.
Our method outperforms the current state-of-the-art approaches in the zero-shot object goal visual navigation task.
arXiv Detail & Related papers (2023-10-15T16:42:14Z)
- SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of instructions that describes the route step by step.
This approach deviates from real-world problems in which a human only describes what the object and its surroundings look like and asks the robot to start navigating from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)
- Object Goal Navigation using Goal-Oriented Semantic Exploration [98.14078233526476]
This work studies the problem of object goal navigation, which involves navigating to an instance of the given object category in unseen environments.
We propose a modular system called 'Goal-Oriented Semantic Exploration', which builds an episodic semantic map and uses it to explore the environment efficiently.
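The high-level control flow of map-based goal-oriented exploration can be sketched in a few lines. This is an illustrative reconstruction under assumptions, not the paper's module: if the goal category already appears in the semantic map, head for it; otherwise pick an unexplored frontier cell.

```python
import numpy as np

def select_longterm_goal(semantic_map, goal_channel, explored):
    """Hypothetical goal-oriented exploration sketch on an episodic map.

    semantic_map: (C, H, W) per-category occupancy scores
    goal_channel: index of the target category
    explored:     (H, W) boolean mask of explored cells
    """
    goal_mask = semantic_map[goal_channel] > 0.5
    if goal_mask.any():
        ys, xs = np.nonzero(goal_mask)
        return int(ys[0]), int(xs[0])  # any detected goal cell
    # Frontier: unexplored cells adjacent to explored ones (4-neighborhood).
    pad = np.pad(explored, 1)
    near = pad[:-2, 1:-1] | pad[2:, 1:-1] | pad[1:-1, :-2] | pad[1:-1, 2:]
    ys, xs = np.nonzero(~explored & near)
    return (int(ys[0]), int(xs[0])) if len(ys) else None
```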
arXiv Detail & Related papers (2020-07-01T17:52:32Z)
- Learning hierarchical relationships for object-goal navigation [7.074818959144171]
We present Memory-utilized Joint hierarchical Object Learning for Navigation in Indoor Rooms (MJOLNIR).
MJOLNIR is a target-driven navigation algorithm that considers the inherent relationship between target objects and the more salient contextual objects occurring in their surroundings.
Our model converges much faster than other algorithms, without suffering from the well-known overfitting problem.
arXiv Detail & Related papers (2020-03-15T04:01:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.