DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation
- URL: http://arxiv.org/abs/2505.21969v3
- Date: Fri, 06 Jun 2025 03:29:36 GMT
- Title: DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation
- Authors: Tianjun Gu, Linfeng Li, Xuhong Wang, Chenghua Gong, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, Xin Tan
- Abstract summary: DORAEMON is a cognitive-inspired framework consisting of Ventral and Dorsal Streams that mimics human navigation capabilities. We evaluate DORAEMON on the HM3D, MP3D, and GOAT datasets, where it achieves state-of-the-art performance on both success rate (SR) and success weighted by path length (SPL) metrics.
- Score: 45.87909960783996
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adaptive navigation in unfamiliar environments is crucial for household service robots but remains challenging due to the need for both low-level path planning and high-level scene understanding. While recent vision-language model (VLM)-based zero-shot approaches reduce dependence on prior maps and scene-specific training data, they face significant limitations: spatiotemporal discontinuity from discrete observations, unstructured memory representations, and insufficient task understanding that leads to navigation failures. We propose DORAEMON (Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation), a novel cognitive-inspired framework whose Ventral and Dorsal Streams mimic human navigation capabilities. The Dorsal Stream implements Hierarchical Semantic-Spatial Fusion and a Topology Map to handle spatiotemporal discontinuities, while the Ventral Stream combines a RAG-VLM and a Policy-VLM to improve decision-making. We also develop Nav-Ensurance to maintain navigation safety and efficiency. We evaluate DORAEMON on the HM3D, MP3D, and GOAT datasets, where it achieves state-of-the-art performance on both success rate (SR) and success weighted by path length (SPL), significantly outperforming existing methods. We also introduce a new evaluation metric (AORI) to better assess navigation intelligence. Comprehensive experiments demonstrate DORAEMON's effectiveness in zero-shot autonomous navigation without prior map building or pre-training.
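For context, SR and SPL are the standard embodied-navigation metrics of Anderson et al. (2018); below is a minimal sketch of how they are computed (generic definitions, not code from the paper):

```python
from typing import Sequence

def success_rate(success: Sequence[bool]) -> float:
    """SR: fraction of episodes in which the agent reaches the goal."""
    return sum(success) / len(success)

def spl(success: Sequence[bool], shortest: Sequence[float],
        taken: Sequence[float]) -> float:
    """SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is success,
    l_i the shortest-path length, and p_i the length of the path taken."""
    return sum(s * l / max(p, l)
               for s, l, p in zip(success, shortest, taken)) / len(success)

# Two episodes: one success via a 25% longer-than-optimal path, one failure.
print(success_rate([True, False]))                  # 0.5
print(spl([True, False], [4.0, 6.0], [5.0, 9.0]))   # 0.4
```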
Related papers
- History-Augmented Vision-Language Models for Frontier-Based Zero-Shot Object Navigation [5.343932820859596]
This paper introduces a novel zero-shot ObjectNav framework that pioneers the use of dynamic, history-aware prompting.
Our core innovation lies in providing the VLM with action history context, enabling it to generate semantic guidance scores for navigation actions.
We also introduce a VLM-assisted waypoint generation mechanism for refining the final approach to detected objects.
arXiv Detail & Related papers (2025-06-19T21:50:16Z)
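As a rough illustration of the dynamic, history-aware prompting idea, the sketch below assembles recent actions and candidate frontiers into a VLM query; the prompt format, history window, and scoring scale are illustrative assumptions, not the paper's template:

```python
def build_prompt(goal: str, action_history: list[str],
                 frontiers: list[str]) -> str:
    """Expose the agent's recent actions in the prompt so the VLM can
    avoid re-scoring frontiers that past moves already ruled out."""
    history = "; ".join(action_history[-5:]) or "none"
    options = "\n".join(f"{i}: {f}" for i, f in enumerate(frontiers))
    return (
        f"Task: find a {goal}.\n"
        f"Recent actions: {history}.\n"
        f"Candidate frontiers:\n{options}\n"
        "Score each candidate from 0 to 10 for how likely it leads to the goal."
    )

print(build_prompt("mug", ["forward", "turn_left", "forward"],
                   ["doorway ahead", "cluttered corner"]))
```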
- Hierarchical Reinforcement Learning for Safe Mapless Navigation with Congestion Estimation [7.339743259039457]
This paper introduces a safe mapless navigation framework utilizing hierarchical reinforcement learning (HRL) to enhance navigation through congested areas.
The findings demonstrate that our HRL-based navigation framework excels in both static and dynamic scenarios.
We implement the HRL-based navigation framework on a TurtleBot3 robot for physical validation experiments.
arXiv Detail & Related papers (2025-03-15T08:03:50Z)
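A skeletal two-timescale control loop conveys the usual HRL decomposition, with a high-level policy choosing subgoals and a low-level policy tracking them; all interfaces below are hypothetical stand-ins (the paper's subgoal space and congestion estimator are not reproduced):

```python
import random

class ToyEnv:
    """Minimal stand-in environment; the paper validates on a TurtleBot3."""
    def reset(self):
        self.steps = 0
        return {"scan": [1.0] * 8}                      # fake range readings
    def step(self, cmd):
        self.steps += 1
        return {"scan": [1.0] * 8}, self.steps >= 30    # (obs, done)

class HighLevelPolicy:
    """Coarse timescale: picks a subgoal; a learned version would score
    waypoints and penalize congested regions."""
    def select_subgoal(self, obs):
        return (random.uniform(-2, 2), random.uniform(-2, 2))

class LowLevelPolicy:
    """Fine timescale: velocity commands that track the current subgoal."""
    def act(self, obs, subgoal):
        return (0.2, 0.0)                               # (linear, angular)

def run_episode(env, high, low, subgoal_horizon=10):
    obs = env.reset()
    done, t = False, 0
    while not done:
        if t % subgoal_horizon == 0:    # high level acts every k steps
            subgoal = high.select_subgoal(obs)
        obs, done = env.step(low.act(obs, subgoal))
        t += 1

run_episode(ToyEnv(), HighLevelPolicy(), LowLevelPolicy())
```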
- Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments [41.75629159747654]
We introduce Vision-Language Attention Distillation (Vi-LAD), a novel approach for distilling socially compliant navigation knowledge.
Vi-LAD fine-tunes a transformer-based model using intermediate attention maps extracted from a pre-trained vision-action model.
We validate our approach through real-world experiments on a Husky wheeled robot, demonstrating significant improvements over state-of-the-art navigation methods.
arXiv Detail & Related papers (2025-03-12T20:38:23Z)
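The distillation step can be sketched as a loss pulling the student's intermediate attention maps toward the teacher's; the normalization and weighting below are assumptions rather than Vi-LAD's actual configuration:

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn: torch.Tensor,
                                teacher_attn: torch.Tensor) -> torch.Tensor:
    """MSE between row-normalized attention maps of shape (B, heads, N, N)."""
    s = student_attn / (student_attn.sum(dim=-1, keepdim=True) + 1e-8)
    t = teacher_attn / (teacher_attn.sum(dim=-1, keepdim=True) + 1e-8)
    return F.mse_loss(s, t)

def total_loss(task_loss, student_attn, teacher_attn, alpha: float = 0.5):
    """Combined objective: task loss plus a weighted distillation term."""
    return task_loss + alpha * attention_distillation_loss(student_attn,
                                                           teacher_attn)
```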
Vision-and-Language Navigation (VLN), a crucial research problem in Embodied AI, requires an embodied agent to navigate complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN to improve navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), which performs parameter-efficient in-domain training to enable self-guided navigational decision-making.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
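A toy chain-of-thought prompt conveys the flavor of self-guided navigational reasoning, walking the model from an imagined next observation to an action choice; this template is illustrative and does not reproduce NavCoT's actual format or its parameter-efficient training:

```python
COT_TEMPLATE = """Instruction: {instruction}
Candidate viewpoints: {candidates}
Think step by step:
1. What should the next observation look like if I follow the instruction?
2. Which candidate best matches that expectation?
3. Therefore, the action is:"""

prompt = COT_TEMPLATE.format(
    instruction="Walk past the sofa and stop at the kitchen door.",
    candidates="A) hallway, B) sofa ahead, C) kitchen doorway",
)
print(prompt)
```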
- VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model [28.79971953667143]
VoroNav is a semantic exploration framework that extracts exploratory paths and planning nodes from a semantic map constructed in real time.
By harnessing topological and semantic information, VoroNav designs text-based descriptions of paths and images that are readily interpretable by a large language model.
arXiv Detail & Related papers (2024-01-05T08:05:07Z)
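A common way to obtain Voronoi-style exploratory paths is to skeletonize the free space of an occupancy grid, which approximates the generalized Voronoi diagram; the sketch below is a generic construction, not necessarily VoroNav's exact pipeline:

```python
import numpy as np
from scipy.ndimage import convolve
from skimage.morphology import skeletonize

# Toy occupancy grid: 1 = obstacle, 0 = free space.
grid = np.zeros((40, 40), dtype=np.uint8)
grid[:2, :] = grid[-2:, :] = grid[:, :2] = grid[:, -2:] = 1   # walls
grid[10:30, 18:22] = 1                                        # inner obstacle

# Skeletonizing free space keeps the cells maximally distant from
# obstacles, i.e., a discrete generalized Voronoi diagram.
skeleton = skeletonize(grid == 0)

# Planning nodes: skeleton cells with 3+ skeleton neighbors (junctions).
neighbors = convolve(skeleton.astype(int), np.ones((3, 3), int),
                     mode="constant") - skeleton
junctions = np.argwhere(skeleton & (neighbors >= 3))
print(f"{skeleton.sum()} skeleton cells, {len(junctions)} junction nodes")
```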
- Mind the Gap: Improving Success Rate of Vision-and-Language Navigation by Revisiting Oracle Success Routes [25.944819618283613]
Vision-and-Language Navigation (VLN) aims to navigate to the target location by following a given instruction.
We make the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR).
arXiv Detail & Related papers (2023-08-07T01:43:25Z)
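The SR-OSR gap is concrete in code: OSR credits an episode if the agent ever passes within the success threshold of the goal, while SR requires it to stop there. A minimal sketch with assumed 2-D positions and an assumed threshold:

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def sr_and_osr(paths, goals, threshold: float = 3.0):
    """SR checks only the final position; OSR checks every point on the
    path, so OSR >= SR. The paper targets narrowing this gap."""
    sr = sum(dist(p[-1], g) <= threshold for p, g in zip(paths, goals))
    osr = sum(any(dist(x, g) <= threshold for x in p)
              for p, g in zip(paths, goals))
    n = len(paths)
    return sr / n, osr / n

# The agent passes within range of the goal but stops too far away:
paths = [[(0, 0), (2.5, 0), (8, 0)]]
print(sr_and_osr(paths, goals=[(3, 0)]))  # (0.0, 1.0)
```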
- ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability to perform obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on the R2R-CE and RxR-CE datasets, respectively.
arXiv Detail & Related papers (2023-04-06T13:07:17Z)
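The long-range planning half of this skill pair can be pictured as shortest-path search over an online-built topological graph; a minimal sketch using networkx with made-up waypoints (ETPNav's graph construction and local controller are not reproduced here):

```python
import networkx as nx

# A topological map: nodes are visited or candidate waypoints, edges carry
# traversal distances (toy values standing in for an online-built graph).
G = nx.Graph()
G.add_weighted_edges_from([
    ("start", "hall", 2.0), ("hall", "kitchen", 3.5),
    ("hall", "bedroom", 2.5), ("bedroom", "kitchen", 4.0),
])

# Long-range plan: shortest path over the abstracted graph; a local
# controller then handles obstacle-avoiding execution between nodes.
plan = nx.shortest_path(G, "start", "kitchen", weight="weight")
print(plan)  # ['start', 'hall', 'kitchen']
```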
- Explore before Moving: A Feasible Path Estimation and Memory Recalling Framework for Embodied Navigation [117.26891277593205]
We focus on navigation and address the problem that existing navigation algorithms lack experience and common sense.
Inspired by the human ability to think twice before moving and to conceive several feasible paths when seeking a goal in unfamiliar scenes, we present a route planning method named the Path Estimation and Memory Recalling (PEMR) framework.
We show strong experimental results for PEMR on the EmbodiedQA navigation task.
arXiv Detail & Related papers (2021-10-16T13:30:55Z)
- Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a crucial architecture for vision-language navigation (VLN): Structured Scene Memory (SSM).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)
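A graph-style memory that keeps visual features and geometric cues separate per visited place captures the flavor of such a representation; the data structure below is a generic stand-in, not the paper's actual SSM:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SceneNode:
    """One memorized place: visual percept kept apart from geometry."""
    position: tuple[float, float]   # geometric cue
    feature: np.ndarray             # visual cue (e.g., a CNN embedding)

@dataclass
class SceneMemory:
    """Graph-structured memory over visited places (illustrative only)."""
    nodes: dict[str, SceneNode] = field(default_factory=dict)
    edges: set[tuple[str, str]] = field(default_factory=set)

    def add_observation(self, node_id, position, feature, came_from=None):
        self.nodes[node_id] = SceneNode(position, feature)
        if came_from is not None:
            self.edges.add((came_from, node_id))

mem = SceneMemory()
mem.add_observation("n0", (0.0, 0.0), np.zeros(512))
mem.add_observation("n1", (1.5, 0.0), np.ones(512), came_from="n0")
```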
- Occupancy Anticipation for Efficient Exploration and Navigation [97.17517060585875]
We propose occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions.
By exploiting context in both the egocentric views and top-down maps, our model successfully anticipates a broader map of the environment.
Our approach is the winning entry in the 2020 Habitat PointNav Challenge.
arXiv Detail & Related papers (2020-08-21T03:16:51Z)
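The anticipation step can be pictured as image-to-image prediction: visible egocentric occupancy in, completed occupancy out. A toy PyTorch encoder-decoder under that framing (the architecture and channel layout are assumptions, not the paper's model):

```python
import torch
import torch.nn as nn

class OccupancyAnticipator(nn.Module):
    """Maps the currently visible egocentric occupancy (2 channels:
    occupied/free) to an anticipated map that also fills unseen regions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 1), nn.Sigmoid(),   # per-cell probabilities
        )

    def forward(self, visible_occupancy: torch.Tensor) -> torch.Tensor:
        return self.net(visible_occupancy)

model = OccupancyAnticipator()
pred = model(torch.rand(1, 2, 64, 64))   # (batch, 2, H, W) anticipated map
print(pred.shape)
```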