When Engineering Outruns Intelligence: A Re-evaluation of Instruction-Guided Navigation
- URL: http://arxiv.org/abs/2507.20021v1
- Date: Sat, 26 Jul 2025 17:37:15 GMT
- Title: When Engineering Outruns Intelligence: A Re-evaluation of Instruction-Guided Navigation
- Authors: Matin Aghaei, Mohammad Ali Alomrani, Yingxue Zhang, Mahdi Biparva,
- Abstract summary: We strip InstructNav of its Dynamic Chain-of-Navigation prompt, open-vocabulary GLEE detector, and Intuition saliency map, and replace them with a simple Distance-Weighted Frontier Explorer (DWFE). This geometry-only heuristic raises Success from 58.0% to 61.1% and lifts SPL from 20.9% to 36.0% over 2,000 validation episodes. Our results indicate that frontier geometry, not emergent LLM reasoning, drives most reported gains.
- Score: 9.31776371252164
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) are often credited with recent leaps in ObjectGoal Navigation, yet the extent to which they improve planning remains unclear. We revisit this question on the HM3D-v1 validation split. First, we strip InstructNav of its Dynamic Chain-of-Navigation prompt, open-vocabulary GLEE detector, and Intuition saliency map, and replace them with a simple Distance-Weighted Frontier Explorer (DWFE). This geometry-only heuristic raises Success from 58.0% to 61.1% and lifts SPL from 20.9% to 36.0% over 2,000 validation episodes, outperforming all previous training-free baselines. Second, we add a lightweight language prior (SHF); on a 200-episode subset this yields a further +2% Success and +0.9% SPL while shortening paths by five steps on average. Qualitative trajectories confirm the trend: InstructNav backtracks and times out, DWFE reaches the goal after a few islands, and SHF follows an almost straight route. Our results indicate that frontier geometry, not emergent LLM reasoning, drives most reported gains, and suggest that metric-aware prompts or offline semantic graphs are necessary before attributing navigation success to "LLM intelligence."
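The DWFE heuristic named in the abstract can be sketched roughly as follows. The paper does not publish its internals, so the grid encoding, the exponential distance weighting, and the helper names below are illustrative assumptions rather than the authors' implementation; the `spl` helper implements the standard Success weighted by Path Length metric the abstract reports:

```python
import math

# Hypothetical sketch of a Distance-Weighted Frontier Explorer (DWFE).
# Assumptions (not from the paper): a 2-D occupancy grid with the cell
# codes below, and an exponential distance weighting exp(-d / scale)
# that makes nearer frontiers score higher.

FREE, OCCUPIED, UNKNOWN = 0, 1, -1

def frontier_cells(grid):
    """Free cells that border at least one unknown cell."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers

def select_frontier(grid, agent, scale=5.0):
    """Pick the frontier with the highest distance-weighted score."""
    def score(f):
        d = math.hypot(f[0] - agent[0], f[1] - agent[1])
        return math.exp(-d / scale)
    cells = frontier_cells(grid)
    return max(cells, key=score) if cells else None

def spl(successes, shortest, taken):
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i)."""
    return sum(s * l / max(p, l)
               for s, l, p in zip(successes, shortest, taken)) / len(successes)
```

Because exp(-d / scale) is monotone in distance, this particular weighting reduces to nearest-frontier selection; the actual DWFE may combine distance with other geometric terms.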
Related papers
- RATE-Nav: Region-Aware Termination Enhancement for Zero-shot Object Navigation with Vision-Language Models [36.39389224168802]
A critical but underexplored direction is the timely termination of exploration to overcome these challenges. We propose RATE-Nav, a Region-Aware Termination-Enhanced method. It includes a geometric predictive region segmentation algorithm and a region-based exploration estimation algorithm for exploration-rate calculation. It achieves a success rate of 67.8% and an SPL of 31.3% on the HM3D dataset.
arXiv Detail & Related papers (2025-06-03T01:15:00Z)
- EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation [111.0993686148283]
We propose a novel sElf-improving embodied reasoning framework for boosting Vision-Language Navigation, dubbed EvolveNav. Our EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to activate the model's navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity.
arXiv Detail & Related papers (2025-06-02T11:28:32Z)
- DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation [45.87909960783996]
DORAEMON is a cognitive-inspired framework consisting of Ventral and Dorsal Streams that mimics human navigation capabilities. We evaluate DORAEMON on the HM3D, MP3D, and GOAT datasets, where it achieves state-of-the-art performance on both success rate (SR) and success weighted by path length (SPL) metrics.
arXiv Detail & Related papers (2025-05-28T04:46:13Z)
- Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel [83.7466618084902]
We introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale navigational instruction-trajectory pairs. Our experiments demonstrate that after several flywheel rounds, the navigator elevates the performance boundary from 70% to 78% SPL on the classic R2R test set. This process results in a superior generator, evidenced by a SPICE increase from 23.5 to 26.2, better than all previous VLN instruction generation methods.
arXiv Detail & Related papers (2024-12-11T15:32:24Z)
- Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [64.84996994779443]
We propose a novel Affordances-Oriented Planner for continuous vision-language navigation (VLN) task.
Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making.
Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-07-08T12:52:46Z)
- NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [97.88246428240872]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we perform parameter-efficient in-domain training to enable self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z)
- VoroNav: Voronoi-based Zero-shot Object Navigation with Large Language Model [28.79971953667143]
VoroNav is a semantic exploration framework to extract exploratory paths and planning nodes from a semantic map constructed in real time.
By harnessing topological and semantic information, VoroNav designs text-based descriptions of paths and images that are readily interpretable by a large language model.
arXiv Detail & Related papers (2024-01-05T08:05:07Z)
- Explore before Moving: A Feasible Path Estimation and Memory Recalling Framework for Embodied Navigation [117.26891277593205]
We focus on navigation and address the problem that existing navigation algorithms lack experience and common sense.
Inspired by the human ability to think twice before moving and conceive several feasible paths to seek a goal in unfamiliar scenes, we present a route planning method named Path Estimation and Memory Recalling framework.
We show strong experimental results of PEMR on the EmbodiedQA navigation task.
arXiv Detail & Related papers (2021-10-16T13:30:55Z)
- Rethinking the Spatial Route Prior in Vision-and-Language Navigation [29.244758196643307]
Vision-and-language navigation (VLN) is a trending topic that aims to guide an intelligent agent to an expected position through natural language instructions.
This work addresses the task of VLN from a previously-ignored aspect, namely the spatial route prior of the navigation scenes.
arXiv Detail & Related papers (2021-10-12T03:55:43Z)
- Waypoint Models for Instruction-guided Navigation in Continuous Environments [68.2912740006109]
We develop a class of language-conditioned waypoint prediction networks to examine this question.
We measure task performance and estimated execution time on a profiled LoCoBot robot.
Our models outperform prior work in VLN-CE and set a new state-of-the-art on the public leaderboard.
arXiv Detail & Related papers (2021-10-05T17:55:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.