SeqWalker: Sequential-Horizon Vision-and-Language Navigation with Hierarchical Planning
- URL: http://arxiv.org/abs/2601.04699v1
- Date: Thu, 08 Jan 2026 08:09:24 GMT
- Title: SeqWalker: Sequential-Horizon Vision-and-Language Navigation with Hierarchical Planning
- Authors: Zebin Han, Xudong Wang, Baichen Liu, Qi Lyu, Zhenduo Shang, Jiahua Dong, Lianqing Liu, Zhi Han
- Abstract summary: Sequential-Horizon Vision-and-Language Navigation (SH-VLN) presents a challenging scenario. Current vision-and-language navigation models exhibit significant performance degradation with multi-task instructions. We propose SeqWalker, a navigation model built on a hierarchical planning framework.
- Score: 20.14137096976666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sequential-Horizon Vision-and-Language Navigation (SH-VLN) presents a challenging scenario in which agents must sequentially execute multiple navigation tasks guided by complex, long-horizon language instructions. Current vision-and-language navigation models exhibit significant performance degradation on such multi-task instructions, as information overload impairs the agent's ability to attend to observationally relevant details. To address this problem, we propose SeqWalker, a navigation model built on a hierarchical planning framework. SeqWalker features: i) a High-Level Planner that dynamically distills the global instruction into contextually relevant sub-instructions based on the agent's current visual observations, reducing cognitive load; and ii) a Low-Level Planner incorporating an Exploration-Verification strategy that leverages the inherent logical structure of instructions for trajectory error correction. To evaluate SH-VLN performance, we also extend the IVLN dataset and establish a new benchmark. Extensive experiments demonstrate the superiority of SeqWalker.
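The hierarchical split described in the abstract is easy to picture in code. The sketch below is a toy rendering of the idea, not the paper's implementation: every class and method name, the keyword-overlap relevance() stub, and the backtracking check are invented for illustration.

```python
# Illustrative sketch of a hierarchical SH-VLN planning loop in the spirit
# of SeqWalker. All names here are hypothetical; the paper's planners are
# learned models, not the keyword heuristics used below.
from dataclasses import dataclass, field
from typing import List

def relevance(sub_instruction: str, observation: List[str]) -> float:
    """Stub relevance score: fraction of observed object labels that
    appear in the sub-instruction (a real system would use a VLM)."""
    words = sub_instruction.lower().split()
    return sum(obj in words for obj in observation) / max(len(observation), 1)

@dataclass
class HighLevelPlanner:
    """Selects the contextually relevant sub-instruction from the
    global instruction, given the current observation."""
    sub_instructions: List[str]
    cursor: int = 0

    def select(self, observation: List[str]) -> str:
        # Advance to the next sub-instruction only when it matches the
        # current view better than the active one.
        nxt = self.cursor + 1
        if nxt < len(self.sub_instructions) and (
            relevance(self.sub_instructions[nxt], observation)
            > relevance(self.sub_instructions[self.cursor], observation)
        ):
            self.cursor = nxt
        return self.sub_instructions[self.cursor]

@dataclass
class LowLevelPlanner:
    """Exploration-Verification: propose a step, verify it against the
    sub-instruction, and backtrack when verification fails."""
    trajectory: List[str] = field(default_factory=list)

    def step(self, sub_instruction: str, candidate_actions: List[str]) -> str:
        action = candidate_actions[0]                  # explore: greedy proposal
        if action.split()[-1] not in sub_instruction:  # verify (toy check)
            self.trajectory and self.trajectory.pop()  # correct: backtrack
            action = "turn_around"
        self.trajectory.append(action)
        return action

# Usage: walk two sub-tasks of a long-horizon instruction.
hlp = HighLevelPlanner(["go to the kitchen table", "then find the red sofa"])
llp = LowLevelPlanner()
for obs, acts in [(["table"], ["forward to table"]), (["sofa"], ["forward to sofa"])]:
    sub = hlp.select(obs)
    print(sub, "->", llp.step(sub, acts))
```

The point of the sketch is the control flow rather than the heuristics: the high-level cursor advances only when observations support it, and the low-level step verifies each action before committing it to the trajectory.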
Related papers
- NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction [12.352236127154761]
We introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination. Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Our work highlights the immense potential of fusing explicit language planning with implicit temporal prediction, paving the way for more intelligent and capable embodied agents.
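As a rough picture of what a dual-horizon output could look like, the sketch below gives one backbone state two heads: one for the next plan token and one for a predicted future observation embedding. All module names, shapes, and vocabulary sizes are invented; this is not the paper's architecture.

```python
# Hypothetical sketch of a dual-output head in the spirit of NavForesee:
# a single fused vision-language state feeds both a planning head and a
# foresight head. Dimensions below are arbitrary placeholders.
import torch
import torch.nn as nn

class DualHorizonHead(nn.Module):
    def __init__(self, d_model=512, vocab=32000, obs_dim=768):
        super().__init__()
        self.plan_head = nn.Linear(d_model, vocab)       # next plan token logits
        self.foresee_head = nn.Linear(d_model, obs_dim)  # future observation embedding

    def forward(self, h):  # h: (batch, d_model) fused state
        return self.plan_head(h), self.foresee_head(h)

h = torch.randn(2, 512)
plan_logits, future_obs = DualHorizonHead()(h)
print(plan_logits.shape, future_obs.shape)  # torch.Size([2, 32000]) torch.Size([2, 768])
```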
arXiv Detail & Related papers (2025-12-01T11:24:16Z)
- Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents [43.5771856761934]
Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. We propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents.
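A minimal sketch of the routing idea, assuming a keyword-based dispatcher: each sub-instruction is sent to a specialized skill agent. The skill names and the heuristic router below are illustrative stand-ins for the learned components.

```python
# Hypothetical sketch of skill-based dispatch in the spirit of SkillNav.
# Real skill agents would be trained policies, not lambdas.
from typing import Callable, Dict

SKILLS: Dict[str, Callable[[str], str]] = {
    "vertical": lambda s: f"take stairs ({s})",
    "stop":     lambda s: f"halt ({s})",
    "move":     lambda s: f"move forward ({s})",
}

def route(sub_instruction: str) -> str:
    text = sub_instruction.lower()
    if "stairs" in text or "floor" in text:
        return SKILLS["vertical"](sub_instruction)
    if "stop" in text or "wait" in text:
        return SKILLS["stop"](sub_instruction)
    return SKILLS["move"](sub_instruction)  # default skill

print(route("go up the stairs to the second floor"))
```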
arXiv Detail & Related papers (2025-08-11T05:50:30Z)
- Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method [94.74003109176581]
Long-Horizon Vision-Language Navigation (LH-VLN) is a novel VLN task that emphasizes long-term planning and decision consistency across consecutive subtasks. Our platform, benchmark and method supply LH-VLN with a robust data generation pipeline, comprehensive model evaluation dataset, reasonable metrics, and a novel VLN model.
arXiv Detail & Related papers (2024-12-12T09:08:13Z)
- OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation [96.46961207887722]
OVER-NAV aims to go over and beyond the current state of the art in IVLN techniques.
To fully exploit the interpreted navigation data, we introduce a structured representation, dubbed Omnigraph.
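One plausible shape for such a representation, sketched below with an invented schema: a navigation graph whose nodes accumulate open-vocabulary detection scores keyed by instruction keywords, so the agent can later query which viewpoint best matches a keyword.

```python
# Hypothetical omnigraph-style structure. The schema (viewpoint nodes,
# keyword-scored detections) is invented for illustration.
from collections import defaultdict

class Omnigraph:
    def __init__(self):
        self.edges = defaultdict(set)        # viewpoint -> neighboring viewpoints
        self.detections = defaultdict(dict)  # viewpoint -> {keyword: best score}

    def add_edge(self, a: str, b: str):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def record(self, viewpoint: str, keyword: str, score: float):
        best = self.detections[viewpoint].get(keyword, 0.0)
        self.detections[viewpoint][keyword] = max(best, score)

    def best_viewpoint(self, keyword: str):
        scored = [(d.get(keyword, 0.0), v) for v, d in self.detections.items()]
        score, vp = max(scored, default=(0.0, None))
        return vp if score > 0 else None

g = Omnigraph()
g.add_edge("vp1", "vp2")
g.record("vp2", "sofa", 0.91)
print(g.best_viewpoint("sofa"))  # vp2
```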
arXiv Detail & Related papers (2024-03-26T02:34:48Z)
- Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes the language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
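The serialization step is concrete enough to sketch. Below is a Pix2Seq-style quantization of a bounding box into discrete tokens; the bin count and token layout are assumptions, not the paper's exact values.

```python
# Illustrative box-to-token serialization: normalize coordinates and
# quantize each into one of N_BINS integer tokens.
N_BINS = 1000  # hypothetical quantization resolution

def box_to_tokens(box, img_w, img_h):
    """Map (x1, y1, x2, y2) pixel coords to integer tokens in [0, N_BINS)."""
    x1, y1, x2, y2 = box
    norm = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    return [min(int(v * N_BINS), N_BINS - 1) for v in norm]

def tokens_to_box(tokens, img_w, img_h):
    """Invert the quantization (up to bin resolution)."""
    x1, y1, x2, y2 = [(t + 0.5) / N_BINS for t in tokens]
    return (x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h)

toks = box_to_tokens((48, 32, 320, 240), 640, 480)
print(toks)                           # [75, 66, 500, 500]
print(tokens_to_box(toks, 640, 480))  # approximately the original box
```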
arXiv Detail & Related papers (2023-08-27T13:17:34Z)
- $A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models [89.64729024399634]
We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions.
Normally, the instructions have complex grammatical structures and often contain various action descriptions.
How to correctly understand and execute these action demands is a critical problem, and the absence of annotated data makes it even more challenging.
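To make the decomposition concrete, the toy sketch below splits an instruction into (action, argument) sub-tasks using a verb list and clause splitting; the actual method relies on foundation models rather than this hand-written heuristic.

```python
# Hypothetical action-aware instruction parsing: split clauses, then
# match each against a small verb inventory. Verb list and regex are
# illustrative stand-ins for learned parsing.
import re

ACTION_VERBS = ("go", "turn", "enter", "exit", "stop", "pass", "climb")

def decompose(instruction: str):
    clauses = re.split(r",|\band\b|\bthen\b", instruction.lower())
    subtasks = []
    for clause in (c.strip() for c in clauses):
        for verb in ACTION_VERBS:
            if clause.startswith(verb):
                subtasks.append((verb, clause[len(verb):].strip()))
                break
    return subtasks

print(decompose("Go down the hall, turn left and enter the bedroom"))
# [('go', 'down the hall'), ('turn', 'left'), ('enter', 'the bedroom')]
```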
arXiv Detail & Related papers (2023-08-15T19:01:19Z)
- NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models [17.495162643127003]
We introduce NavGPT to reveal the reasoning capability of GPT models in complex embodied scenes.
NavGPT takes textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason about the agent's current status.
We show that NavGPT is capable of generating high-quality navigational instructions from observations and actions along a path.
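The input format lends itself to a small sketch: assemble observations, history, and candidate directions into a single textual prompt for the LLM. The template below is invented, not NavGPT's actual prompt.

```python
# Hypothetical prompt assembly for an LLM-driven navigation agent.
def build_prompt(observations, history, directions):
    obs = "\n".join(f"- {o}" for o in observations)
    hist = "; ".join(history) or "none"
    dirs = "\n".join(f"{i}. {d}" for i, d in enumerate(directions))
    return (
        "You are a navigation agent.\n"
        f"Current observations:\n{obs}\n"
        f"Action history: {hist}\n"
        f"Explorable directions:\n{dirs}\n"
        "Think step by step, then answer with one direction index."
    )

print(build_prompt(
    observations=["a hallway with a door on the left"],
    history=["moved forward"],
    directions=["through the door", "continue down the hallway"],
))
```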
arXiv Detail & Related papers (2023-05-26T14:41:06Z)
- ENTL: Embodied Navigation Trajectory Learner [37.43079415330256]
We propose a method for extracting long sequence representations for embodied navigation.
We train our model using vector-quantized predictions of future states conditioned on current actions.
A key property of our approach is that the model is pre-trained without any explicit reward signal.
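The core operation, vector-quantizing a predicted future state, can be sketched in a few lines. The codebook size and embedding dimension below are invented; a trained system would learn the codebook jointly with the predictor.

```python
# Minimal vector-quantization step: snap each predicted state embedding
# to its nearest codebook entry. Sizes are illustrative placeholders.
import torch

codebook = torch.randn(512, 64)  # 512 discrete state codes

def quantize(pred_state: torch.Tensor):
    """Return (nearest code vectors, their indices) for predicted states."""
    dists = torch.cdist(pred_state, codebook)  # (batch, 512) pairwise distances
    idx = dists.argmin(dim=1)
    return codebook[idx], idx

pred = torch.randn(4, 64)        # 4 predicted future states
codes, idx = quantize(pred)
print(idx.shape, codes.shape)    # torch.Size([4]) torch.Size([4, 64])
```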
arXiv Detail & Related papers (2023-04-05T17:58:33Z)
- Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation [172.15808300686584]
We describe an approach that learns the two tasks simultaneously and exploits their intrinsic correlations to boost the training of each.
Our approach improves the performance of various follower models and produces accurate navigation instructions.
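The round-trip structure is simple to illustrate: a follower maps an instruction to a trajectory, a speaker maps the trajectory back to an instruction, and disagreement after the round trip serves as a training signal. Both "models" below are stubs standing in for trained networks.

```python
# Toy cycle-consistency sketch: follower and speaker are trivial inverses
# here, so the round trip reconstructs the input exactly.
def follower(instruction: str):
    return instruction.split()   # stub: one "action" per word

def speaker(trajectory):
    return " ".join(trajectory)  # stub: inverse of the follower

def cycle_loss(instruction: str) -> int:
    recon = speaker(follower(instruction))
    # Token disagreement as a stand-in for a learned reconstruction loss.
    return sum(a != b for a, b in zip(instruction.split(), recon.split()))

print(cycle_loss("walk past the couch"))  # 0: perfect reconstruction
```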
arXiv Detail & Related papers (2022-03-30T18:15:26Z)
- Waypoint Models for Instruction-guided Navigation in Continuous Environments [68.2912740006109]
We develop a class of language-conditioned waypoint prediction networks to examine the role of the action space in language-guided navigation.
We measure task performance and estimated execution time on a profiled LoCoBot robot.
Our models outperform prior work in VLN-CE and set a new state-of-the-art on the public leaderboard.
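A minimal sketch of what a language-conditioned waypoint head could look like, assuming fused instruction and visual embeddings regressed to a relative (distance, heading) pair; all dimensions and names are invented.

```python
# Hypothetical waypoint prediction head: concatenate language and visual
# embeddings, regress a relative waypoint. Not the paper's architecture.
import torch
import torch.nn as nn

class WaypointHead(nn.Module):
    def __init__(self, d_lang=256, d_vis=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_lang + d_vis, 128), nn.ReLU(),
            nn.Linear(128, 2),  # (distance in meters, heading in radians)
        )

    def forward(self, lang, vis):
        return self.net(torch.cat([lang, vis], dim=-1))

head = WaypointHead()
wp = head(torch.randn(1, 256), torch.randn(1, 256))
print(wp)  # a (1, 2) tensor: predicted distance and heading
```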
arXiv Detail & Related papers (2021-10-05T17:55:49Z)