SLAM-Free Visual Navigation with Hierarchical Vision-Language Perception and Coarse-to-Fine Semantic Topological Planning
- URL: http://arxiv.org/abs/2509.20739v1
- Date: Thu, 25 Sep 2025 04:38:45 GMT
- Title: SLAM-Free Visual Navigation with Hierarchical Vision-Language Perception and Coarse-to-Fine Semantic Topological Planning
- Authors: Guoyang Zhao, Yudong Li, Weiqing Qi, Kai Zhang, Bonan Liu, Kai Chen, Haoang Li, Jun Ma
- Abstract summary: We propose a vision-only, SLAM-free navigation framework for legged robots. A hierarchical vision-language perception module fuses scene-level context with object-level cues for robust semantic inference. Integrated with reinforcement-learning controllers, the framework is deployable across diverse legged robot platforms.
- Score: 20.12642476619467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional SLAM pipelines for legged robot navigation are fragile under rapid motion, calibration demands, and sensor drift, while offering limited semantic reasoning for task-driven exploration. To deal with these issues, we propose a vision-only, SLAM-free navigation framework that replaces dense geometry with semantic reasoning and lightweight topological representations. A hierarchical vision-language perception module fuses scene-level context with object-level cues for robust semantic inference, and a semantic-probabilistic topological map supports coarse-to-fine planning: LLM-based global reasoning for subgoal selection and vision-based local planning for obstacle avoidance. Integrated with reinforcement-learning locomotion controllers, the framework is deployable across diverse legged robot platforms. Experiments in simulation and real-world settings demonstrate consistent improvements in semantic accuracy, planning quality, and navigation success, while ablation studies further confirm the necessity of both hierarchical perception and fine local planning. This work introduces a new paradigm for SLAM-free, vision-language-driven navigation, shifting robotic exploration from geometry-centric mapping to semantics-driven decision making.
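The abstract gives no implementation details, so the following Python sketch is purely illustrative of the structure it describes: a lightweight topological map whose nodes carry semantic label probabilities, plus an LLM-scored coarse subgoal step. All names (`TopoNode`, `SemanticTopoMap`, `select_subgoal`) and the additive fusion heuristic are assumptions, not the authors' code.

```python
# Illustrative sketch only: a semantic-probabilistic topological map with an
# LLM-scored coarse subgoal step. Names and the fusion rule are assumptions.
from dataclasses import dataclass, field

@dataclass
class TopoNode:
    node_id: int
    position: tuple                                  # coarse 2D pose; no dense geometry
    semantics: dict = field(default_factory=dict)    # label -> probability
    neighbors: set = field(default_factory=set)

class SemanticTopoMap:
    def __init__(self):
        self.nodes = {}

    def add_node(self, node, connect_to=()):
        self.nodes[node.node_id] = node
        for other in connect_to:                     # undirected graph edges
            node.neighbors.add(other)
            self.nodes[other].neighbors.add(node.node_id)

    def update_semantics(self, node_id, label, prob, alpha=0.5):
        # A moving average keeps labels probabilistic rather than binary.
        sem = self.nodes[node_id].semantics
        sem[label] = alpha * prob + (1 - alpha) * sem.get(label, 0.0)

def select_subgoal(topo_map, goal_label, llm_scores):
    """Coarse (global) step: fuse map semantics with per-node LLM relevance."""
    def score(node):
        return node.semantics.get(goal_label, 0.0) + llm_scores.get(node.node_id, 0.0)
    return max(topo_map.nodes.values(), key=score).node_id
```

The point is the representation: a handful of graph nodes with label probabilities stands in for dense metric geometry, so the coarse planning step reduces to scoring nodes; the fine step (vision-based local obstacle avoidance) would run underneath and is not shown.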
Related papers
- OpenFrontier: General Navigation with Visual-Language Grounded Frontiers [54.661157616245966]
Open-world navigation requires robots to make decisions in complex everyday environments.
Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language.
We propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision-language prior models.
arXiv Detail & Related papers (2026-03-05T17:02:22Z)
- To Move or Not to Move: Constraint-based Planning Enables Zero-Shot Generalization for Interactive Navigation [14.745622942938532]
In real-world scenarios, such as home environments and warehouses, clutter can block all routes.
We introduce the Lifelong Interactive Navigation problem, where a mobile robot can move clutter to forge its own path.
We propose an LLM-driven, constraint-based planning framework with active perception.
arXiv Detail & Related papers (2026-02-23T17:10:00Z)
- AirHunt: Bridging VLM Semantics and Continuous Planning for Efficient Aerial Object Navigation [13.973823761671673]
AirHunt is an aerial object navigation system that efficiently locates open-set objects with zero-shot generalization in outdoor environments.
AirHunt features a dual-pathway asynchronous architecture that establishes a synergistic interface between VLM semantic reasoning and path planning.
We evaluate AirHunt across diverse object navigation tasks and environments, demonstrating a higher success rate with lower navigation error and reduced flight time.
arXiv Detail & Related papers (2026-01-19T05:50:03Z)
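As a rough illustration of what AirHunt's "dual-pathway asynchronous architecture" might look like in code, here is a minimal Python sketch: a slow VLM pathway publishes semantic goals through a one-slot queue while a fast planning loop consumes the freshest one. The stub callables, rates, and queue interface are assumptions, not the paper's implementation.

```python
# Illustrative sketch only: slow VLM reasoning feeds a fast planning loop
# through a one-slot queue. Stubs and rates are assumptions.
import queue
import time

semantic_goals = queue.Queue(maxsize=1)   # one-slot mailbox: latest goal wins

def vlm_pathway(run_vlm, get_image):
    """Slow pathway: query the VLM at its own pace, publish each new goal."""
    while True:
        goal = run_vlm(get_image())       # run_vlm is a stub callable
        try:
            semantic_goals.get_nowait()   # drop a stale goal if one is waiting
        except queue.Empty:
            pass
        semantic_goals.put(goal)

def planning_pathway(plan_step, hz=20.0):
    """Fast pathway: replan continuously against the freshest semantic goal."""
    goal = None
    while True:
        try:
            goal = semantic_goals.get_nowait()
        except queue.Empty:
            pass                          # no new goal; keep using the old one
        plan_step(goal)                   # plan_step is a stub callable
        time.sleep(1.0 / hz)
```

Each pathway would run on its own thread (e.g. `threading.Thread(target=...)`), which is what decouples VLM latency from the control rate.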
- Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation [70.8381970762877]
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning.
We introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts.
GRACE provides a unified and interpretable interface between high-level instruction understanding and low-level robot control.
arXiv Detail & Related papers (2025-10-09T09:08:33Z)
- A Navigation Framework Utilizing Vision-Language Models [0.0]
Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI.
Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding.
We propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning.
arXiv Detail & Related papers (2025-06-11T20:51:58Z)
- Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System [7.266794815157721]
We propose a hierarchical framework integrating a prompted Large Language Model (LLM) and a fine-tuned Vision Language Model (VLM).
The LLM decomposes tasks and constructs a global semantic map, while the VLM extracts task-specified semantic labels and 2D spatial information from aerial images to support local planning.
This is the first demonstration of an aerial-ground heterogeneous system integrating VLM-based perception with LLM-driven task reasoning and motion planning.
arXiv Detail & Related papers (2025-06-05T13:27:41Z)
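A minimal sketch of the hierarchy the aerial-ground paper describes, assuming the LLM and VLM are exposed as plain callables; the function names and signatures here are invented for illustration only.

```python
# Illustrative sketch only: the LLM handles global task decomposition and the
# VLM grounds each subtask in the current aerial view. All names are invented.
def navigate_hierarchically(instruction, llm_decompose, vlm_ground,
                            get_aerial_image, move_to):
    """llm_decompose: instruction -> ordered subtask labels (global reasoning).
    vlm_ground: (image, label) -> 2D position or None (local grounding)."""
    for subtask in llm_decompose(instruction):        # global level (LLM)
        # Local level (VLM): extract the subtask's 2D location from the
        # aerial image to produce a target for the ground robot.
        target = vlm_ground(get_aerial_image(), subtask)
        if target is not None:
            move_to(target)
```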
- Semantic Exploration and Dense Mapping of Complex Environments using Ground Robot with Panoramic LiDAR-Camera Fusion [10.438142938687326]
This paper presents a system for autonomous semantic exploration and dense semantic target mapping of a complex unknown environment using a ground robot equipped with a LiDAR-panoramic camera suite.
We first redefine the task as completing both geometric coverage and semantic viewpoint observation. We then manage semantic and geometric viewpoints separately and propose a novel Priority-driven Decoupled Local Sampler to generate local viewpoint sets.
In addition, we propose a Safe Aggressive Exploration State Machine, which allows aggressive exploration behavior while ensuring the robot's safety.
arXiv Detail & Related papers (2025-05-28T21:27:32Z)
- Navigating Motion Agents in Dynamic and Cluttered Environments through LLM Reasoning [69.5875073447454]
This paper advances motion agents empowered by large language models (LLMs) toward autonomous navigation in dynamic and cluttered environments.
Our training-free framework supports multi-agent coordination, closed-loop replanning, and dynamic obstacle avoidance without retraining or fine-tuning.
arXiv Detail & Related papers (2025-03-10T13:39:09Z)
- Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning [73.0990339667978]
Navigation in unfamiliar environments presents a major challenge for robots.
We use language models to bias exploration of novel real-world environments.
We evaluate LFG in challenging real-world environments and simulated benchmarks.
arXiv Detail & Related papers (2023-10-16T06:21:06Z)
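LFG's core idea, language-model "guesswork" used as a planning heuristic rather than as the planner itself, can be sketched in a few lines. The weighting scheme and the stubbed `llm_score` callable are assumptions, not the paper's method.

```python
# Illustrative sketch only: an LLM goal-likelihood guess biases a
# distance-based frontier choice but never replaces geometry entirely.
def pick_frontier(frontiers, robot_pos, llm_score, w=1.0):
    """frontiers: list of (x, y); llm_score: frontier -> likelihood in [0, 1]."""
    def cost(f):
        dist = ((f[0] - robot_pos[0]) ** 2 + (f[1] - robot_pos[1]) ** 2) ** 0.5
        return dist - w * llm_score(f)    # nearer and more promising = cheaper
    return min(frontiers, key=cost)
```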
- How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers [94.46825166907831]
We present a training-free solution to tackle the object goal navigation problem in Embodied AI.
Our method builds a structured scene representation based on the classic visual simultaneous localization and mapping (V-SLAM) framework.
Our method propagates semantics on the scene graphs based on language priors and scene statistics to introduce semantic knowledge to the geometric frontiers.
arXiv Detail & Related papers (2023-05-26T13:38:33Z)
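A hedged sketch of the semantic propagation step the Dragon abstract describes: semantics from scene-graph nodes, weighted by a language prior, are accumulated onto nearby geometric frontiers. The distance-decay model and the similarity lookup are assumptions, not the paper's method.

```python
# Illustrative sketch only: propagating scene-graph semantics onto geometric
# frontiers via a language prior. Decay and similarity are assumptions.
import math

def score_frontiers(frontiers, scene_nodes, goal_similarity, decay=0.5):
    """frontiers: list of (x, y); scene_nodes: list of (x, y, label);
    goal_similarity: label -> similarity to the goal under a language prior."""
    scores = []
    for fx, fy in frontiers:
        total = 0.0
        for nx, ny, label in scene_nodes:
            dist = math.hypot(fx - nx, fy - ny)
            # Closer semantic evidence contributes more to the frontier.
            total += goal_similarity.get(label, 0.0) * math.exp(-decay * dist)
        scores.append(total)
    return scores

def best_frontier(frontiers, scene_nodes, goal_similarity):
    scores = score_frontiers(frontiers, scene_nodes, goal_similarity)
    return max(range(len(frontiers)), key=scores.__getitem__)
```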
- Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation [58.3480730643517]
We present LGX, a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON).
Our approach makes use of Large Language Models (LLMs) for this task.
We achieve state-of-the-art zero-shot object navigation results on RoboTHOR, with a success rate (SR) improvement of more than 27% over the current baseline.
arXiv Detail & Related papers (2023-03-06T20:19:19Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
- Object Goal Navigation using Goal-Oriented Semantic Exploration [98.14078233526476]
This work studies the problem of object goal navigation, which involves navigating to an instance of the given object category in unseen environments.
We propose a modular system called 'Goal-Oriented Semantic Exploration', which builds an episodic semantic map and uses it to explore the environment efficiently.
arXiv Detail & Related papers (2020-07-01T17:52:32Z)
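Goal-Oriented Semantic Exploration's episodic semantic map is commonly described as a top-down grid with one channel per object category; the sketch below follows that reading, but the grid size, update rule, and goal heuristic are assumptions, not the paper's code.

```python
# Illustrative sketch only: a top-down episodic map with one channel per
# object category, queried for a long-term navigation goal.
import numpy as np

class EpisodicSemanticMap:
    def __init__(self, num_categories, size=240):
        # Channel 0: explored mask; channels 1..N: per-category evidence.
        self.grid = np.zeros((num_categories + 1, size, size), dtype=np.float32)

    def update(self, cell, category, confidence):
        r, c = cell
        self.grid[0, r, c] = 1.0          # mark the cell as explored
        self.grid[category + 1, r, c] = max(self.grid[category + 1, r, c], confidence)

    def long_term_goal(self, category):
        channel = self.grid[category + 1]
        if channel.max() > 0:
            # Goal category already seen: head for the strongest evidence.
            return np.unravel_index(int(channel.argmax()), channel.shape)
        # Otherwise keep exploring: pick an unexplored cell
        # (real frontier selection is omitted here).
        unexplored = np.argwhere(self.grid[0] == 0)
        return tuple(unexplored[0]) if len(unexplored) else (0, 0)
```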
This list is automatically generated from the titles and abstracts of the papers on this site.