Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation
- URL: http://arxiv.org/abs/2507.11001v1
- Date: Tue, 15 Jul 2025 05:37:24 GMT
- Title: Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation
- Authors: Yanbo Wang, Zipeng Fang, Lei Zhao, Weidong Chen
- Abstract summary: We present LE-Nav, an interpretable and scene-aware navigation framework for service robots. To achieve zero-shot scene understanding, we utilize one-shot exemplars and chain-of-thought prompting strategies. Experiments show that LE-Nav can generate hyperparameters achieving human-level tuning across diverse planners and scenarios.
- Score: 12.561993540768729
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Service robots are increasingly deployed in diverse and dynamic environments, where both physical layouts and social contexts change over time and across locations. In these unstructured settings, conventional navigation systems that rely on fixed parameters often fail to generalize across scenarios, resulting in degraded performance and reduced social acceptance. Although recent approaches have leveraged reinforcement learning to enhance traditional planners, these methods often fail in real-world deployments due to poor generalization and limited simulation diversity, which hampers effective sim-to-real transfer. To tackle these issues, we present LE-Nav, an interpretable and scene-aware navigation framework that leverages multi-modal large language model reasoning and conditional variational autoencoders to adaptively tune planner hyperparameters. To achieve zero-shot scene understanding, we utilize one-shot exemplars and chain-of-thought prompting strategies. Additionally, a conditional variational autoencoder captures the mapping between natural language instructions and navigation hyperparameters, enabling expert-level tuning. Experiments show that LE-Nav can generate hyperparameters achieving human-level tuning across diverse planners and scenarios. Real-world navigation trials and a user study on a smart wheelchair platform demonstrate that it outperforms state-of-the-art methods on quantitative metrics such as success rate, efficiency, safety, and comfort, while receiving higher subjective scores for perceived safety and social acceptance. Code is available at https://github.com/Cavendish518/LE-Nav.
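The abstract does not spell out the implementation, but the CVAE component can be pictured as follows: a scene-context embedding (e.g., produced by the MLLM's chain-of-thought reasoning) conditions a decoder that outputs a vector of planner hyperparameters. The sketch below is a minimal, hypothetical PyTorch illustration of that idea; all class names, dimensions, and the choice of hyperparameters are assumptions, not LE-Nav's actual code.

```python
# Minimal, hypothetical sketch of a conditional VAE that maps a scene-context
# embedding to planner hyperparameters. Names and dimensions are illustrative.
import torch
import torch.nn as nn

class HyperparamCVAE(nn.Module):
    def __init__(self, cond_dim=512, param_dim=8, latent_dim=16, hidden=128):
        super().__init__()
        # Encoder q(z | params, condition)
        self.encoder = nn.Sequential(
            nn.Linear(param_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # outputs [mu, logvar]
        )
        # Decoder p(params | z, condition)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, param_dim),
        )
        self.latent_dim = latent_dim

    def forward(self, params, cond):
        mu, logvar = self.encoder(torch.cat([params, cond], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

    @torch.no_grad()
    def tune(self, cond):
        """At deployment: sample z from the prior and decode hyperparameters."""
        z = torch.randn(cond.shape[0], self.latent_dim, device=cond.device)
        return self.decoder(torch.cat([z, cond], dim=-1))

# Hypothetical usage: cond stands in for the MLLM scene embedding; the output
# vector would be interpreted as planner gains (e.g., max speed, inflation
# radius, goal tolerance) handed to the local planner.
model = HyperparamCVAE()
scene_embedding = torch.randn(1, 512)
hyperparams = model.tune(scene_embedding)  # shape (1, 8)
```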
Related papers
- From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning [59.88543114325153]
We introduce the Seeing-to-Experiencing framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pre-training on videos and post-training through RL. We establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3DGS reconstructions of real-world scenes.
arXiv Detail & Related papers (2025-07-29T17:26:10Z)
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
- Ground-level Viewpoint Vision-and-Language Navigation in Continuous Environments [10.953629652228024]
Vision-and-Language Navigation (VLN) agents associate time-sequenced visual observations with corresponding instructions to make decisions. In this paper, we address the mismatch between human-centric instructions and quadruped robots with a low-height field of view. We propose a Ground-level Viewpoint Navigation (GVNav) approach to mitigate this issue.
arXiv Detail & Related papers (2025-02-26T10:30:40Z)
- OpenObject-NAV: Open-Vocabulary Object-Oriented Navigation Based on Dynamic Carrier-Relationship Scene Graph [10.475404599532157]
This paper captures the relationships between frequently used objects and their static carriers.
We propose an instance navigation strategy that models the navigation process as a Markov Decision Process.
The results demonstrate that by updating the CRSG, the robot can efficiently navigate to moved targets.
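As a loose, hypothetical illustration of framing instance navigation as a Markov Decision Process: the state can combine the robot pose with a belief over candidate carriers, actions select the next carrier to visit, and the reward trades off the chance of finding the target against travel cost. None of these specifics come from the paper; the sketch below is a generic one-step greedy policy under those assumptions.

```python
# Hypothetical sketch: instance navigation as an MDP over candidate carriers.
# States, actions, and rewards are illustrative, not the paper's definitions.
from dataclasses import dataclass, field

@dataclass
class CarrierBelief:
    name: str            # e.g. "dining_table"
    position: tuple      # (x, y) in the map frame
    p_has_target: float  # belief that the target object is on this carrier

@dataclass
class NavState:
    robot_xy: tuple
    carriers: list = field(default_factory=list)

def choose_next_carrier(state: NavState) -> CarrierBelief:
    """Greedy one-step policy: trade off belief against travel cost."""
    def score(c: CarrierBelief) -> float:
        dist = ((c.position[0] - state.robot_xy[0]) ** 2 +
                (c.position[1] - state.robot_xy[1]) ** 2) ** 0.5
        return c.p_has_target - 0.1 * dist  # expected reward minus travel penalty

    return max(state.carriers, key=score)

state = NavState(robot_xy=(0.0, 0.0), carriers=[
    CarrierBelief("dining_table", (2.0, 1.0), 0.7),
    CarrierBelief("kitchen_counter", (5.0, 3.0), 0.8),
])
next_goal = choose_next_carrier(state)  # dining_table: closer, with similar belief
```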
arXiv Detail & Related papers (2024-09-27T13:33:52Z)
- PLANRL: A Motion Planning and Imitation Learning Framework to Bootstrap Reinforcement Learning [13.564676246832544]
We introduce PLANRL, a framework that chooses when the robot should use classical motion planning and when it should learn a policy.
PLANRL switches between two modes of operation: reaching a waypoint with classical techniques when far from objects, and fine-grained manipulation control when about to interact with them.
We evaluate our approach across multiple challenging simulation environments and real-world tasks, demonstrating superior performance in terms of adaptability, efficiency, and generalization compared to existing methods.
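The switching rule described above can be pictured with a minimal sketch: hand control to a classical planner when the robot is far from the object of interest, and to a learned policy when it is close. The threshold, planner, and policy classes below are stand-ins I introduce for illustration, not PLANRL's interfaces.

```python
# Hypothetical sketch of distance-based mode switching between a classical
# planner and a learned fine-grained policy. All names are assumptions.
import numpy as np

SWITCH_DIST = 0.25  # metres; illustrative hand-off distance, not from the paper

class WaypointPlanner:
    """Stand-in for a classical motion planner (e.g., a waypoint follower)."""
    def plan(self, robot_pos, waypoint):
        direction = np.asarray(waypoint, dtype=float) - np.asarray(robot_pos, dtype=float)
        norm = np.linalg.norm(direction)
        return direction / norm if norm > 1e-9 else np.zeros_like(direction)

class LearnedPolicy:
    """Stand-in for the fine-grained interaction policy."""
    def act(self, obs):
        return np.zeros(2)  # placeholder action

def select_action(robot_pos, object_pos, waypoint, planner, policy, obs):
    """Switch modes by distance to the object of interest."""
    if np.linalg.norm(np.asarray(object_pos, dtype=float) - np.asarray(robot_pos, dtype=float)) > SWITCH_DIST:
        return planner.plan(robot_pos, waypoint)  # Mode 1: classical planning
    return policy.act(obs)                        # Mode 2: learned control
```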
arXiv Detail & Related papers (2024-08-07T19:30:08Z)
- IN-Sight: Interactive Navigation through Sight [20.184155117341497]
IN-Sight is a novel approach to self-supervised path planning.
It calculates traversability scores and incorporates them into a semantic map.
To precisely navigate around obstacles, IN-Sight employs a local planner.
arXiv Detail & Related papers (2024-08-01T07:27:54Z)
- Hyp2Nav: Hyperbolic Planning and Curiosity for Crowd Navigation [58.574464340559466]
We advocate for hyperbolic learning to enable crowd navigation and we introduce Hyp2Nav.
Hyp2Nav leverages the intrinsic properties of hyperbolic geometry to better encode the hierarchical nature of decision-making processes in navigation tasks.
We propose a hyperbolic policy model and a hyperbolic curiosity module that together yield effective social navigation, achieving the best success rates and returns across multiple simulation settings.
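The summary above does not give the paper's exact formulation; as background, hyperbolic embeddings are typically compared with the Poincare-ball geodesic distance, sketched below. The function is standard geometry, but its use here is illustrative, not Hyp2Nav's code.

```python
# Illustrative only: the Poincare-ball distance commonly used for hyperbolic
# embeddings. The paper's policy and curiosity modules are not shown here.
import torch

def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Geodesic distance between points inside the unit Poincare ball."""
    sq_u = torch.clamp(torch.sum(u * u, dim=-1), 0.0, 1.0 - eps)
    sq_v = torch.clamp(torch.sum(v * v, dim=-1), 0.0, 1.0 - eps)
    sq_diff = torch.sum((u - v) ** 2, dim=-1)
    x = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    return torch.acosh(torch.clamp(x, min=1.0 + eps))

# Points near the ball's boundary are much farther apart geodesically than their
# Euclidean gap suggests, which is why hyperbolic space can encode tree-like
# (hierarchical) structure, such as nested decision-making, with low distortion.
```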
arXiv Detail & Related papers (2024-07-18T14:40:33Z)
- Unifying Large Language Model and Deep Reinforcement Learning for Human-in-Loop Interactive Socially-aware Navigation [16.789333617628138]
Social robot navigation planners face two major challenges: managing real-time user inputs and ensuring socially compliant behaviors. We introduce SALM, an interactive, human-in-loop Socially-Aware navigation Large Language Model framework. A memory mechanism archives temporal data for continuous refinement, while a multi-step graph-of-thoughts inference-based large language feedback model adaptively fuses the strengths of both planning approaches.
arXiv Detail & Related papers (2024-03-22T23:12:28Z)
- Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation [67.18144414660681]
We propose a Fast-Slow Test-Time Adaptation (FSTTA) approach for online Vision-and-Language Navigation (VLN).
Our method obtains impressive performance gains on four popular benchmarks.
arXiv Detail & Related papers (2023-11-22T07:47:39Z)
- NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration [57.15811390835294]
This paper describes how we can train a single unified diffusion policy to handle both goal-directed navigation and goal-agnostic exploration.
We show that this unified policy results in better overall performance when navigating to visually indicated goals in novel environments.
Our experiments, conducted on a real-world mobile robot platform, show effective navigation in unseen environments in comparison with five alternative methods.
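The unifying idea named in the title is goal masking: the same conditioning network serves both goal-directed navigation and goal-agnostic exploration by zeroing out the goal features when no goal is given. The sketch below is a minimal, hypothetical illustration of that idea; layer sizes and the fusion scheme are assumptions, not NoMaD's implementation.

```python
# Hypothetical sketch of goal-masked conditioning for a shared navigation policy.
# Dimensions and the concatenation-based fusion are assumptions.
import torch
import torch.nn as nn

class GoalMaskedConditioning(nn.Module):
    def __init__(self, obs_dim=256, goal_dim=256, out_dim=512):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(obs_dim + goal_dim, out_dim), nn.ReLU())

    def forward(self, obs_feat, goal_feat, goal_mask):
        # goal_mask: (batch, 1); 1.0 = exploration (drop the goal), 0.0 = use the goal.
        goal_feat = goal_feat * (1.0 - goal_mask)
        return self.fuse(torch.cat([obs_feat, goal_feat], dim=-1))

cond = GoalMaskedConditioning()
obs, goal = torch.randn(2, 256), torch.randn(2, 256)
mask = torch.tensor([[0.0], [1.0]])  # first sample goal-directed, second exploratory
context = cond(obs, goal, mask)      # shared context fed to the downstream policy
```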
arXiv Detail & Related papers (2023-10-11T21:07:14Z)
- CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation [73.78984332354636]
CorNav is a novel zero-shot framework for vision-and-language navigation.
It incorporates environmental feedback for refining future plans and adjusting its actions.
It consistently outperforms all baselines in a zero-shot multi-task setting.
arXiv Detail & Related papers (2023-06-17T11:44:04Z)
- Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration [83.96729205383501]
We introduce prompt-based learning to achieve fast adaptation for language embeddings.
Our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE.
arXiv Detail & Related papers (2022-03-08T11:01:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.