LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating
- URL: http://arxiv.org/abs/2512.09920v1
- Date: Wed, 10 Dec 2025 18:54:30 GMT
- Title: LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating
- Authors: Junting Chen, Yunchuan Li, Panfeng Jiang, Jiacheng Du, Zixuan Chen, Chenrui Tie, Jiajun Deng, Lin Shao
- Abstract summary: We present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. We propose Social-Nav-Modulator, a fast-slow hierarchical system where a VLM agent modulates costmaps and controller parameters. Our method achieves an average success rate of 91.3%, more than 63% higher than the most competitive baseline.
- Score: 47.62872797480247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Towards human-robot coexistence, socially aware navigation is essential for mobile robots. Yet existing studies in this area focus mainly on path efficiency and pedestrian collision avoidance, which are necessary but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions with task goals and the social norms expressed by humans. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast-slow hierarchical system in which a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, more than 63% higher than the most competitive baseline, with most of the improvement observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: https://social-nav.github.io/LISN-project/
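The fast-slow split is the load-bearing design choice here: the VLM rewrites shared costmap weights and controller gains at a low rate, while a reactive controller consumes the latest values at control frequency. Below is a minimal sketch of that pattern, assuming a two-thread design; all names (ControllerParams, query_vlm, local_planner_step) and the concrete parameter values are hypothetical illustrations, not the paper's API.

```python
# Minimal fast-slow modulation sketch (illustrative; names are hypothetical).
import threading
import time
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ControllerParams:
    max_speed: float = 1.0           # m/s; the VLM may lower this near crowds
    person_cost_weight: float = 5.0  # costmap inflation around pedestrians
    forbidden_regions: tuple = ()    # regions named in the instruction

params = ControllerParams()
lock = threading.Lock()

def slow_vlm_loop(query_vlm, period_s=2.0):
    """Slow loop: the VLM reads scene + instruction and rewrites costmap
    weights / controller gains every few seconds."""
    global params
    while True:
        update = query_vlm()  # e.g. {"max_speed": 0.5}
        with lock:
            params = replace(params, **update)
        time.sleep(period_s)

def fast_control_loop(local_planner_step, hz=20.0):
    """Fast loop: low-level action generation never blocks on VLM
    inference; it just snapshots the most recent parameters each tick."""
    while True:
        with lock:
            snapshot = params
        local_planner_step(snapshot)  # dynamic avoidance at control rate
        time.sleep(1.0 / hz)
```

The fast loop stays responsive to pedestrians at 20 Hz even if a VLM query takes seconds, which is the decoupling the abstract credits for the improved dynamic avoidance.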
Related papers
- From Obstacles to Etiquette: Robot Social Navigation with VLM-Informed Path Selection [57.74400052368147]
This paper presents a social robot navigation framework that integrates geometric planning with contextual social reasoning. The system first extracts obstacles and human dynamics to generate geometrically feasible candidate paths, then leverages a fine-tuned vision-language model (VLM) to evaluate these paths. Experiments in four social navigation contexts demonstrate that our method achieves the best overall performance with the lowest personal space violation duration, the minimal pedestrian-facing time, and no social zone intrusions.
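The pipeline in this entry is a generate-then-score loop: a geometric planner proposes feasible candidates and the VLM acts as a social critic. A hedged sketch, where generate_candidate_paths and score_path_with_vlm are placeholder callables rather than the paper's released code:

```python
# Illustrative generate-then-score path selection; all names here are
# hypothetical stand-ins, not the paper's implementation.
from typing import Callable, Sequence

def select_social_path(
    occupancy_grid,
    pedestrian_tracks,
    generate_candidate_paths: Callable[..., Sequence],
    score_path_with_vlm: Callable[..., float],
):
    """Geometric planner proposes feasible paths; a fine-tuned VLM ranks
    them on social appropriateness and the best one is executed."""
    candidates = generate_candidate_paths(occupancy_grid, pedestrian_tracks)
    if not candidates:
        return None  # fall back to stopping / replanning
    # Higher score = more socially appropriate, e.g. keeps personal
    # space and avoids cutting through groups.
    return max(candidates, key=score_path_with_vlm)
```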
arXiv Detail & Related papers (2026-02-09T18:46:12Z) - SocialNav-MoE: A Mixture-of-Experts Vision Language Model for Socially Compliant Navigation with Reinforcement Fine-Tuning [6.245382633570723]
Socially compliant navigation that accounts for human comfort, social norms, and contextual appropriateness remains underexplored. We propose SocialNav-MoE, an efficient Mixture-of-Experts vision language model for socially compliant navigation with reinforcement fine-tuning. Experiments on the SNEI dataset demonstrate that SocialNav-MoE achieves an excellent balance between navigation accuracy and efficiency.
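For readers unfamiliar with the Mixture-of-Experts pattern the title refers to: a router sends each token to a small subset of expert networks, so model capacity grows without a matching per-token compute cost. The sketch below shows generic top-1 routing in PyTorch; the paper's actual expert count, gating design, and reinforcement fine-tuning are not reproduced here.

```python
# Generic sparse MoE layer with top-1 routing (illustrative only).
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Each token pays only one expert's FLOPs.
        logits = self.gate(x)
        weights, idx = logits.softmax(dim=-1).max(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weights[mask, None] * expert(x[mask])
        return out
```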
arXiv Detail & Related papers (2025-12-15T14:21:15Z) - SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation [15.585324177543605]
Embodied navigation that adheres to social norms remains an open research challenge. SocialNav is a foundation model for socially-aware navigation with a hierarchical "brain-action" architecture. SocialNav achieves a +38% success rate and a +46% social compliance rate compared to the state-of-the-art method.
arXiv Detail & Related papers (2025-11-26T07:36:01Z) - SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation [32.75496547879437]
Social navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities that align with the nuanced requirements of social robot navigation. We introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a dataset and benchmark designed to evaluate VLMs for scene understanding.
arXiv Detail & Related papers (2025-09-10T16:47:00Z) - DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation [55.888688171010365]
DORAEMON is a cognitive-inspired framework consisting of Ventral and Dorsal Streams that mimics human navigation capabilities. We evaluate DORAEMON on the HM3D, MP3D and GOAT datasets, where it achieves state-of-the-art performance on both success rate (SR) and success weighted by path length (SPL) metrics.
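SR and SPL are the standard embodied-navigation metrics; SPL (success weighted by path length, Anderson et al. 2018) discounts each success by how much longer the agent's path was than the shortest one. A small reference implementation:

```python
# Success weighted by Path Length (SPL), per Anderson et al.,
# "On Evaluation of Embodied Navigation Agents" (2018).
def spl(successes, shortest_lengths, path_lengths):
    """successes: 1/0 per episode; shortest_lengths: geodesic optimum l_i;
    path_lengths: length p_i the agent actually traveled."""
    terms = [
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, path_lengths)
    ]
    return sum(terms) / len(terms)

# Example: two episodes, one success with a 20% detour, one failure.
print(spl([1, 0], [10.0, 8.0], [12.0, 15.0]))  # -> ~0.417
```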
arXiv Detail & Related papers (2025-05-28T04:46:13Z) - Hyp2Nav: Hyperbolic Planning and Curiosity for Crowd Navigation [58.574464340559466]
We advocate for hyperbolic learning to enable crowd navigation and we introduce Hyp2Nav.
Hyp2Nav leverages the intrinsic properties of hyperbolic geometry to better encode the hierarchical nature of decision-making processes in navigation tasks.
We propose a hyperbolic policy model and a hyperbolic curiosity module that together yield effective social navigation, with the best success rates and returns across multiple simulation settings.
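Hyperbolic policy models typically embed states in the Poincaré ball, where distance grows exponentially toward the boundary, matching tree-like decision hierarchies. A self-contained sketch of that distance function, which is standard for this model of hyperbolic space but not Hyp2Nav's released code:

```python
# Distance in the Poincare ball model of hyperbolic space:
# d(u, v) = arccosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
import math

def poincare_distance(u, v):
    """u, v: points inside the unit ball (sequences of floats, ||x|| < 1)."""
    sq = lambda x: sum(c * c for c in x)
    diff = sq([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq(u)) * (1.0 - sq(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

# Points near the boundary are exponentially far apart, which lets
# hierarchical decision structures embed with low distortion.
print(poincare_distance([0.0, 0.0], [0.5, 0.0]))  # ~1.0986
print(poincare_distance([0.0, 0.0], [0.9, 0.0]))  # ~2.9444
```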
arXiv Detail & Related papers (2024-07-18T14:40:33Z) - Unifying Large Language Model and Deep Reinforcement Learning for Human-in-Loop Interactive Socially-aware Navigation [16.789333617628138]
Social robot navigation planners face two major challenges: managing real-time user inputs and ensuring socially compliant behaviors. We introduce SALM, an interactive, human-in-loop Socially-Aware navigation Large Language Model framework. A memory mechanism archives temporal data for continuous refinement, while a multi-step graph-of-thoughts inference-based large language feedback model adaptively fuses the strengths of both planning approaches.
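Two mechanisms carry this summary: a temporal memory that archives interactions for later refinement, and a feedback model that fuses a reactive local plan with slower language-model guidance. A hedged sketch of both, with all names and the confidence threshold invented for illustration:

```python
# Hedged sketch of a temporal memory plus plan fusion; names are
# illustrative inventions, not SALM's implementation.
from collections import deque

class TemporalMemory:
    def __init__(self, maxlen: int = 256):
        self.events = deque(maxlen=maxlen)  # (time, user_input, outcome)

    def archive(self, t: float, user_input: str, outcome: str) -> None:
        self.events.append((t, user_input, outcome))

    def recent_context(self, k: int = 5) -> str:
        """Summarize the last k interactions for the feedback model."""
        return "\n".join(f"[{t:.1f}s] {u} -> {o}"
                         for t, u, o in list(self.events)[-k:])

def fuse_plans(local_plan, llm_plan, llm_confidence: float):
    """Defer to the language model's global suggestion only when it is
    confident; otherwise keep the reactive local plan."""
    return llm_plan if llm_confidence > 0.7 else local_plan
```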
arXiv Detail & Related papers (2024-03-22T23:12:28Z) - NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [97.88246428240872]
Vision-and-Language Navigation (VLN), a crucial research problem in Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), in which we perform parameter-efficient in-domain training to enable self-guided navigational decision-making.
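NavCoT's disentangled reasoning decomposes each step into imagining the instruction-implied view, filtering observations against it, and predicting an action. A toy prompt template in that spirit follows; the wording is ours, not the paper's:

```python
# Toy navigational chain-of-thought prompt builder (template wording
# is illustrative, not NavCoT's actual prompt).
COT_TEMPLATE = """Instruction: {instruction}
Observed viewpoints: {observations}
Step 1 (imagine): describe the view the instruction implies next.
Step 2 (filter): pick the observed viewpoint best matching that view.
Step 3 (predict): output the action that moves toward it."""

def build_navcot_prompt(instruction: str, observations: list[str]) -> str:
    return COT_TEMPLATE.format(
        instruction=instruction,
        observations="; ".join(
            f"[{i}] {o}" for i, o in enumerate(observations)
        ),
    )

print(build_navcot_prompt(
    "Walk past the sofa and stop at the kitchen door.",
    ["hallway with sofa on left", "kitchen doorway", "stairs going up"],
))
```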
arXiv Detail & Related papers (2024-03-12T07:27:02Z) - Principles and Guidelines for Evaluating Social Robot Navigation Algorithms [44.51586279645062]
Social robot navigation is difficult to evaluate because it involves dynamic human agents and their perceptions of the appropriateness of robot behavior.
Our contributions include (a) a definition of a socially navigating robot as one that respects the principles of safety, comfort, legibility, politeness, social competency, agent understanding, proactivity, and responsiveness to context, (b) guidelines for the use of metrics, development of scenarios, benchmarks, datasets, and simulators to evaluate social navigation, and (c) a social navigation metrics framework to make it easier to compare results from different simulators, robots and datasets.
arXiv Detail & Related papers (2023-06-29T07:31:43Z) - Socially Compliant Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation [92.66286342108934]
Social navigation is the capability of an autonomous agent, such as a robot, to navigate in a 'socially compliant' manner in the presence of other intelligent agents such as humans.
Our dataset contains 8.7 hours, 138 trajectories, and 25 miles of socially compliant, human-teleoperated driving demonstrations.
arXiv Detail & Related papers (2022-03-28T19:09:11Z)