SocialNav-MoE: A Mixture-of-Experts Vision Language Model for Socially Compliant Navigation with Reinforcement Fine-Tuning
- URL: http://arxiv.org/abs/2512.14757v1
- Date: Mon, 15 Dec 2025 14:21:15 GMT
- Title: SocialNav-MoE: A Mixture-of-Experts Vision Language Model for Socially Compliant Navigation with Reinforcement Fine-Tuning
- Authors: Tomohito Kawabata, Xinyu Zhang, Ling Xiao
- Abstract summary: Socially compliant navigation that accounts for human comfort, social norms, and contextual appropriateness remains underexplored. We propose SocialNav-MoE, an efficient Mixture-of-Experts vision language model for socially compliant navigation with reinforcement fine-tuning. Experiments on the SNEI dataset demonstrate that SocialNav-MoE achieves an excellent balance between navigation accuracy and efficiency.
- Score: 6.245382633570723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For robots navigating in human-populated environments, safety and social compliance are equally critical, yet prior work has mostly emphasized safety. Socially compliant navigation that accounts for human comfort, social norms, and contextual appropriateness remains underexplored. Vision language models (VLMs) show promise for this task; however, large-scale models incur substantial computational overhead, leading to higher inference latency and energy consumption, which makes them unsuitable for real-time deployment on resource-constrained robotic platforms. To address this issue, we investigate the effectiveness of small VLM and propose SocialNav-MoE, an efficient Mixture-of-Experts vision language model for socially compliant navigation with reinforcement fine-tuning (RFT). We further introduce a semantic similarity reward (SSR) to effectively leverage RFT for enhancing the decision-making capabilities. Additionally, we study the effectiveness of different small language model types (Phi, Qwen, and StableLM), routing strategies, and vision encoders (CLIP vs. SigLIP, frozen vs. fine-tuned). Experiments on the SNEI dataset demonstrate that SocialNav-MoE achieves an excellent balance between navigation accuracy and efficiency. The proposed SSR function is more effective than hard-level and character-level rewards. Source code will be released upon acceptance.
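The abstract's semantic similarity reward (SSR) scores a model's predicted navigation decision against a reference decision by meaning rather than exact string or character match. The paper does not give its formula here; the sketch below is a minimal illustrative stand-in using bag-of-words cosine similarity (the actual SSR presumably uses learned sentence embeddings), just to show why such a reward is smoother than the hard-level and character-level rewards it is compared against.

```python
import math
from collections import Counter

def semantic_similarity_reward(predicted: str, reference: str) -> float:
    """Illustrative SSR stand-in: cosine similarity between bag-of-words
    vectors of the predicted and reference navigation decisions.
    Returns a value in [0, 1]; identical decisions score 1.0."""
    p = Counter(predicted.lower().split())
    r = Counter(reference.lower().split())
    dot = sum(p[tok] * r[tok] for tok in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in r.values())))
    return dot / norm if norm else 0.0

# A paraphrased decision still earns partial reward, whereas a hard
# (exact-match) reward would score it 0 — the property that makes a
# semantic reward a denser training signal for RFT.
print(semantic_similarity_reward("slow down and yield to the pedestrian",
                                 "slow down and yield to the pedestrian"))
print(semantic_similarity_reward("slow down and yield to the pedestrian",
                                 "turn left at the corridor"))
```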
Related papers
- From Obstacles to Etiquette: Robot Social Navigation with VLM-Informed Path Selection [57.74400052368147]
This paper presents a social robot navigation framework that integrates geometric planning with contextual social reasoning. The system first extracts obstacles and human dynamics to generate geometrically feasible candidate paths, then leverages a fine-tuned vision-language model (VLM) to evaluate these paths. Experiments in four social navigation contexts demonstrate that our method achieves the best overall performance with the lowest personal space violation duration, the minimal pedestrian-facing time, and no social zone intrusions.
arXiv Detail & Related papers (2026-02-09T18:46:12Z) - LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating [47.62872797480247]
We present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. We propose Social-Nav-Modulator, a fast-slow hierarchical system in which a VLM agent modulates costmaps and controller parameters. Our method achieves an average success rate of 91.3%, more than 63% higher than the most competitive baseline.
arXiv Detail & Related papers (2025-12-10T18:54:30Z) - SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation [32.75496547879437]
Social navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities that align with the nuanced requirements of social robot navigation. We introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a dataset and benchmark designed to evaluate VLMs for scene understanding.
arXiv Detail & Related papers (2025-09-10T16:47:00Z) - Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation [12.561993540768729]
We present LE-Nav, an interpretable and scene-aware navigation framework for service robots. To achieve zero-shot scene understanding, we utilize one-shot exemplars and chain-of-thought prompting strategies. Experiments show that LE-Nav can generate hyperparameters achieving human-level tuning across diverse planners and scenarios.
arXiv Detail & Related papers (2025-07-15T05:37:24Z) - SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving [51.47621083057114]
SOLVE is an innovative framework that synergizes Vision-Language Models with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components.
arXiv Detail & Related papers (2025-05-22T15:44:30Z) - Unifying Large Language Model and Deep Reinforcement Learning for Human-in-Loop Interactive Socially-aware Navigation [16.789333617628138]
Social robot navigation planners face two major challenges: managing real-time user inputs and ensuring socially compliant behaviors. We introduce SALM, an interactive, human-in-loop Socially-Aware navigation Large Language Model framework. A memory mechanism archives temporal data for continuous refinement, while a multi-step graph-of-thoughts inference-based large language feedback model adaptively fuses the strengths of both planning approaches.
arXiv Detail & Related papers (2024-03-22T23:12:28Z) - NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [97.88246428240872]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), in which we perform parameter-efficient in-domain training to enable self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z) - Multi-Agent Dynamic Relational Reasoning for Social Robot Navigation [50.01551945190676]
Social robot navigation can be helpful in various contexts of daily life but requires safe human-robot interactions and efficient trajectory planning.
We propose a systematic relational reasoning approach with explicit inference of the underlying dynamically evolving relational structures.
We demonstrate its effectiveness for multi-agent trajectory prediction and social robot navigation.
arXiv Detail & Related papers (2024-01-22T18:58:22Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world. Recent methods aim to mitigate misalignment by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - SocNavGym: A Reinforcement Learning Gym for Social Navigation [0.0]
SocNavGym is an advanced simulation environment for social navigation.
It can generate different types of social navigation scenarios.
It can also be configured to work with different hand-crafted and data-driven social reward signals.
arXiv Detail & Related papers (2023-04-27T11:29:02Z) - Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling [65.99956848461915]
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal. One of the problems of the VLN task is data scarcity, since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments. We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data.
arXiv Detail & Related papers (2019-11-17T18:02:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.