SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation
- URL: http://arxiv.org/abs/2509.08757v1
- Date: Wed, 10 Sep 2025 16:47:00 GMT
- Title: SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation
- Authors: Michael J. Munje, Chen Tang, Shuijing Liu, Zichao Hu, Yifeng Zhu, Jiaxun Cui, Garrett Warnell, Joydeep Biswas, Peter Stone
- Abstract summary: Social navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities that align with the nuanced requirements of social robot navigation. We introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a dataset and benchmark designed to evaluate VLMs for scene understanding.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robot navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities such as object recognition, common-sense reasoning, and contextual understanding, capabilities that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can accurately understand complex social navigation scenes (e.g., inferring the spatial-temporal relations among agents and human intentions), which is essential for safe and socially compliant robot navigation. While some recent works have explored the use of VLMs in social robot navigation, no existing work systematically evaluates their ability to meet these necessary conditions. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus baselines, indicating critical gaps in the social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs. An overview of this paper along with the code and data can be found at https://larg.github.io/socialnav-sub.
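The benchmark's headline metric above is the probability that a VLM's answer agrees with human answers on each VQA question. Below is a minimal sketch of how such an agreement probability could be computed, assuming multiple-choice questions with several human annotators per question; the function name and data layout are illustrative assumptions, not the authors' released evaluation code.

```python
# Hypothetical sketch: probability of agreeing with human answers on a
# multiple-choice VQA benchmark. The data layout is an assumption for
# illustration, not the SocialNav-SUB release format.
from typing import Dict, List


def agreement_probability(model_answers: Dict[str, str],
                          human_answers: Dict[str, List[str]]) -> float:
    """Mean, over questions, of the fraction of human annotators whose
    answer matches the model's answer to the same question."""
    per_question = []
    for qid, model_ans in model_answers.items():
        annotators = human_answers[qid]
        matches = sum(1 for ans in annotators if ans == model_ans)
        per_question.append(matches / len(annotators))
    return sum(per_question) / len(per_question)


if __name__ == "__main__":
    model = {"q1": "A", "q2": "C"}
    humans = {"q1": ["A", "A", "B"], "q2": ["C", "C", "C"]}
    print(f"P(agree with a human) = {agreement_probability(model, humans):.3f}")
```

The rule-based and human-consensus baselines mentioned in the abstract can be scored with the same function by substituting their answers for the model's, which is what makes the reported comparison direct.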
Related papers
- From Obstacles to Etiquette: Robot Social Navigation with VLM-Informed Path Selection [57.74400052368147]
This paper presents a social robot navigation framework that integrates geometric planning with contextual social reasoning. The system first extracts obstacles and human dynamics to generate geometrically feasible candidate paths, then leverages a fine-tuned vision-language model (VLM) to evaluate these paths. Experiments in four social navigation contexts demonstrate that our method achieves the best overall performance, with the lowest personal space violation duration, the minimal pedestrian-facing time, and no social zone intrusions.
arXiv Detail & Related papers (2026-02-09T18:46:12Z)
- SocialNav-MoE: A Mixture-of-Experts Vision Language Model for Socially Compliant Navigation with Reinforcement Fine-Tuning [6.245382633570723]
Socially compliant navigation that accounts for human comfort, social norms, and contextual appropriateness remains underexplored. We propose SocialNav-MoE, an efficient Mixture-of-Experts vision language model for socially compliant navigation with reinforcement fine-tuning. Experiments on the SNEI dataset demonstrate that SocialNav-MoE achieves an excellent balance between navigation accuracy and efficiency.
arXiv Detail & Related papers (2025-12-15T14:21:15Z)
- LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating [47.62872797480247]
We present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. We propose Social-Nav-Modulator, a fast-slow hierarchical system where a VLM agent modulates costmaps and controller parameters. Our method achieves an average success rate of 91.3%, exceeding the most competitive baseline by more than 63%.
arXiv Detail & Related papers (2025-12-10T18:54:30Z)
- SocialEval: Evaluating Social Intelligence of Large Language Models [70.90981021629021]
Social Intelligence (SI) equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals. This presents an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation. We propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts.
arXiv Detail & Related papers (2025-06-01T08:36:51Z)
- Social-LLaVA: Enhancing Robot Navigation through Human-Language Reasoning in Social Spaces [40.44502415484082]
We propose using language to bridge the gap in human-like reasoning between perception and robot actions. We create a vision-language dataset, Social robot Navigation via Explainable Interactions (SNEI), featuring 40K human-annotated Visual Question Answers (VQAs). We fine-tune a VLM, Social-LLaVA, using SNEI to demonstrate the practical application of our dataset.
arXiv Detail & Related papers (2024-12-30T23:59:30Z)
- Principles and Guidelines for Evaluating Social Robot Navigation Algorithms [44.51586279645062]
Social robot navigation is difficult to evaluate because it involves dynamic human agents and their perceptions of the appropriateness of robot behavior.
Our contributions include (a) a definition of a socially navigating robot as one that respects the principles of safety, comfort, legibility, politeness, social competency, agent understanding, proactivity, and responsiveness to context, (b) guidelines for the use of metrics, development of scenarios, benchmarks, datasets, and simulators to evaluate social navigation, and (c) a social navigation metrics framework to make it easier to compare results from different simulators, robots, and datasets (a minimal sketch of one such metric appears after this list).
arXiv Detail & Related papers (2023-06-29T07:31:43Z)
- Gesture2Path: Imitation Learning for Gesture-aware Navigation [54.570943577423094]
We present Gesture2Path, a novel social navigation approach that combines image-based imitation learning with model-predictive control.
We deploy our method on real robots and showcase the effectiveness of our approach in four gesture-navigation scenarios.
arXiv Detail & Related papers (2022-09-19T23:05:36Z)
- Socially Compliant Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation [92.66286342108934]
Social navigation is the capability of an autonomous agent, such as a robot, to navigate in a 'socially compliant' manner in the presence of other intelligent agents such as humans.
Our dataset contains 8.7 hours, 138 trajectories, and 25 miles of socially compliant, human-teleoperated driving demonstrations.
arXiv Detail & Related papers (2022-03-28T19:09:11Z)
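As a concrete instance of the kind of standardized metric the evaluation-guidelines paper above calls for (and which the path-selection paper reports as personal space violation duration), here is a minimal sketch of that metric: the total time a robot spends closer to any pedestrian than a comfort threshold. The 1.2 m threshold, the time-aligned trajectory layout, and all names are illustrative assumptions, not an API from any paper listed here.

```python
# Hypothetical sketch of a common social navigation metric:
# personal-space violation duration. Threshold and data layout are
# illustrative assumptions, not taken from any paper above.
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position in meters


def violation_duration(robot_traj: List[Point],
                       human_trajs: List[List[Point]],
                       dt: float = 0.1,
                       personal_space_m: float = 1.2) -> float:
    """Total time (seconds) the robot spends within personal_space_m of
    any human, given time-aligned trajectories sampled every dt seconds."""
    violating_steps = 0
    for t, (rx, ry) in enumerate(robot_traj):
        for traj in human_trajs:
            hx, hy = traj[t]
            if math.hypot(rx - hx, ry - hy) < personal_space_m:
                violating_steps += 1
                break  # count each timestep at most once
    return violating_steps * dt


if __name__ == "__main__":
    robot = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
    humans = [[(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]]
    print(f"Personal-space violation: {violation_duration(robot, humans):.1f} s")
```

Aggregating geometry-only metrics like this alongside the VQA-style agreement scores above is one way a metrics framework can make results comparable across simulators, robots, and datasets.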