Walk With Me: Long-Horizon Social Navigation for Human-Centric Outdoor Assistance
Abstract Overview
This paper presents Walk with Me, a hierarchical framework for long-horizon outdoor social navigation from high-level human instructions without relying on a pre-built HD map. The system uses public map-service priors (GPS context, candidate points of interest, and walking-route APIs) to ground abstract user intent into a concrete destination and a coarse waypoint sequence. During execution, a High-Level Vision-Language Model (VLM) jointly assesses whether the current situation is routine or safety-critical and decides whether the robot should proceed or stop and wait, while a Low-Level Vision-Language-Action (VLA) policy generates local, socially compliant trajectories for proceed steps. The method is instantiated on an Athena 2.0 Pro AGV wheeled robot and evaluated over 20 real-world outdoor trials in two assistance settings: last-mile delivery and blind guidance.
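To make the "coarse waypoint sequence" concrete, the following is a minimal sketch of one way a dense walking-route polyline could be thinned into coarse waypoints. It is not the paper's method: the function names (`haversine_m`, `coarse_waypoints`), the 30 m spacing, and the input format (a list of latitude/longitude pairs as a route API might return) are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def coarse_waypoints(route, spacing_m=30.0):
    """Thin a dense route into coarse waypoints (hypothetical helper).

    Keeps the first point, then every interior point that lies at least
    `spacing_m` from the previously kept point, and always keeps the
    final point so the destination is preserved.
    """
    if not route:
        return []
    kept = [route[0]]
    for point in route[1:-1]:
        if haversine_m(kept[-1], point) >= spacing_m:
            kept.append(point)
    if len(route) > 1:
        kept.append(route[-1])
    return kept


# Hypothetical dense route: ten points about 11 m apart along a meridian.
dense_route = [(i * 0.0001, 0.0) for i in range(10)]
waypoints = coarse_waypoints(dense_route, spacing_m=30.0)
```

With this synthetic input, the sketch keeps roughly every third point plus the destination. A real system would likely also consider turns and crossings when choosing waypoints, which this sketch ignores.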
Novelty
The main contribution is an outdoor social navigation framework that needs no pre-built HD map. It integrates natural-language intent grounding via public map-service POIs, long-horizon waypoint construction, and an observation-aware routing mechanism that adaptively switches between low-level VLA control and explicit high-level VLM safety reasoning with stop-and-wait behavior. The paper also unifies destination grounding, coarse route planning, and socially aware execution under a single closed-loop hierarchy for human-centric outdoor assistance.
Results
In 20 real-world trials, the full system completed 12 successfully, an overall success rate of 60%. Last-mile delivery succeeded in 70% of 10 trials and blind guidance in 50% of 10 trials; blind guidance is harder because of more open-ended intent grounding and conservative behavior in socially sensitive scenes. Ablation studies on the two delivery scenarios show that both backbone choices materially affect end-to-end success: for the Low-Level VLA, GNM reaches 20% versus 60% for SocialNav, and for the High-Level VLM, Qwen3-VL-8B reaches 30% versus 60% for MiMo-Embodied.
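The overall figure follows directly from the per-scenario numbers. The short check below reproduces the arithmetic; the success counts of 7 and 5 are not stated explicitly above but are derived from the reported 70% and 50% over 10 trials each.

```python
# (successes, trials) per scenario; the counts are derived from the reported
# percentages (70% of 10 and 50% of 10), not quoted directly from the paper.
trials = {
    "last-mile delivery": (7, 10),
    "blind guidance": (5, 10),
}


def success_rate(successes, total):
    """Fraction of successful trials."""
    return successes / total


total_successes = sum(s for s, _ in trials.values())
total_trials = sum(n for _, n in trials.values())
overall = success_rate(total_successes, total_trials)

for name, (s, n) in trials.items():
    print(f"{name}: {s}/{n} = {success_rate(s, n):.0%}")
print(f"overall: {total_successes}/{total_trials} = {overall:.0%}")
```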
Key Points
- Walk with Me grounds abstract human instructions into concrete outdoor destinations using GPS context, POI candidates, and walking-route APIs from public map services, eliminating the need for a pre-built HD map.
- The framework employs an observation-aware routing mechanism where a High-Level VLM jointly assesses scene complexity and safety at each control step, dispatching routine segments to a Low-Level VLA for socially compliant trajectory generation and triggering stop-and-wait behavior when conditions are unsafe.
- Real-world experiments across 20 trials on delivery and blind-guidance scenarios demonstrate kilometer-scale outdoor execution with 60% overall success. Ablations on delivery tasks show clear performance differences across VLM and VLA backbone choices, with a socially aware policy (SocialNav) and a navigation-oriented VLM (MiMo-Embodied) achieving the highest success rates.
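The observation-aware routing described in the second key point can be sketched as a single control step. This is an illustrative skeleton only: the `Observation` fields, the threshold rule standing in for the High-Level VLM, and the straight-line planner standing in for the Low-Level VLA are all hypothetical placeholders, not the paper's models or interfaces.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

Trajectory = List[Tuple[float, float]]


class Decision(Enum):
    PROCEED = "proceed"
    STOP_AND_WAIT = "stop_and_wait"


@dataclass
class Observation:
    # Hypothetical scene summary; the real system reasons over camera images.
    pedestrians_nearby: int
    path_blocked: bool


def high_level_decide(obs: Observation) -> Decision:
    """Stand-in for the High-Level VLM: a simple rule, assumed for illustration."""
    if obs.path_blocked or obs.pedestrians_nearby > 3:
        return Decision.STOP_AND_WAIT
    return Decision.PROCEED


def low_level_plan(obs: Observation, waypoint: Tuple[float, float]) -> Trajectory:
    """Stand-in for the Low-Level VLA: three evenly spaced points toward the
    waypoint, expressed in the robot's local frame (robot at the origin)."""
    return [(waypoint[0] * k / 3, waypoint[1] * k / 3) for k in (1, 2, 3)]


def control_step(obs, waypoint, decide=high_level_decide, plan=low_level_plan):
    """One closed-loop step: the high level gates the low level."""
    decision = decide(obs)
    if decision is Decision.STOP_AND_WAIT:
        return decision, []  # hold position; no trajectory is issued
    return decision, plan(obs, waypoint)


routine = control_step(Observation(pedestrians_nearby=1, path_blocked=False), (3.0, 0.0))
unsafe = control_step(Observation(pedestrians_nearby=1, path_blocked=True), (3.0, 0.0))
```

Passing `decide` and `plan` as parameters mirrors the ablation setup in spirit: either backbone can be swapped without touching the loop itself.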