mmWalk: Towards Multi-modal Multi-view Walking Assistance
- URL: http://arxiv.org/abs/2510.11520v2
- Date: Thu, 23 Oct 2025 16:40:49 GMT
- Title: mmWalk: Towards Multi-modal Multi-view Walking Assistance
- Authors: Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen
- Abstract summary: mmWalk is a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for safe outdoor navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. We generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance.
- Score: 44.184803877778556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for safe outdoor navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) in zero- and few-shot settings and find that they struggle with our risk assessment and navigation tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.
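The evaluation described above (zero- and few-shot VLMs on mmWalkVQA triplets, scored per category) can be pictured with a short sketch. The snippet below is illustrative only: the field names (image, question, answer, category) and the simple exact-match metric are assumptions, not the authors' released code or protocol.

```python
from collections import defaultdict

def evaluate_zero_shot(vqa_triplets, vlm):
    """Per-category exact-match accuracy for a zero-shot VLM.

    vqa_triplets: iterable of dicts with hypothetical keys
        'image', 'question', 'answer', 'category'.
    vlm: any callable mapping (image, question) -> predicted answer string.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for sample in vqa_triplets:
        prediction = vlm(sample["image"], sample["question"])
        total[sample["category"]] += 1
        if prediction.strip().lower() == sample["answer"].strip().lower():
            correct[sample["category"]] += 1
    # One accuracy score per category (e.g., the 9 mmWalkVQA categories)
    return {cat: correct[cat] / total[cat] for cat in total}
```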
Related papers
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning [18.827215649935468]
We build the first multidomain perception outdoor scene understanding dataset, named SVM-City. It contains 420k images and 4,811M point clouds with 567k question-answering pairs from vehicles, low-altitude drones, high-altitude aerial planes, and satellites. Experimental results show that City-VLM surpasses existing LVLMs by an average of 18.14% on question-answering tasks.
arXiv Detail & Related papers (2025-07-17T05:21:21Z) - Learning to Drive Anywhere with Model-Based Reannotation [49.80796496905606]
We develop a framework for generalizable visual navigation policies for robots. We leverage passively collected data, including crowd-sourced teleoperation data and unlabeled YouTube videos, and relabel it with a model-based reannotation scheme. The relabeled data is then distilled into LogoNav, a long-horizon navigation policy conditioned on visual goals or GPS waypoints.
arXiv Detail & Related papers (2025-05-08T18:43:39Z) - TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance [48.12326709517022]
We present TraveLLaMA, a specialized multimodal language model designed for urban scene understanding and travel assistance. Our work addresses the fundamental challenge of developing practical AI travel assistants through a novel large-scale dataset of 220k question-answer pairs.
arXiv Detail & Related papers (2025-04-23T08:32:25Z) - Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding [10.242043337117005]
Vision Large Language Models (VLLMs) have demonstrated impressive capabilities in general visual tasks such as image captioning and visual question answering. However, their effectiveness in specialized, safety-critical domains like autonomous driving remains largely unexplored. We introduce DVBench, a pioneering benchmark designed to evaluate the performance of VLLMs in understanding safety-critical driving videos.
arXiv Detail & Related papers (2025-04-20T07:50:44Z) - GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance [18.467461615621872]
Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV). We introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs. We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities.
arXiv Detail & Related papers (2025-03-17T05:43:40Z) - NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models [11.184459657989914]
We introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. We also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives.
arXiv Detail & Related papers (2025-03-17T03:12:39Z) - DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes [76.24687327731031]
We first study the challenge of open-vocabulary object navigation by introducing DivScene. Our dataset provides a much greater diversity of target objects and scene types than existing datasets. We fine-tune LVLMs to predict the next action with CoT explanations.
arXiv Detail & Related papers (2024-10-03T17:49:28Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - RSRD: A Road Surface Reconstruction Dataset and Benchmark for Safe and Comfortable Autonomous Driving [67.09546127265034]
Road surface reconstruction helps to enhance the analysis and prediction of vehicle responses for motion planning and control systems.
We introduce the Road Surface Reconstruction dataset, a real-world, high-resolution, and high-precision dataset collected with a specialized platform in diverse driving conditions.
It covers common road types and contains approximately 16,000 pairs of stereo images, original point clouds, and ground-truth depth/disparity maps; the standard stereo relation linking disparity and depth is sketched after this entry.
arXiv Detail & Related papers (2023-10-03T17:59:32Z)
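As context for the RSRD modalities above: for a rectified stereo pair, depth and disparity are related by Z = f·B/d, with focal length f (in pixels), baseline B (in meters), and disparity d (in pixels). The snippet below is a minimal sketch of that conversion; the calibration values are placeholders, since this summary does not give RSRD's actual camera parameters.

```python
import numpy as np

# Rectified-stereo relation: depth Z = f * B / d.
# FOCAL_PX and BASELINE_M are PLACEHOLDER values for illustration;
# RSRD's real calibration is not specified in this summary.
FOCAL_PX = 2000.0    # assumed focal length in pixels
BASELINE_M = 0.12    # assumed stereo baseline in meters

def disparity_to_depth(disparity: np.ndarray) -> np.ndarray:
    """Convert a disparity map (pixels) to a depth map (meters),
    masking out invalid (non-positive) disparities."""
    depth = np.zeros_like(disparity, dtype=np.float32)
    valid = disparity > 0
    depth[valid] = FOCAL_PX * BASELINE_M / disparity[valid]
    return depth

# Example: a tiny synthetic disparity map
print(disparity_to_depth(np.array([[10.0, 0.0], [40.0, 80.0]], dtype=np.float32)))
```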