mmWalk: Towards Multi-modal Multi-view Walking Assistance
- URL: http://arxiv.org/abs/2510.11520v2
- Date: Thu, 23 Oct 2025 16:40:49 GMT
- Title: mmWalk: Towards Multi-modal Multi-view Walking Assistance
- Authors: Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen
- Abstract summary: mmWalk is a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for safe outdoor navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. We generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance.
- Score: 44.184803877778556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for safe outdoor navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) in zero- and few-shot settings and find that they struggle with our risk assessment and navigation tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.
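The evaluation described above (zero- and few-shot VLMs on mmWalkVQA triplets, scored per category) can be pictured with a short sketch. The snippet below is illustrative only: the field names (image, question, answer, category) and the simple exact-match metric are assumptions, not the authors' released code or protocol.

```python
from collections import defaultdict

def evaluate_zero_shot(vqa_triplets, vlm):
    """Per-category exact-match accuracy for a zero-shot VLM.

    vqa_triplets: iterable of dicts with hypothetical keys
        'image', 'question', 'answer', 'category'.
    vlm: any callable mapping (image, question) -> predicted answer string.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for sample in vqa_triplets:
        prediction = vlm(sample["image"], sample["question"])
        total[sample["category"]] += 1
        if prediction.strip().lower() == sample["answer"].strip().lower():
            correct[sample["category"]] += 1
    # One accuracy score per category (e.g., the 9 mmWalkVQA categories)
    return {cat: correct[cat] / total[cat] for cat in total}
```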
Related papers
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning [18.827215649935468]
We build the first multidomain perception outdoor scene understanding dataset, named SVM-City. It contains 420k images and 4,811M point clouds with 567k question-answering pairs from vehicles, low-altitude drones, high-altitude aerial planes, and satellites. Experimental results show that City-VLM surpasses existing LVLMs by an average of 18.14% on question-answering tasks.
arXiv Detail & Related papers (2025-07-17T05:21:21Z) - Learning to Drive Anywhere with Model-Based Reannotation [49.80796496905606]
We develop a framework for generalizable visual navigation policies for robots. We leverage passively collected data, including crowd-sourced teleoperation data and unlabeled YouTube videos, and relabel it with a model-based reannotation scheme. The relabeled data is then distilled into LogoNav, a long-horizon navigation policy conditioned on visual goals or GPS waypoints.
arXiv Detail & Related papers (2025-05-08T18:43:39Z) - TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance [48.12326709517022]
We present TraveLLaMA, a specialized multimodal language model designed for urban scene understanding and travel assistance. Our work addresses the fundamental challenge of developing practical AI travel assistants through a novel large-scale dataset of 220k question-answer pairs.
arXiv Detail & Related papers (2025-04-23T08:32:25Z) - Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding [10.242043337117005]
Vision Large Language Models (VLLMs) have demonstrated impressive capabilities in general visual tasks such as image captioning and visual question answering. However, their effectiveness in specialized, safety-critical domains like autonomous driving remains largely unexplored. We introduce DVBench, a pioneering benchmark designed to evaluate the performance of VLLMs in understanding safety-critical driving videos.
arXiv Detail & Related papers (2025-04-20T07:50:44Z) - GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance [18.467461615621872]
Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV). We introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs. We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities.
arXiv Detail & Related papers (2025-03-17T05:43:40Z) - NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models [11.184459657989914]
We introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. We also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives.
arXiv Detail & Related papers (2025-03-17T03:12:39Z) - DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes [76.24687327731031]
We first study the challenge of open-vocabulary object navigation by introducing DivScene. Our dataset provides a much greater diversity of target objects and scene types than existing datasets. We fine-tune LVLMs to predict the next action with CoT explanations.
arXiv Detail & Related papers (2024-10-03T17:49:28Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - RSRD: A Road Surface Reconstruction Dataset and Benchmark for Safe and Comfortable Autonomous Driving [67.09546127265034]
Road surface reconstruction helps to enhance the analysis and prediction of vehicle responses for motion planning and control systems.
We introduce the Road Surface Reconstruction dataset, a real-world, high-resolution, and high-precision dataset collected with a specialized platform in diverse driving conditions.
It covers common road types and contains approximately 16,000 pairs of stereo images, original point clouds, and ground-truth depth/disparity maps; the standard stereo relation linking disparity and depth is sketched after this entry.
arXiv Detail & Related papers (2023-10-03T17:59:32Z)
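As context for the RSRD modalities above: for a rectified stereo pair, depth and disparity are related by Z = f·B/d, with focal length f (in pixels), baseline B (in meters), and disparity d (in pixels). The snippet below is a minimal sketch of that conversion; the calibration values are placeholders, since this summary does not give RSRD's actual camera parameters.

```python
import numpy as np

# Rectified-stereo relation: depth Z = f * B / d.
# FOCAL_PX and BASELINE_M are PLACEHOLDER values for illustration;
# RSRD's real calibration is not specified in this summary.
FOCAL_PX = 2000.0    # assumed focal length in pixels
BASELINE_M = 0.12    # assumed stereo baseline in meters

def disparity_to_depth(disparity: np.ndarray) -> np.ndarray:
    """Convert a disparity map (pixels) to a depth map (meters),
    masking out invalid (non-positive) disparities."""
    depth = np.zeros_like(disparity, dtype=np.float32)
    valid = disparity > 0
    depth[valid] = FOCAL_PX * BASELINE_M / disparity[valid]
    return depth

# Example: a tiny synthetic disparity map
print(disparity_to_depth(np.array([[10.0, 0.0], [40.0, 80.0]], dtype=np.float32)))
```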