Fugu-MT Paper Translation (Abstract): Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision

Paper abstract: Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision

arxiv url: http://arxiv.org/abs/2605.11782v1
Date: Tue, 12 May 2026 08:50:13 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.72921
Title: Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision
Title (reference translation): Urban Risk-Aware Navigation Using VQA-Based Event Maps for People with Low Vision
Authors: Antoni Valls, Jordi Sanchez-Riera
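Abstract summary: Visual impairment affects hundreds of millions of people worldwide, severely limiting their ability to navigate urban environments safely and independently. This paper proposes an event map framework based on visual question answering that leverages Vision-Language Models (VLMs). We benchmark four VQA architectures (ViLT, LLaVA, InstructBLIP, and Qwen-VL) and find that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches.
Reference score (independently computed attention score): 0.5371337604556311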
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual impairment affects hundreds of millions of people worldwide, severely limiting their ability to navigate urban environments safely and independently. While wearable assistive devices offer a promising platform for real-time hazard detection, existing approaches rely on task-specific vision pipelines that lack flexibility and generalizability. In this work, we propose an event map framework based on visual question answering that leverages Vision-Language Models (VLMs) for pedestrian scene description and hazard identification across diverse real-world environments, using a three-level hierarchical query structure to enable fine-grained scene understanding without task-specific retraining. Model responses are aggregated into a weighted risk scoring system that maps street segments into four discrete safety categories, producing navigable risk-aware event maps for route planning. To support evaluation and future research, we introduce a geographically diverse dataset spanning 20 cities across six continents, comprising over 800 annotated images and 18,000 answered questions. We benchmark four VQA architectures (ViLT, LLaVA, InstructBLIP, and Qwen-VL) and find that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches, with Qwen-VL achieving the best overall balance of precision and recall. These results demonstrate the viability of MLLMs as a flexible and generalizable foundation for assistive navigation systems for visually impaired people.
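Abstract (reference translation): Visual impairment affects hundreds of millions of people worldwide and severely limits their ability to move through urban environments safely and independently. Wearable assistive devices offer a promising platform for real-time hazard detection, but existing approaches rely on task-specific vision pipelines that lack flexibility and generalizability. In this work, we propose an event map framework based on visual question answering. The framework uses Vision-Language Models (VLMs) with a three-level hierarchical query structure to describe scenes and identify hazards. Model responses are aggregated into a weighted risk scoring system that maps street segments into four discrete safety categories, producing navigable risk-aware event maps for route planning. To support evaluation and future research, we introduce a geographically diverse dataset spanning 20 cities across six continents, comprising over 800 annotated images and 18,000 answered questions. We benchmark four VQA architectures (ViLT, LLaVA, InstructBLIP, and Qwen-VL) and show that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches, with Qwen-VL achieving the best overall balance of precision and recall. These results show that MLLMs are viable as a flexible and generalizable foundation for assistive navigation systems for visually impaired people.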
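
Illustrative sketch (not from the paper): the abstract describes aggregating model responses into a weighted risk score that maps each street segment into one of four safety categories, but it does not give the weights, thresholds, questions, or category names. The Python below is one possible reading under stated assumptions. The names `Answer`, `risk_score`, and `categorize`, the four labels, the three threshold values, and the example questions are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical labels: the abstract only states that there are four categories.
CATEGORIES = ("safe", "caution", "risky", "dangerous")
# Hypothetical upper bounds for the first three categories (score is in [0, 1]).
THRESHOLDS = (0.25, 0.5, 0.75)


@dataclass(frozen=True)
class Answer:
    """One VQA response for a street segment (assumed yes/no hazard question)."""
    question: str
    is_hazard: bool   # True if the model's answer indicates a hazard
    weight: float     # assumed importance of the question


def risk_score(answers):
    """Weighted fraction of hazard-indicating answers, in [0, 1]."""
    total = sum(a.weight for a in answers)
    if total == 0:
        return 0.0
    return sum(a.weight for a in answers if a.is_hazard) / total


def categorize(score):
    """Map a score to the first category whose upper bound exceeds it."""
    for bound, label in zip(THRESHOLDS, CATEGORIES):
        if score < bound:
            return label
    return CATEGORIES[-1]


# Example segment with made-up questions and weights.
segment = [
    Answer("Is there construction work on the sidewalk?", True, 3.0),
    Answer("Is the sidewalk crowded?", False, 1.0),
    Answer("Are there stairs ahead?", False, 2.0),
]
score = risk_score(segment)
label = categorize(score)
print(f"score={score:.2f} category={label}")
```

Normalizing by the total weight keeps scores comparable between segments that received different numbers of questions. The paper's actual scoring rule may differ.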

Related papers