Fugu-MT Paper Translation (Abstract): Exploring the Use of VLMs for Navigation Assistance for People with Blindness and Low Vision

Paper summary: Exploring the Use of VLMs for Navigation Assistance for People with Blindness and Low Vision

arxiv url: http://arxiv.org/abs/2603.15624v1
Date: Mon, 26 Jan 2026 23:45:12 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:42.323835
Title: Exploring the Use of VLMs for Navigation Assistance for People with Blindness and Low Vision
Title (reference translation): Exploring VLMs for Navigation Assistance for People with Blindness and Low Vision
Authors: Yu Li, Yuchen Zheng, Giles Hamilton-Fletcher, Marco Mezzavilla, Yao Wang, Sundeep Rangan, Maurizio Porfiri, Zhou Yu, John-Ross Rizzo
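Abstract summary: This paper examines the usefulness of vision-language models (VLMs) for navigation tasks for people with blindness and low vision. We evaluated closed-source models such as GPT-4V, GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet alongside open-source models such as Llava-v1.6-mistral and Llava-onevision-qwen.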
Reference score (attention metric computed independently by this site): 25.11164612463911
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper investigates the potential of vision-language models (VLMs) to assist people with blindness and low vision (pBLV) in navigation tasks. We evaluate state-of-the-art closed-source models, including GPT-4V, GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, alongside open-source models, such as Llava-v1.6-mistral and Llava-onevision-qwen, to analyze their capabilities in foundational visual skills: counting ambient obstacles, relative spatial reasoning, and common-sense wayfinding-pertinent scene understanding. We further assess their performance in navigation scenarios, using pBLV-specific prompts designed to simulate real-world assistance tasks. Our findings reveal notable performance disparities between these models: GPT-4o consistently outperforms others across all tasks, particularly in spatial reasoning and scene understanding. In contrast, open-source models struggle with nuanced reasoning and adaptability in complex environments. Common challenges include difficulties in accurately counting objects in cluttered settings, biases in spatial reasoning, and a tendency to prioritize object details over spatial feedback, limiting their usability for pBLV in navigation tasks. Despite these limitations, VLMs show promise for wayfinding assistance when better aligned with human feedback and equipped with improved spatial reasoning. This research provides actionable insights into the strengths and limitations of current VLMs, guiding developers on effectively integrating VLMs into assistive technologies while addressing key limitations for enhanced usability.
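Abstract (reference translation): This paper examines the usefulness of vision-language models (VLMs) for navigation tasks for people with blindness and low vision (pBLV). We evaluate state-of-the-art closed-source models such as GPT-4V, GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, together with open-source models such as Llava-v1.6-mistral and Llava-onevision-qwen, and analyze their foundational visual skills: counting ambient obstacles, relative spatial reasoning, and common-sense scene understanding relevant to wayfinding. We further evaluate their performance in navigation scenarios using pBLV-specific prompts designed to simulate real-world assistance tasks. GPT-4o consistently outperforms the other models on all tasks, particularly in spatial reasoning and scene understanding. In contrast, open-source models struggle with nuanced reasoning and adaptability in complex environments. Common challenges include difficulty counting objects accurately in cluttered settings, biases in spatial reasoning, and a tendency to prioritize object details over spatial feedback, all of which limit usability for pBLV in navigation tasks. Despite these limitations, VLMs show promise for wayfinding assistance when they are better aligned with human feedback and have improved spatial reasoning. This study offers practical insights into the strengths and limitations of current VLMs and guides developers in integrating VLMs effectively into assistive technologies while addressing key limitations to improve usability.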

Related papers