GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance
- URL: http://arxiv.org/abs/2503.12844v1
- Date: Mon, 17 Mar 2025 05:43:40 GMT
- Title: GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance
- Authors: Junhyeok Kim, Jaewoo Park, Junhee Park, Sangeyl Lee, Jiwan Chung, Jisung Kim, Ji Hoon Joung, Youngjae Yu
- Abstract summary: Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV). We introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs. We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities.
- Score: 18.467461615621872
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV), with 7% of visually impaired individuals experiencing falls at least once a month. While recent advances in Multimodal Large Language Models (MLLMs) offer promising opportunities for BLV assistance, their development has been hindered by limited datasets. This limitation stems from the fact that BLV-aware annotation requires specialized domain knowledge and intensive labor. To address this gap, we introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs (including 2K human-annotated pairs) that capture diverse real-world scenes from a pedestrian's viewpoint. Our approach shifts the annotation burden from generation to verification through a collaborative human-AI framework grounded in established accessibility standards, significantly improving efficiency while maintaining high-quality annotations. We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities, specifically object recognition and relative depth perception. Our experimental results highlight the importance of accurate spatial understanding for effective BLV guidance. GuideDog and GuideDogQA will advance research in MLLM-based assistive technologies for BLV individuals while contributing to broader applications in understanding egocentric scenes for robotics and augmented reality. The code and dataset will be publicly available.
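For concreteness, below is a minimal sketch of how an MLLM might be scored on GuideDogQA-style multiple-choice questions. Since the dataset has not yet been released, the record fields (image, question, choices, answer) and the query_mllm interface are hypothetical assumptions, not the authors' actual format.

```python
# Hypothetical sketch: accuracy of an MLLM on GuideDogQA-style
# multiple-choice questions. Field names and query_mllm() are
# assumptions; the released dataset may differ.
import json

def evaluate_mcq(samples_path, query_mllm):
    correct = total = 0
    with open(samples_path) as f:
        for line in f:  # assume one JSON record per line
            sample = json.loads(line)
            # Render the question and lettered answer choices as one prompt
            prompt = sample["question"] + "\n" + "\n".join(
                f"({label}) {choice}"
                for label, choice in zip("ABCD", sample["choices"])
            )
            # query_mllm(image_path, prompt) -> one of "A".."D" (assumed)
            prediction = query_mllm(sample["image"], prompt)
            correct += prediction == sample["answer"]
            total += 1
    return correct / total  # accuracy over the 818 GuideDogQA samples
```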
Related papers
- Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 3.49 million questions and 3.32 million images.
Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives.
We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z)
- Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users [42.132487737233845]
This paper explores the effectiveness of Multimodal Large Language Models (MLLMs) as assistive technologies for visually impaired individuals.
We conduct a user survey to identify adoption patterns and key challenges users face with such technologies.
arXiv Detail & Related papers (2025-03-28T16:54:25Z)
- DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding [61.26026947423187]
Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features. Current Multimodal Large Language Models (MLLMs) struggle to integrate reasoning into visual perception. We propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities.
arXiv Detail & Related papers (2025-03-17T04:06:34Z)
- Can LVLMs and Automatic Metrics Capture Underlying Preferences of Blind and Low-Vision Individuals for Navigational Aid? [16.31494394717809]
Blind and Low-Vision (BLV) people need assistance understanding their surroundings, especially in unfamiliar environments. The preferences of BLV users for diverse types and styles of responses from Large Vision-Language Models (LVLMs) have yet to be studied. We first construct the Eye4B dataset, consisting of 1.1k human-validated, curated outdoor/indoor scenes with 5-10 relevant requests per scene. We then conduct an in-depth user study with eight BLV users to evaluate their preferences on six LVLMs from perspectives including Afraidness, Nonactionability, Sufficiency, and Conciseness.
arXiv Detail & Related papers (2025-02-15T10:17:52Z)
- WalkVLM: Aid Visually Impaired People Walking by Vision Language Model [29.340362062804967]
We introduce the first large-scale dataset dedicated to walking assistance, comprising 12,000 video-annotation pairs. A WalkVLM model is proposed, which employs chain of thought for hierarchical planning to generate concise but informative reminders (a rough prompt sketch follows below). We have established a solid benchmark for the blind walking task and verified the advantages of WalkVLM in stream video processing.
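As a loose illustration of chain-of-thought-style hierarchical planning for walking reminders, the sketch below prompts a vision-language model to enumerate hazards, filter them by proximity, and then emit one short reminder. The prompt wording and the model.generate interface are assumptions for illustration, not WalkVLM's actual implementation.

```python
# Illustrative only: a chain-of-thought prompt that plans hierarchically
# (perceive -> filter by path/distance -> summarize) before answering.
COT_PROMPT = """You are a walking assistant for a blind pedestrian.
Step 1: List the obstacles and hazards visible in the frame.
Step 2: Keep only those on the user's path within roughly 3 meters.
Step 3: Output ONE reminder under 10 words, or 'clear' if none remain.
"""

def walking_reminder(model, frame):
    # model.generate(image, prompt) -> str is an assumed interface
    return model.generate(image=frame, prompt=COT_PROMPT)
```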
arXiv Detail & Related papers (2024-12-30T12:29:02Z)
- HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs. HumanVBench comprises 16 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects. A comprehensive evaluation across 22 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and emotion perception.
arXiv Detail & Related papers (2024-12-23T13:45:56Z)
- VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment [55.7956150385255]
We investigate the efficacy of AI feedback to scale supervision for aligning vision-language models.
We introduce VLFeedback, the first large-scale vision-language feedback dataset.
We train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback.
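For reference, the direct preference optimization (DPO) objective mentioned above can be written in a few lines. This is a generic sketch of the published DPO loss (Rafailov et al., 2023), not the Silkie training code.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Generic DPO loss over summed log-probabilities that the policy
    and a frozen reference model assign to preferred (chosen) and
    dispreferred (rejected) responses."""
    # Log-ratio of policy to reference for each response
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer chosen over rejected responses
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```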
arXiv Detail & Related papers (2024-10-12T07:56:47Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models [86.85389322710674]
This work presents an early and holistic evaluation of Large Vision-Language Models (LVLMs).
It proposes a lightweight variant of LVLM-eHub, named Tiny LVLM-eHub.
It provides a systematic assessment of six categories of multimodal capabilities, including visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence.
arXiv Detail & Related papers (2023-08-07T17:17:05Z)