Can LVLMs and Automatic Metrics Capture Underlying Preferences of Blind and Low-Vision Individuals for Navigational Aid?
- URL: http://arxiv.org/abs/2502.14883v1
- Date: Sat, 15 Feb 2025 10:17:52 GMT
- Title: Can LVLMs and Automatic Metrics Capture Underlying Preferences of Blind and Low-Vision Individuals for Navigational Aid?
- Authors: Na Min An, Eunki Kim, Wan Ju Kang, Sangryul Kim, Hyunjung Shim, James Thorne
- Abstract summary: Blind and Low-Vision (BLV) people need assistance understanding their surroundings, especially in unfamiliar environments. The preferences of BLV users for diverse types/styles of responses from Large Vision-Language Models (LVLMs) have not yet been studied. We first construct the Eye4B dataset, consisting of 1.1k human-validated, curated outdoor/indoor scenes with 5-10 relevant requests per scene. Then, we conduct an in-depth user study with eight BLV users to evaluate their preferences on six LVLMs from five perspectives, including Afraidness, Nonactionability, Sufficiency, and Conciseness.
- Score: 16.31494394717809
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision is a primary means by which humans perceive the environment, but Blind and Low-Vision (BLV) people need assistance understanding their surroundings, especially in unfamiliar environments. The emergence of semantic-based systems as assistance tools for BLV users has motivated many researchers to explore responses from Large Vision-Language Models (LVLMs). However, the preferences of BLV users for diverse types/styles of LVLM responses, specifically for navigational aid, have yet to be studied. To fill this gap, we first construct the Eye4B dataset, consisting of 1.1k human-validated, curated outdoor/indoor scenes with 5-10 relevant requests per scene. Then, we conduct an in-depth user study with eight BLV users to evaluate their preferences on six LVLMs from five perspectives, including Afraidness, Nonactionability, Sufficiency, and Conciseness. Finally, we introduce the Eye4B benchmark for evaluating the alignment between widely used model-based image-text metrics and our collected BLV preferences. Our work can serve as a guideline for developing BLV-aware LVLMs towards a Barrier-Free AI system.
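The alignment check described in the abstract amounts to rank-correlating automatic image-text metric scores with human ratings. The sketch below is a minimal illustration of that idea, assuming hypothetical per-response metric scores and averaged BLV preference ratings; the values, variable names, and choice of correlation statistics are placeholders, not the paper's actual protocol.

```python
# Sketch: correlating an automatic image-text metric with BLV preference ratings.
# All data below is illustrative; the Eye4B benchmark's actual metrics, rating
# scales, and evaluation axes (e.g., Sufficiency, Conciseness) may differ.
from scipy.stats import spearmanr, kendalltau

# Hypothetical per-response scores from an automatic image-text metric
# (e.g., a CLIPScore-style model), one score per LVLM response.
metric_scores = [0.62, 0.71, 0.55, 0.80, 0.48, 0.67]

# Hypothetical averaged BLV preference ratings for the same responses
# (e.g., mean ratings on a Likert scale).
blv_ratings = [3.2, 4.1, 2.8, 4.5, 2.5, 3.6]

# Rank correlations quantify how well the metric orders responses the way
# BLV users do; higher values indicate better alignment.
rho, rho_p = spearmanr(metric_scores, blv_ratings)
tau, tau_p = kendalltau(metric_scores, blv_ratings)

print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
print(f"Kendall tau  = {tau:.3f} (p = {tau_p:.3f})")
```

Rank correlations are a natural choice for this kind of check because they compare orderings rather than raw score magnitudes, which are not on the same scale as human ratings.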
Related papers
- Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions [5.6629291915019975]
We ask sighted individuals to assess, rather than produce, diagram descriptions generated by vision-language models (VLMs).
We release Sightation, a collection of diagram description datasets spanning 5k diagrams and 137k samples for completion, preference, retrieval, question answering, and reasoning training purposes.
arXiv Detail & Related papers (2025-03-17T16:52:46Z)
- GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance [18.467461615621872]
Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV).
We introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs.
We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities.
arXiv Detail & Related papers (2025-03-17T05:43:40Z)
- Generating Contextually-Relevant Navigation Instructions for Blind and Low Vision People [9.503205949175966]
Navigating unfamiliar environments presents significant challenges for blind and low-vision (BLV) individuals.
We construct a dataset of images and goals across different scenarios such as searching through kitchens or navigating outdoors.
arXiv Detail & Related papers (2024-07-11T06:40:36Z)
- WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences [122.87483437694706]
We launch WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate vision-language models (VLMs).
WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-Arena Elo.
Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs.
arXiv Detail & Related papers (2024-06-16T20:53:25Z)
- B-AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Black-box Adversarial Visual-Instructions [73.97665608366447]
Large Vision-Language Models (LVLMs) have shown significant progress in responding well to visual-instructions from users. These instructions, encompassing images and text, are susceptible to both intentional and inadvertent attacks. We introduce B-AVIBench, a framework designed to analyze the robustness of LVLMs when facing various Black-box Adversarial Visual-Instructions.
arXiv Detail & Related papers (2024-03-14T12:51:07Z)
- Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, showing discrepancy when given textual and visual inputs that correspond to the same concept. We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z)
- TouchStone: Evaluating Vision-Language Models by Language Models [91.69776377214814]
We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs.
We construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks.
We demonstrate that powerful LLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone.
arXiv Detail & Related papers (2023-08-31T17:52:04Z)
- TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models [86.85389322710674]
This work presents an early and holistic evaluation of Large Vision-Language Models (LVLMs).
It proposes a lightweight variant of LVLM-eHub, named Tiny LVLM-eHub.
It provides a systematic assessment of six categories of multimodal capabilities, including visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence.
arXiv Detail & Related papers (2023-08-07T17:17:05Z)
- LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM Evaluation Hub (LVLM-eHub).
Our LVLM-eHub consists of 8 representative LVLMs, such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, instruction-tuned LVLMs with massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in open-world scenarios.
arXiv Detail & Related papers (2023-06-15T16:39:24Z)