Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions
- URL: http://arxiv.org/abs/2503.13369v1
- Date: Mon, 17 Mar 2025 16:52:46 GMT
- Title: Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions
- Authors: Wan Ju Kang, Eunki Kim, Na Min An, Sangryul Kim, Haemin Choi, Ki Hoon Kwak, James Thorne
- Abstract summary: We ask sighted individuals to assess -- rather than produce -- diagram descriptions generated by vision-language models (VLMs). We release Sightation, a collection of diagram description datasets spanning 5k diagrams and 137k samples for completion, preference, retrieval, question answering, and reasoning training purposes.
- Score: 5.6629291915019975
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Often, the needs and visual abilities of the annotator group differ from those of the end user group. Generating detailed diagram descriptions for blind and low-vision (BLV) users is one such challenging domain. Sighted annotators can describe visuals with ease, but existing studies have shown that the descriptions they produce directly are costly, bias-prone, and somewhat lacking by BLV standards. In this study, we ask sighted individuals to assess -- rather than produce -- diagram descriptions generated by vision-language models (VLMs) that have been guided with latent supervision via multi-pass inference. The sighted assessments prove effective and useful to professional educators who are themselves BLV and teach visually impaired learners. We release Sightation, a collection of diagram description datasets spanning 5k diagrams and 137k samples for completion, preference, retrieval, question answering, and reasoning training purposes, and demonstrate their fine-tuning potential in various downstream tasks.
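As an illustration of how the preference split of such a dataset might be consumed, the sketch below computes a Bradley-Terry-style pairwise preference loss over a chosen and a rejected description. This is a hypothetical sketch: the record fields and scores shown are illustrative only, not Sightation's actual schema.

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood (Bradley-Terry) that the chosen
    description is preferred over the rejected one."""
    # Sigmoid of the score margin; loss approaches 0 as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

# Hypothetical preference record: field names are illustrative only.
sample = {
    "diagram_id": "diag-0001",
    "chosen": "A bar chart comparing rainfall across four cities...",
    "rejected": "A chart with bars.",
    "score_chosen": 1.8,
    "score_rejected": 0.3,
}

loss = pairwise_preference_loss(sample["score_chosen"], sample["score_rejected"])
```

At equal scores the loss is log 2, and it shrinks as the assessor's margin in favor of the chosen description widens, which is the signal a preference-tuning objective would exploit.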
Related papers
- Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models [127.38740043393527]
We propose ViFT, a visual instruction-free fine-tuning framework for LVLMs. We require only text-only instructions and image-caption data during training, learning the task-solving and visual perception abilities separately. Experimental results demonstrate that ViFT achieves state-of-the-art performance on several visual reasoning and visual instruction-following benchmarks.
arXiv Detail & Related papers (2025-02-17T04:38:12Z)
- Can LVLMs and Automatic Metrics Capture Underlying Preferences of Blind and Low-Vision Individuals for Navigational Aid? [16.31494394717809]
Blind and Low-Vision (BLV) people need assistance understanding their surroundings, especially in unfamiliar environments. The preferences of BLV users for diverse types and styles of responses from Large Vision-Language Models (LVLMs) have not yet been studied. We first construct the Eye4B dataset, consisting of 1.1k human-validated, curated outdoor/indoor scenes with 5-10 relevant requests per scene. Then, we conduct an in-depth user study with eight BLV users to evaluate their preferences on six LVLMs from five perspectives, including Afraidness, Nonactionability, Sufficiency, and Conciseness.
arXiv Detail & Related papers (2025-02-15T10:17:52Z)
- FiVL: A Framework for Improved Vision-Language Alignment [10.184567639685321]
We introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding. These datasets can be used both to train and to assess an LVLM's ability to use image content as substantive evidence. To demonstrate the utility of our dataset, we introduce an innovative training task that outperforms baselines, alongside a validation method and an application for explainability.
arXiv Detail & Related papers (2024-12-19T09:24:10Z)
- Generating Contextually-Relevant Navigation Instructions for Blind and Low Vision People [9.503205949175966]
Navigating unfamiliar environments presents significant challenges for blind and low-vision (BLV) individuals.
We construct a dataset of images and goals across different scenarios such as searching through kitchens or navigating outdoors.
arXiv Detail & Related papers (2024-07-11T06:40:36Z)
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning [92.85265959892115]
This paper introduces the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction.
Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers.
To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts.
arXiv Detail & Related papers (2023-06-26T10:26:33Z)
- Evaluating how interactive visualizations can assist in finding samples where and how computer vision models make mistakes [1.76602679361245]
We present two interactive visualizations in the context of Sprite, a system for creating Computer Vision (CV) models.
We study how these visualizations help Sprite's users identify (evaluate) and select (plan) images where a model is struggling and can lead to improved performance.
arXiv Detail & Related papers (2023-05-19T14:43:00Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
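The three terms described above can be sketched in pure Python as follows. This is a minimal illustration of combining two InfoNCE-style contrastive terms with a distillation term; the function names, the squared-error distillation surrogate, and the weights are assumptions for illustration, not the paper's actual formulation.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE contrastive loss: pull the anchor toward the positive
    while pushing it away from the negatives."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[0]

def sgva_style_loss(img, img_aug, txt, neg_imgs, neg_txts,
                    teacher_sim, student_sim,
                    alpha=1.0, beta=1.0, gamma=1.0):
    """Illustrative combination of the three terms: a vision-specific
    contrastive term (image vs. augmented image), a cross-modal term
    (image vs. caption embedding), and a simple distillation term that
    matches a teacher similarity score (squared-error surrogate)."""
    l_vis = info_nce(img, img_aug, neg_imgs)
    l_xmod = info_nce(img, txt, neg_txts)
    l_distill = (teacher_sim - student_sim) ** 2
    return alpha * l_vis + beta * l_xmod + gamma * l_distill
```

The loss is small when the anchor's positive pair is far more similar than the negatives and the student's similarity tracks the teacher's; a real implementation would operate on batched tensor embeddings rather than Python lists.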
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Semi-Supervised and Unsupervised Deep Visual Learning: A Survey [76.2650734930974]
Semi-supervised learning and unsupervised learning offer promising paradigms to learn from an abundance of unlabeled visual data.
We review the recent advanced deep learning algorithms on semi-supervised learning (SSL) and unsupervised learning (UL) for visual recognition from a unified perspective.
arXiv Detail & Related papers (2022-08-24T04:26:21Z)
- Visual Distant Supervision for Scene Graph Generation [66.10579690929623]
Scene graph models usually require supervised learning on large quantities of labeled data with intensive human annotation.
We propose visual distant supervision, a novel paradigm of visual relation learning, which can train scene graph models without any human-labeled data.
Comprehensive experimental results show that our distantly supervised model outperforms strong weakly supervised and semi-supervised baselines.
arXiv Detail & Related papers (2021-03-29T06:35:24Z)
- Quantifying Learnability and Describability of Visual Concepts Emerging in Representation Learning [91.58529629419135]
We consider how to characterise visual groupings discovered automatically by deep neural networks.
We introduce two concepts, visual learnability and describability, that can be used to quantify the interpretability of arbitrary image groupings.
arXiv Detail & Related papers (2020-10-27T18:41:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.