Towards Context-aware Support for Color Vision Deficiency: An Approach Integrating LLM and AR
- URL: http://arxiv.org/abs/2407.04362v1
- Date: Fri, 5 Jul 2024 09:03:52 GMT
- Title: Towards Context-aware Support for Color Vision Deficiency: An Approach Integrating LLM and AR
- Authors: Shogo Morita, Yan Zhang, Takuto Yamauchi, Sinan Chen, Jialong Li, Kenji Tei
- Abstract summary: People with color vision deficiency often face challenges in distinguishing colors such as red and green.
Current support tools mainly focus on presentation-based aids, like the color vision modes found in iPhone accessibility settings.
This paper proposes an application that provides contextual and autonomous assistance.
- Score: 2.4560886170097573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: People with color vision deficiency often face challenges in distinguishing colors such as red and green, which can complicate daily tasks and require the use of assistive tools or environmental adjustments. Current support tools mainly focus on presentation-based aids, like the color vision modes found in iPhone accessibility settings. However, offering context-aware support, like indicating the doneness of meat, remains a challenge since task-specific solutions are not cost-effective for all possible scenarios. To address this, our paper proposes an application that provides contextual and autonomous assistance. The application is mainly composed of: (i) an augmented reality interface that efficiently captures context; and (ii) a multi-modal large language model-based reasoner that interprets the context and reasons about the appropriate support content. Preliminary user experiments with two users with color vision deficiency across five different scenarios have demonstrated the effectiveness and generality of our application.
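A minimal sketch of the pipeline described in the abstract is given below: a frame captured by the AR interface, together with the user's question, is sent to a multimodal LLM, which returns support content phrased in color-independent terms. The OpenAI client, the `gpt-4o` model name, the system prompt, and the `describe_scene_for_cvd` helper are illustrative assumptions only; the paper does not specify which MLLM, prompts, or AR toolkit it uses.

```python
# Illustrative sketch (not the authors' implementation): send an AR-captured
# frame plus the user's question to a multimodal LLM and return suggested
# support content described in color-independent terms.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You assist a user with red-green color vision deficiency. "
    "From the captured scene, identify color-dependent information the user "
    "may miss (e.g., doneness of meat, status LEDs, ripeness) and explain it "
    "without relying on color names the user cannot distinguish."
)

def describe_scene_for_cvd(image_path: str, user_question: str) -> str:
    """Send one captured frame and the user's question to a multimodal LLM."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder multimodal model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            },
        ],
    )
    return response.choices[0].message.content

# Example: a frame exported from the AR headset while cooking.
# print(describe_scene_for_cvd("frame.jpg", "Is this steak done?"))
```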
Related papers
- I2CR: Intra- and Inter-modal Collaborative Reflections for Multimodal Entity Linking [8.758773321492809]
We propose a novel framework for the multimodal entity linking task, called Intra- and Inter-modal Collaborative Reflections.
Our framework consistently outperforms current state-of-the-art methods in the task, achieving improvements of 3.2%, 5.1%, and 1.6%, respectively.
arXiv Detail & Related papers (2025-08-04T09:43:54Z) - True Multimodal In-Context Learning Needs Attention to the Visual Context [69.63677595066012]
Multimodal Large Language Models (MLLMs) have enabled Multimodal In-Context Learning (MICL): adapting to new tasks.
Current MLLMs tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation.
We introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context.
arXiv Detail & Related papers (2025-07-21T17:08:18Z) - Diagnosing Vision Language Models' Perception by Leveraging Human Methods for Color Vision Deficiencies [23.761989930955522]
We evaluate Vision Language Models' ability to account for individual-level perceptual variation using the Ishihara Test.
Our results show that LVLMs can explain color vision deficiencies in natural language, but they cannot simulate how people with CVDs perceive color in image-based tasks.
arXiv Detail & Related papers (2025-05-23T04:43:55Z) - Grounding Task Assistance with Multimodal Cues from a Single Demonstration [17.975173937253494]
We introduce MICA (Multimodal Interactive Contextualized Assistance), a framework that improves conversational agents for task assistance by integrating eye gaze and speech cues.
Evaluations on questions derived from real-time chat-assisted task replication show that multimodal cues significantly improve response quality over frame-based retrieval.
arXiv Detail & Related papers (2025-05-02T20:43:11Z) - ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness [23.857004537384]
It is unclear whether vision-language models (VLMs) can perceive, understand, and leverage color as humans.
This paper introduces ColorBench, a benchmark to assess the capabilities of VLMs in color understanding.
arXiv Detail & Related papers (2025-04-10T16:36:26Z) - Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users [42.132487737233845]
This paper explores the effectiveness of Multimodal Large Language models (MLLMs) as assistive technologies for visually impaired individuals.
We conduct a user survey to identify adoption patterns and key challenges users face with such technologies.
arXiv Detail & Related papers (2025-03-28T16:54:25Z) - Infrared and Visible Image Fusion: From Data Compatibility to Task Adaption [65.06388526722186]
Infrared-visible image fusion (IVIF) is a critical task in computer vision.
There is a lack of recent comprehensive surveys that address this rapidly expanding domain.
We introduce a multi-dimensional framework to elucidate common learning-based IVIF methods.
arXiv Detail & Related papers (2025-01-18T13:17:34Z) - VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks [48.67062958311173]
VL-GLUE is a multitask benchmark for natural language understanding.
We show that this benchmark is quite challenging for existing large-scale vision-language models.
arXiv Detail & Related papers (2024-10-17T15:27:17Z) - @Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology [31.779074930032184]
Human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously.
We first create a novel AT benchmark (@Bench) guided by a pre-design user study with PVIs.
In addition, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded with additional assistive functions for helping PVIs.
arXiv Detail & Related papers (2024-09-21T18:30:17Z) - StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond [68.0107158115377]
We have crafted an efficient vision-language model, StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images.
We enhance the perception and comprehension abilities of StrucTexTv3 through instruction learning.
Our method achieved SOTA results in text-rich image perception tasks, and significantly improved performance in comprehension tasks.
arXiv Detail & Related papers (2024-05-31T16:55:04Z) - SoMeLVLM: A Large Vision Language Model for Social Media Processing [78.47310657638567]
We introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM).
SoMeLVLM is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation.
Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks.
arXiv Detail & Related papers (2024-02-20T14:02:45Z) - Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts [83.03471704115786]
We introduce improved Prompt Diffusion (iPromptDiff) in this study.
iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector.
We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks.
arXiv Detail & Related papers (2023-12-03T14:15:52Z) - Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding [55.65727739645824]
Chat-UniVi is a Unified Vision-language model capable of comprehending and engaging in conversations involving images and videos.
We employ a set of dynamic visual tokens to uniformly represent images and videos.
We leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details.
arXiv Detail & Related papers (2023-11-14T10:11:36Z) - Unifying Image Processing as Visual Prompting Question Answering [62.84955983910612]
Image processing is a fundamental task in computer vision, which aims at enhancing image quality and extracting essential features for subsequent vision applications.
Traditionally, task-specific models are developed for individual tasks and designing such models requires distinct expertise.
We propose a universal model for general image processing that covers image restoration, image enhancement, and image feature extraction tasks.
arXiv Detail & Related papers (2023-10-16T15:32:57Z) - Personalizing image enhancement for critical visual tasks: improved legibility of papyri using color processing and visual illusions [0.0]
Methods: Novel enhancement algorithms based on color processing and visual illusions are compared to classic methods in a user experience experiment.
Users exhibited a broad behavioral spectrum, under the influence of factors such as personality and social conditioning, tasks and application domains, expertise level and image quality, and affordances of software, hardware, and interfaces.
arXiv Detail & Related papers (2021-03-11T23:48:17Z) - Real-time single image depth perception in the wild with handheld devices [45.26484111468387]
Two main issues limit depth estimation from handheld devices in the wild.
We show that both are addressable by adopting appropriate network design and training strategies.
We report experimental results concerning real-time depth-aware augmented reality and image blurring with smartphones in the wild.
arXiv Detail & Related papers (2020-06-10T08:30:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the information presented and is not responsible for any consequences arising from its use.