Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users
- URL: http://arxiv.org/abs/2503.22610v1
- Date: Fri, 28 Mar 2025 16:54:25 GMT
- Title: Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users
- Authors: Antonia Karamolegkou, Malvina Nikandrou, Georgios Pantazopoulos, Danae Sanchez Villegas, Phillip Rust, Ruchira Dhar, Daniel Hershcovich, Anders Søgaard
- Abstract summary: This paper explores the effectiveness of Multimodal Large Language Models (MLLMs) as assistive technologies for visually impaired individuals. We conduct a user survey to identify adoption patterns and key challenges users face with such technologies.
- Score: 42.132487737233845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper explores the effectiveness of Multimodal Large Language Models (MLLMs) as assistive technologies for visually impaired individuals. We conduct a user survey to identify adoption patterns and key challenges users face with such technologies. Despite a high adoption rate of these models, our findings highlight concerns related to contextual understanding, cultural sensitivity, and complex scene understanding, particularly for individuals who may rely solely on them for visual interpretation. Informed by these results, we collate five user-centred tasks with image and video inputs, including a novel task on Optical Braille Recognition. Our systematic evaluation of twelve MLLMs reveals that further advancements are necessary to overcome limitations related to cultural context, multilingual support, Braille reading comprehension, assistive object recognition, and hallucinations. This work provides critical insights into the future direction of multimodal AI for accessibility, underscoring the need for more inclusive, robust, and trustworthy visual assistance technologies.
Related papers
- Exploring Multimodal Prompt for Visualization Authoring with Large Language Models [12.43647167483504]
We study how large language models (LLMs) interpret ambiguous or incomplete text prompts in the context of visualization authoring.
We introduce visual prompts as a complementary input modality to text prompts, which help clarify user intent.
We design VisPilot, which enables users to easily create visualizations using multimodal prompts, including text, sketches, and direct manipulations.
arXiv Detail & Related papers (2025-04-18T14:00:55Z) - Personalized Multimodal Large Language Models: A Survey [127.9521218125761]
Multimodal Large Language Models (MLLMs) have become increasingly important due to their state-of-the-art performance and ability to integrate multiple data modalities.
This paper presents a comprehensive survey on personalized multimodal large language models, focusing on their architecture, training methods, and applications.
arXiv Detail & Related papers (2024-12-03T03:59:03Z) - Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey [46.617998833238126]
Large language models (LLMs) and computer vision (CV) systems are driving advancements in natural language understanding and visual processing. The convergence of these technologies has catalyzed the rise of multimodal AI, enabling richer, cross-modal understanding that spans text, vision, audio, and video modalities. Multimodal large language models (MLLMs) have emerged as a powerful framework, demonstrating impressive capabilities in tasks like image-text generation, visual question answering, and cross-modal retrieval. Despite these advancements, the complexity and scale of MLLMs introduce significant challenges in interpretability and explainability.
arXiv Detail & Related papers (2024-12-03T02:54:31Z) - Visual Error Patterns in Multi-Modal AI: A Statistical Approach [0.0]
Multi-modal large language models (MLLMs) excel at integrating text and visual data but face systematic challenges when interpreting ambiguous or incomplete visual stimuli. This study leverages statistical modeling to analyze the factors driving these errors, using a dataset of geometric stimuli characterized by features like 3D structure, rotation, and missing face/side.
arXiv Detail & Related papers (2024-11-27T01:20:08Z) - LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models [60.67899965748755]
We present LLaVA-Read, a multimodal large language model that utilizes dual visual encoders along with a visual text encoder.
Our research suggests visual text understanding remains an open challenge and an efficient visual text encoder is crucial for future successful multimodal systems.
arXiv Detail & Related papers (2024-07-27T05:53:37Z) - MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception [24.406224705072763]
The Mutually Reinforced Multimodal Large Language Model (MR-MLLM) is a novel framework that enhances visual perception and multimodal comprehension.
First, a shared query fusion mechanism is proposed to harmonize detailed visual inputs from vision models with the linguistic depth of language models.
Second, we propose the perception-enhanced cross-modal integration method, incorporating novel modalities from vision perception outputs.
arXiv Detail & Related papers (2024-06-22T07:10:36Z) - Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings.
We identify ''CLIP-blind pairs'' - images that CLIP perceives as similar despite their clear visual differences.
We evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs.
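The "CLIP-blind pair" idea above can be illustrated with a minimal sketch: given image embeddings from a CLIP-style encoder, pairs whose cosine similarity exceeds a threshold are flagged as candidates, since near-identical embeddings for visually distinct images indicate the encoder is "blind" to the difference. The toy embeddings, threshold value, and function names below are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_blind_pair_candidates(embeddings: np.ndarray, threshold: float = 0.95):
    """Return index pairs whose embeddings are near-identical.

    In the paper's setting the embeddings would come from a CLIP image
    encoder; a flagged pair becomes a 'CLIP-blind pair' when the two
    images are nonetheless clearly different to a human viewer.
    """
    pairs = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Toy embeddings: vectors 0 and 1 are nearly parallel, vector 2 is orthogonal.
emb = np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],
    [0.0, 1.0, 0.0],
])
print(find_blind_pair_candidates(emb, threshold=0.95))  # [(0, 1)]
```

In practice such candidates are then inspected (or filtered with a second, stronger vision encoder) to confirm that the images really do differ, which is the step that turns a high-similarity pair into evidence of a visual shortcoming.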
arXiv Detail & Related papers (2024-01-11T18:58:36Z) - Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z) - Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z) - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.