"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments
- URL: http://arxiv.org/abs/2505.04488v1
- Date: Wed, 07 May 2025 15:03:16 GMT
- Title: "I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments
- Authors: Ziyi Zhang, Zhen Sun, Zongmin Zhang, Zifan Peng, Yuemeng Zhao, Zichun Wang, Zeren Luo, Ruiting Zuo, Xinlei He
- Abstract summary: The visually impaired population is currently large in scale, and daily activities pose significant challenges for them. Although many studies use large language and vision-language models to assist the blind, most focus on static content and fail to meet real-time perception needs. To provide them with more effective intelligent assistance, it is imperative to incorporate advanced visual understanding technologies.
- Score: 17.702424914454415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The visually impaired population, especially the severely visually impaired, is currently large in scale, and daily activities pose significant challenges for them. Although many studies use large language and vision-language models to assist the blind, most focus on static content and fail to meet real-time perception needs in dynamic and complex environments, such as daily activities. To provide them with more effective intelligent assistance, it is imperative to incorporate advanced visual understanding technologies. Although VideoLLMs with real-time vision and speech interaction demonstrate strong real-time visual understanding, no prior work has systematically evaluated their effectiveness in assisting visually impaired individuals. In this work, we conduct the first such evaluation. First, we construct a benchmark dataset (VisAssistDaily), covering three categories of assistive tasks for visually impaired individuals: Basic Skills, Home Life Tasks, and Social Life Tasks. The results show that GPT-4o achieves the highest task success rate. Next, we conduct a user study to evaluate the models in both closed-world and open-world scenarios, further exploring the practical challenges of applying VideoLLMs in assistive contexts. One key issue we identify is the difficulty current models face in perceiving potential hazards in dynamic environments. To address this, we build an environment-awareness dataset named SafeVid and introduce a polling mechanism that enables the model to proactively detect environmental risks. We hope this work provides valuable insights and inspiration for future research in this field.
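As a rough illustration of how such a polling mechanism could be wired up, the sketch below periodically samples a camera frame, asks a VideoLLM whether the scene contains a hazard, and speaks a warning only when one is reported. The function names (`query_videollm`, `capture_frame`, `speak`), the prompt text, and the polling interval are illustrative assumptions, not the paper's implementation.

```python
import time

# Hypothetical client stub: in practice this would wrap a real-time
# VideoLLM endpoint; the name and signature here are assumptions,
# not the paper's API.
def query_videollm(frame_jpeg: bytes, prompt: str) -> str:
    raise NotImplementedError("replace with an actual VideoLLM call")

HAZARD_PROMPT = (
    "You are assisting a visually impaired user. Describe any potential "
    "hazard in this frame (obstacles, traffic, wet floor). "
    "Reply 'NONE' if the scene looks safe."
)

def poll_for_hazards(capture_frame, speak, interval_s: float = 2.0):
    """Periodically sample the camera and proactively warn about risks."""
    while True:
        frame = capture_frame()                       # grab the current camera frame
        answer = query_videollm(frame, HAZARD_PROMPT)
        if answer.strip().upper() != "NONE":
            speak(answer)                             # proactive spoken alert
        time.sleep(interval_s)                        # polling cadence
```

The polling interval trades responsiveness against API cost and latency; a real system would likely tune it per task and suppress repeated alerts for the same hazard.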
Related papers
- Learning to See and Act: Task-Aware View Planning for Robotic Manipulation [85.65102094981802]
Task-Aware View Planning (TAVP) is a framework designed to integrate active view planning with task-specific representation learning. Our proposed TAVP model achieves superior performance over state-of-the-art fixed-view approaches.
arXiv Detail & Related papers (2025-08-07T09:21:20Z) - Probing the Gaps in ChatGPT Live Video Chat for Real-World Assistance for People who are Blind or Visually Impaired [10.648018999640758]
We present findings from an exploratory study with eight blind or visually impaired (BVI) participants. Our findings indicate that current live video AI effectively provides guidance and answers for static visual scenes but falls short in delivering essential live descriptions required in dynamic situations. We discuss implications for assistive video AI agents, including incorporating additional sensing capabilities for real-world use.
arXiv Detail & Related papers (2025-08-05T16:59:02Z) - VLM4D: Towards Spatiotemporal Awareness in Vision Language Models [66.833085504228]
We introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal awareness of vision language models (VLMs). Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs. We identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models.
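For context, evaluating a model on curated video question-answer pairs like these typically reduces to an accuracy loop of the kind sketched below; the `VideoQA` record and the `ask_model` callable are assumptions for illustration, not VLM4D's actual evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class VideoQA:
    video_path: str   # clip the question refers to
    question: str
    answer: str       # ground-truth short answer

def evaluate(samples: Iterable[VideoQA],
             ask_model: Callable[[str, str], str]) -> float:
    """Exact-match accuracy of a model over video question-answer pairs.

    `ask_model(video_path, question)` is a placeholder for whatever
    inference call the benchmarked VLM exposes.
    """
    total, correct = 0, 0
    for s in samples:
        prediction = ask_model(s.video_path, s.question)
        correct += prediction.strip().lower() == s.answer.strip().lower()
        total += 1
    return correct / max(total, 1)
```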
arXiv Detail & Related papers (2025-08-04T06:06:06Z) - A Light and Smart Wearable Platform with Multimodal Foundation Model for Enhanced Spatial Reasoning in People with Blindness and Low Vision [9.057330310306696]
People with blindness and low vision (pBLV) face significant challenges, struggling to navigate environments and locate objects due to limited visual cues. Current multi-modal large language models (MLLMs) for low-vision users lack the spatial reasoning capabilities needed to effectively assist in these tasks. We propose a novel spatially enhanced, multi-modal large language model-based approach for visually impaired individuals.
arXiv Detail & Related papers (2025-05-16T05:32:25Z) - V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations [1.7971686967440696]
V$^2$R-Bench is a benchmark framework for evaluating the Visual Variation Robustness of LVLMs. We show that advanced models that excel at complex vision-language tasks significantly underperform on simple tasks such as object recognition. These vulnerabilities stem from error accumulation in the pipeline architecture and inadequate multimodal alignment.
arXiv Detail & Related papers (2025-04-23T14:01:32Z) - Visual Language Models show widespread visual deficits on neuropsychological tests [0.0]
We use the toolkit of neuropsychology to assess the capabilities of three state-of-the-art Visual Language Models (VLMs). We find widespread deficits in low- and mid-level visual abilities that would be considered clinically significant in humans. These selective deficits, profiled through validated test batteries, suggest that an artificial system can achieve complex object recognition without developing foundational visual concepts that in humans require no explicit training.
arXiv Detail & Related papers (2025-04-15T01:04:56Z) - Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy [53.07517728420411]
We introduce the first instruction database specifically focused on hallucinations in low-level vision tasks. We propose the Self-Awareness Failure Elimination (SAFEQA) model to improve the perception and comprehension abilities of the model in low-level vision tasks. We conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations.
arXiv Detail & Related papers (2025-03-26T16:05:01Z) - Evaluating the Effectiveness of Video Anomaly Detection in the Wild: Online Learning and Inference for Real-world Deployment [2.1374208474242815]
Video Anomaly Detection (VAD) identifies unusual activities in video streams, a key technology with broad applications ranging from surveillance to healthcare.
Tackling VAD in real-life settings poses significant challenges due to the dynamic nature of human actions, environmental variations, and domain shifts.
Online learning is a potential strategy to mitigate this issue by allowing models to adapt to new information continuously.
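A minimal, purely illustrative sketch of that online-adaptation idea: keep running statistics of frame features, flag frames that deviate strongly, and fold every new observation back into the statistics so the detector tracks gradual domain shift. This is not the paper's method; the feature extraction, learning rate, and threshold are assumed.

```python
import numpy as np

class OnlineAnomalyScorer:
    """Toy online detector over per-frame feature vectors.

    Flags frames far from the running feature distribution, then updates
    the distribution so the model keeps adapting to new conditions.
    """

    def __init__(self, dim: int, lr: float = 0.01, threshold: float = 3.0):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.lr = lr
        self.threshold = threshold

    def update_and_score(self, feat: np.ndarray) -> bool:
        # z-score-style distance from the running distribution
        z = np.abs(feat - self.mean) / np.sqrt(self.var + 1e-8)
        is_anomaly = bool(z.mean() > self.threshold)
        # online update via exponential moving statistics
        self.mean += self.lr * (feat - self.mean)
        self.var += self.lr * ((feat - self.mean) ** 2 - self.var)
        return is_anomaly
```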
arXiv Detail & Related papers (2024-04-29T14:47:32Z) - Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans.
This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations.
In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z) - Effectiveness Assessment of Recent Large Vision-Language Models [78.69439393646554]
This paper endeavors to evaluate the competency of popular large vision-language models (LVLMs) in specialized and general tasks.
We employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial.
We examine the performance of three recent open-source LVLMs, including MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization in these tasks.
arXiv Detail & Related papers (2024-03-07T08:25:27Z) - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes. Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes. We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
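To make the "symbolic memory" idea concrete, the sketch below stores per-segment, task-related attributes as rows that a planner or LLM could later query with simple predicates. The schema and field names are assumptions for illustration, not DoraemonGPT's actual design.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryRow:
    timestamp: float   # seconds into the video
    objects: list      # detected object labels in this segment
    actions: list      # recognized actions/events
    caption: str       # short natural-language description

@dataclass
class SymbolicMemory:
    rows: list = field(default_factory=list)

    def add(self, row: MemoryRow) -> None:
        self.rows.append(row)

    def query(self, predicate) -> list:
        """Return rows matching a symbolic predicate, e.g. object presence."""
        return [r for r in self.rows if predicate(r)]

# Example query: segments in which a 'cup' was visible
# memory.query(lambda r: "cup" in r.objects)
```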
arXiv Detail & Related papers (2024-01-16T14:33:09Z) - Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z) - Continual Visual Reinforcement Learning with A Life-Long World Model [55.05017177980985]
We present a new continual learning approach for visual dynamics modeling. We first introduce the life-long world model, which learns task-specific latent dynamics. Then, we address the value estimation challenge for previous tasks with the exploratory-conservative behavior learning approach.
arXiv Detail & Related papers (2023-03-12T05:08:03Z) - Visual Adversarial Imitation Learning using Variational Models [60.69745540036375]
Reward function specification remains a major impediment for learning behaviors through deep reinforcement learning.
Visual demonstrations of desired behaviors often present an easier and more natural way to teach agents.
We develop a variational model-based adversarial imitation learning algorithm.
arXiv Detail & Related papers (2021-07-16T00:15:18Z)