A Large Vision-Language Model based Environment Perception System for Visually Impaired People
- URL: http://arxiv.org/abs/2504.18027v1
- Date: Fri, 25 Apr 2025 02:46:22 GMT
- Title: A Large Vision-Language Model based Environment Perception System for Visually Impaired People
- Authors: Zezhou Chen, Zhaoxiang Liu, Kai Wang, Kohou Wang, Shiguo Lian
- Abstract summary: This paper introduces a Large Vision-Language Model (LVLM) based environment perception system. The system helps visually impaired people to perceive the surrounding environment effectively.
- Score: 3.787034006536037
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: It is a challenging task for visually impaired people to perceive their surrounding environment due to the complexity of natural scenes, and their personal and social activities are thus highly limited. This paper introduces a Large Vision-Language Model (LVLM) based environment perception system that helps them better understand their surroundings: the current scene is captured with a wearable device, and the analysis results are retrieved through the same device. Visually impaired users can acquire a global description of the scene by long-pressing the screen to activate the LVLM output, retrieve the categories of the objects in the scene produced by a segmentation model by tapping or swiping the screen, and get a detailed description of an object of interest by double-tapping the screen. To help visually impaired people perceive the world more accurately, this paper proposes incorporating the segmentation result of the RGB image as external knowledge into the input of the LVLM to reduce the LVLM's hallucination. Technical experiments on POPE, MME and LLaVA-QA90 show that the system provides a more accurate description of the scene than Qwen-VL-Chat, and exploratory experiments show that the system helps visually impaired people perceive their surrounding environment effectively.
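The abstract describes two concrete mechanisms: injecting segmentation labels into the LVLM prompt as external knowledge, and mapping screen gestures to different levels of description. The sketch below is an illustrative reconstruction under those assumptions, not the authors' released code; names such as `SceneAnalysis`, `build_lvlm_prompt`, and `handle_gesture` are hypothetical placeholders.

```python
# Minimal sketch of the described pipeline: segmentation labels are folded into the
# LVLM prompt as external knowledge, and gestures select which output the user hears.
from dataclasses import dataclass
from typing import List


@dataclass
class SceneAnalysis:
    object_labels: List[str]   # object categories from the segmentation model
    global_description: str    # LVLM description of the whole scene


def build_lvlm_prompt(object_labels: List[str]) -> str:
    """Fold segmentation output into the LVLM input as external knowledge,
    which the paper reports reduces hallucinated objects."""
    knowledge = ", ".join(sorted(set(object_labels)))
    return (
        "External knowledge: the image contains the following objects: "
        f"{knowledge}. Describe the scene for a visually impaired user, "
        "mentioning only objects that are actually present."
    )


def handle_gesture(gesture: str, analysis: SceneAnalysis, focus_index: int = 0) -> str:
    """Map the gestures named in the abstract to spoken outputs."""
    if gesture == "long_press":        # global scene description from the LVLM
        return analysis.global_description
    if not analysis.object_labels:
        return "No objects detected in the current scene."
    label = analysis.object_labels[focus_index % len(analysis.object_labels)]
    if gesture in ("tap", "swipe"):    # cycle through segmented object categories
        return label
    if gesture == "double_tap":        # detailed description of the focused object
        return f"Request a detailed LVLM description of the {label}."
    return "Unknown gesture."
```

For example, `handle_gesture("tap", SceneAnalysis(["chair", "table"], "An office room."), 1)` would return "table", while a long press returns the global LVLM description built from `build_lvlm_prompt`.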
Related papers
- Influence of field of view in visual prostheses design: Analysis with a VR system [3.9998518782208783]
We evaluate the influence of field of view with respect to spatial resolution in visual prostheses. Twenty-four normally sighted participants were asked to find and recognize usual objects. Results show that the accuracy and response time decrease when the field of view is increased.
arXiv Detail & Related papers (2025-01-28T22:25:22Z) - AI-based Wearable Vision Assistance System for the Visually Impaired: Integrating Real-Time Object Recognition and Contextual Understanding Using Large Vision-Language Models [0.0]
This paper introduces a novel wearable vision assistance system with artificial intelligence (AI) technology to deliver real-time feedback to a user through a sound beep mechanism.
The system provides detailed descriptions of objects in the user's environment using a large vision-language model (LVLM).
arXiv Detail & Related papers (2024-12-28T07:26:39Z) - Mitigating Object Hallucination via Concentric Causal Attention [71.27325347912823]
We show that object hallucination is closely tied with Rotary Position Encoding (RoPE), a widely adopted positional dependency modeling design.
We propose Concentric Causal Attention (CCA), a simple yet effective positional alignment strategy.
arXiv Detail & Related papers (2024-10-21T11:54:53Z) - Learning Object-Centric Representation via Reverse Hierarchy Guidance [73.05170419085796]
Object-Centric Learning (OCL) seeks to enable Neural Networks to identify individual objects in visual scenes.
RHGNet introduces a top-down pathway that works in different ways in the training and inference processes.
Our model achieves SOTA performance on several commonly used datasets.
arXiv Detail & Related papers (2024-05-17T07:48:27Z) - VisionGPT: LLM-Assisted Real-Time Anomaly Detection for Safe Visual Navigation [3.837186701755568]
This paper explores the potential of Large Language Models in zero-shot anomaly detection for safe visual navigation.
The proposed framework can identify anomalies within camera-captured frames that include any possible obstacles, then generate concise, audio-delivered descriptions emphasizing abnormalities.
arXiv Detail & Related papers (2024-03-19T03:55:39Z) - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes. Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes. We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z) - Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs).
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
arXiv Detail & Related papers (2023-12-04T03:18:51Z) - A Multi-Modal Foundation Model to Assist People with Blindness and Low Vision in Environmental Interaction [25.6637754177118]
People with blindness and low vision (pBLV) encounter substantial challenges when it comes to comprehensive scene recognition and precise object identification.
We present a pioneering approach that leverages a large vision-language model to enhance visual perception for pBLV.
arXiv Detail & Related papers (2023-10-31T06:56:51Z) - VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z) - Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs).
We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions.
We propose a polling-based query method called POPE to evaluate object hallucination.
arXiv Detail & Related papers (2023-05-17T16:34:01Z) - Semantic and structural image segmentation for prosthetic vision [2.048226951354646]
The ability of object recognition and scene understanding in real environments is severely restricted for prosthetic users. We present a new approach to build a schematic representation of indoor environments for phosphene images. The proposed method combines a variety of convolutional neural networks for extracting and conveying relevant information.
arXiv Detail & Related papers (2018-09-25T17:38:16Z)