TEXT2TASTE: A Versatile Egocentric Vision System for Intelligent Reading Assistance Using Large Language Model
- URL: http://arxiv.org/abs/2404.09254v1
- Date: Sun, 14 Apr 2024 13:39:02 GMT
- Title: TEXT2TASTE: A Versatile Egocentric Vision System for Intelligent Reading Assistance Using Large Language Model
- Authors: Wiktor Mucha, Florin Cuconasu, Naome A. Etori, Valia Kalokyri, Giovanni Trappolini,
- Abstract summary: We propose an intelligent reading assistant based on smart glasses with embedded RGB cameras and a Large Language Model (LLM).
The video recorded from the egocentric perspective of a person wearing the glasses is processed to localise text information using object detection and optical character recognition methods.
The LLM processes the data, allows the user to interact with the text, and responds to a given query, thus extending the functionality of corrective lenses.
- Score: 2.2469442203227863
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The ability to read, understand and find important information from written text is a critical skill in our daily lives for our independence, comfort and safety. However, a significant part of our society is affected by partial vision impairment, which leads to discomfort and dependency in daily activities. To address the limitations faced by this part of society, we propose an intelligent reading assistant based on smart glasses with embedded RGB cameras and a Large Language Model (LLM), whose functionality goes beyond corrective lenses. The video recorded from the egocentric perspective of a person wearing the glasses is processed to localise text information using object detection and optical character recognition methods. The LLM processes the data, allows the user to interact with the text, and responds to a given query, thus extending the functionality of corrective lenses with the ability to find and summarize knowledge from the text. To evaluate our method, we create a chat-based application that allows the user to interact with the system. The evaluation is conducted in a real-world setting, such as reading menus in a restaurant, and involves four participants. The results show robust accuracy in text retrieval. The system not only provides accurate meal suggestions but also achieves high user satisfaction, highlighting the potential of smart glasses and LLMs in assisting people with special needs.
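As a rough, non-authoritative illustration of the pipeline the abstract describes (egocentric frames → text localisation → OCR → LLM query), the Python sketch below wires these stages together. The detector stub, the use of Tesseract for OCR, the `gpt-4o-mini` model name, and the OpenAI client are assumptions made for illustration only; they are not the components used by the TEXT2TASTE authors.

```python
# Hedged sketch of a detect -> OCR -> LLM reading-assistant loop.
# detect_text_regions() is a hypothetical stub; Tesseract and the OpenAI chat
# client are stand-ins, not the paper's actual detector, OCR engine, or LLM.
import cv2
import pytesseract
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def detect_text_regions(frame):
    """Return (x, y, w, h) boxes for text-bearing regions.

    Placeholder: treats the whole frame as one region; a scene-text or
    object-detection model would normally go here.
    """
    h, w = frame.shape[:2]
    return [(0, 0, w, h)]


def recognise_text(frame):
    """Run OCR on every detected region and join the results."""
    chunks = []
    for x, y, bw, bh in detect_text_regions(frame):
        crop = frame[y:y + bh, x:x + bw]
        chunks.append(pytesseract.image_to_string(crop))
    return "\n".join(chunks).strip()


def answer_query(recognised_text, user_query):
    """Send the recognised text plus the user's question to a chat LLM."""
    prompt = (
        "You are a reading assistant for a smart-glasses user.\n"
        f"Recognised text from the camera:\n{recognised_text}\n\n"
        f"User question: {user_query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge/assistant model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    frame = cv2.imread("menu_frame.jpg")  # one frame from the egocentric video
    text = recognise_text(frame)
    print(answer_query(text, "Which dishes on this menu are vegetarian?"))
```

In the paper's setting, this loop would run over video frames from the glasses and surface the answer through the chat-based application; the stand-ins above would be swapped for the authors' own detector, OCR engine, and LLM.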
Related papers
- AIris: An AI-powered Wearable Assistive Device for the Visually Impaired [0.0]
We introduce AIris, an AI-powered wearable device that provides environmental awareness and interaction capabilities to visually impaired users.
We have created a functional prototype system that operates effectively in real-world conditions.
arXiv Detail & Related papers (2024-05-13T10:09:37Z)
- Interactive Analysis of LLMs using Meaningful Counterfactuals [22.755345889167934]
Counterfactual examples are useful for exploring the decision boundaries of machine learning models.
How can we apply counterfactual-based methods to analyze and explain LLMs?
We propose a novel algorithm for generating batches of complete and meaningful textual counterfactuals.
In our experiments, 97.2% of the counterfactuals are grammatically correct.
arXiv Detail & Related papers (2024-04-23T19:57:03Z)
- Analyzing the Roles of Language and Vision in Learning from Limited Data [31.895396236504993]
We study the contributions that language and vision make to learning about the world.
We find that a language model leveraging all components recovers a majority of a Vision-Language Model's performance.
arXiv Detail & Related papers (2024-02-15T22:19:41Z)
- Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection [51.66174565170112]
We introduce a novel approach to utilize the strengths of large language models in understanding contextual appearance variations.
We propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection.
arXiv Detail & Related papers (2023-11-02T06:38:19Z)
- TouchStone: Evaluating Vision-Language Models by Language Models [91.69776377214814]
We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs.
We construct TouchStone, a comprehensive visual dialogue dataset consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks.
We demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone (a minimal, generic sketch of this judge setup follows this list).
arXiv Detail & Related papers (2023-08-31T17:52:04Z)
- Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
We study whether integrating visual knowledge into a language model can fill the gap.
Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
arXiv Detail & Related papers (2022-03-14T22:02:40Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- VisBuddy -- A Smart Wearable Assistant for the Visually Challenged [0.0]
VisBuddy is a voice-based assistant, where the user can give voice commands to perform specific tasks.
It uses image captioning to describe the user's surroundings, optical character recognition (OCR) to read text in the user's view, object detection to search for and locate objects in a room, and web scraping to give the user the latest news.
arXiv Detail & Related papers (2021-08-17T17:15:23Z)
- Readability Research: An Interdisciplinary Approach [62.03595526230364]
We aim to provide a firm foundation and a comprehensive framework for readability research.
Readability refers to aspects of visual information design that impact information flow from the page to the reader.
These aspects can be modified on demand, instantly improving the ease with which a reader can process and derive meaning from text.
arXiv Detail & Related papers (2021-07-20T16:52:17Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich the visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of an external language model (ELM) and integrate its abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
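The TouchStone entry above rests on using a text-only LLM as a judge. The sketch below is a minimal, hedged illustration of that general idea, not the TouchStone protocol itself: the prompt wording, the 1-5 scale, the caption-in-place-of-image shortcut, and the OpenAI client and model name are all illustrative assumptions.

```python
# Minimal LLM-as-judge sketch (illustrative only; not the TouchStone pipeline).
# Assumption: the image is replaced by a human-written description, and a
# text-only chat model rates the candidate answer on a 1-5 scale.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_answer(image_description: str, question: str, model_answer: str) -> str:
    """Ask a text-only LLM to rate how well `model_answer` addresses `question`
    about the image described by `image_description`."""
    prompt = (
        "You are judging a vision-language model.\n"
        f"Image description: {image_description}\n"
        f"Question: {question}\n"
        f"Model answer: {model_answer}\n"
        "Rate the answer from 1 (poor) to 5 (excellent) and briefly justify."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    print(judge_answer(
        "A restaurant menu listing pasta dishes and salads.",
        "Which dishes are vegetarian?",
        "The garden salad and the margherita pasta are vegetarian.",
    ))
```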
This list is automatically generated from the titles and abstracts of the papers on this site.