Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
- URL: http://arxiv.org/abs/2412.21080v1
- Date: Mon, 30 Dec 2024 16:57:05 GMT
- Title: Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
- Authors: Yifei Huang, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, Lijin Yang, Xinyuan Chen, Yaohui Wang, Zheng Nie, Jinyao Liu, Guoshun Fan, Dechen Lin, Fang Fang, Kunpeng Li, Chang Yuan, Yali Wang, Yu Qiao, Limin Wang
- Abstract summary: We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model.
Vinci operates in an "always on" mode, continuously observing the environment to deliver seamless interaction and assistance.
We release the complete implementation for developing the device, together with a demo web platform for testing uploaded videos.
- Abstract: We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model. Designed for deployment on portable devices such as smartphones and wearable cameras, Vinci operates in an "always on" mode, continuously observing the environment to deliver seamless interaction and assistance. Users can wake up the system and engage in natural conversations to ask questions or seek assistance, with responses delivered through audio for hands-free convenience. With its ability to process long video streams in real-time, Vinci can answer user queries about current observations and historical context while also providing task planning based on past interactions. To further enhance usability, Vinci integrates a video generation module that creates step-by-step visual demonstrations for tasks that require detailed guidance. We hope that Vinci can establish a robust framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. We release the complete implementation for the development of the device in conjunction with a demo web platform to test uploaded videos at https://github.com/OpenGVLab/vinci.
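The abstract outlines an always-on loop: buffer the egocentric video stream, wake on a user query, answer from current and historical observations, and reply via audio. The sketch below illustrates that flow in Python; it is not the released Vinci code, and every name in it (Frame, EgocentricVLM, capture_frame, wake_word_detected, text_to_speech) is a hypothetical placeholder.

```python
# Minimal sketch of an "always on" egocentric assistant loop.
# All names are hypothetical placeholders, not the authors' released implementation.
from collections import deque
from dataclasses import dataclass
import time


@dataclass
class Frame:
    timestamp: float
    image: bytes  # encoded egocentric camera frame (stubbed here)


class EgocentricVLM:
    """Stand-in for a vision-language model that reasons over long video streams."""

    def answer(self, frames: list[Frame], question: str) -> str:
        # A real system would encode the frames and question and decode a response.
        return f"(answer to '{question}' using {len(frames)} buffered frames)"


def capture_frame() -> Frame:
    return Frame(timestamp=time.time(), image=b"")  # camera stub


def wake_word_detected() -> str | None:
    return None  # would return the user's spoken query once the system is woken up


def text_to_speech(text: str) -> None:
    print(f"[audio] {text}")  # placeholder for hands-free audio output


def run_assistant(history_seconds: float = 600.0, fps: float = 2.0) -> None:
    memory: deque[Frame] = deque()  # rolling buffer of recent observations
    while True:
        frame = capture_frame()
        memory.append(frame)
        # Drop frames older than the history window to bound memory use.
        while memory and frame.timestamp - memory[0].timestamp > history_seconds:
            memory.popleft()

        query = wake_word_detected()
        if query:
            response = EgocentricVLM().answer(list(memory), query)
            text_to_speech(response)

        time.sleep(1.0 / fps)
```

The rolling buffer bounds memory use while still letting the model answer questions about recent history; the actual system additionally provides task planning and a video generation module for step-by-step visual demonstrations, which this sketch omits.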
Related papers
- AI-based Wearable Vision Assistance System for the Visually Impaired: Integrating Real-Time Object Recognition and Contextual Understanding Using Large Vision-Language Models [0.0]
This paper introduces a novel wearable vision assistance system with artificial intelligence (AI) technology to deliver real-time feedback to a user through a sound beep mechanism.
The system provides detailed descriptions of objects in the user's environment using a large vision language model (LVLM).
arXiv Detail & Related papers (2024-12-28T07:26:39Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents.
Our approach leverages image-based observations and grounds natural-language instructions in visual elements.
To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
arXiv Detail & Related papers (2024-12-05T18:58:26Z) - Turn-by-Turn Indoor Navigation for the Visually Impaired [0.0]
Navigating indoor environments presents significant challenges for visually impaired individuals.
This paper introduces a novel system that provides turn-by-turn navigation inside buildings using only a smartphone equipped with a camera.
Preliminary evaluations demonstrate the system's effectiveness in accurately guiding users through complex indoor spaces.
arXiv Detail & Related papers (2024-10-25T20:16:38Z) - Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs [95.8010627763483]
Mobility VLA is a hierarchical Vision-Language-Action (VLA) navigation policy that combines the environment understanding and common-sense reasoning power of long-context VLMs with a navigation policy based on topological graphs.
We show that Mobility VLA achieves high end-to-end success rates on previously unsolved multimodal instructions.
arXiv Detail & Related papers (2024-07-10T15:49:07Z) - AIris: An AI-powered Wearable Assistive Device for the Visually Impaired [0.0]
We introduce AIris, an AI-powered wearable device that provides environmental awareness and interaction capabilities to visually impaired users.
We have created a functional prototype system that operates effectively in real-world conditions.
arXiv Detail & Related papers (2024-05-13T10:09:37Z) - Training a Vision Language Model as Smartphone Assistant [1.3654846342364308]
We present a visual language model (VLM) that can fulfill diverse tasks on mobile devices.
Our model functions by interacting solely with the user interface (UI).
Unlike previous methods, our model operates not only on a single screen image but on vision-language sentences created from sequences of past screenshots.
arXiv Detail & Related papers (2024-04-12T18:28:44Z) - DRISHTI: Visual Navigation Assistant for Visually Impaired [0.0]
Blind and visually impaired (BVI) people face challenges because they require manual assistance to obtain information about their environment.
In this work, we took our first step towards developing an affordable and high-performing eye wearable assistive device, DRISHTI.
arXiv Detail & Related papers (2023-03-13T20:10:44Z) - Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning [62.83590925557013]
We learn a set of challenging partially-observed manipulation tasks from visual and audio inputs.
Our proposed system learns these tasks by combining offline imitation learning from tele-operated demonstrations and online finetuning.
In a set of simulated tasks, we find that our system benefits from using audio, and that by using online interventions we are able to improve the success rate of offline imitation learning by 20%.
arXiv Detail & Related papers (2022-05-30T04:52:58Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z) - AEGIS: A real-time multimodal augmented reality computer vision based system to assist facial expression recognition for individuals with autism spectrum disorder [93.0013343535411]
This paper presents the development of a multimodal augmented reality (AR) system which combines the use of computer vision and deep convolutional neural networks (CNNs).
The proposed system, which we call AEGIS, is an assistive technology deployable on a variety of user devices including tablets, smartphones, video conference systems, or smartglasses.
We leverage both spatial and temporal information in order to provide an accurate expression prediction, which is then converted into its corresponding visualization and drawn on top of the original video frame.
arXiv Detail & Related papers (2020-10-22T17:20:38Z)