Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
- URL: http://arxiv.org/abs/2412.21080v1
- Date: Mon, 30 Dec 2024 16:57:05 GMT
- Title: Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
- Authors: Yifei Huang, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, Lijin Yang, Xinyuan Chen, Yaohui Wang, Zheng Nie, Jinyao Liu, Guoshun Fan, Dechen Lin, Fang Fang, Kunpeng Li, Chang Yuan, Yali Wang, Yu Qiao, Limin Wang
- Abstract summary: We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model. Vinci operates in an "always on" mode, continuously observing the environment to deliver seamless interaction and assistance. We release the complete implementation for on-device development, together with a demo web platform for testing uploaded videos.
- Score: 49.90916095152366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model. Designed for deployment on portable devices such as smartphones and wearable cameras, Vinci operates in an "always on" mode, continuously observing the environment to deliver seamless interaction and assistance. Users can wake up the system and engage in natural conversations to ask questions or seek assistance, with responses delivered through audio for hands-free convenience. With its ability to process long video streams in real time, Vinci can answer user queries about current observations and historical context while also providing task planning based on past interactions. To further enhance usability, Vinci integrates a video generation module that creates step-by-step visual demonstrations for tasks that require detailed guidance. We hope that Vinci can establish a robust framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. We release the complete implementation for on-device development, together with a demo web platform for testing uploaded videos, at https://github.com/OpenGVLab/vinci.
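The abstract describes an always-on pipeline: continuously buffering egocentric frames, waking on a voice query, and answering from both current and historical observations before responding by audio. The minimal sketch below illustrates that loop; the class and method names (FrameBuffer, vlm.answer, the stt/tts interfaces) are illustrative assumptions, not the released Vinci API.

```python
# Minimal sketch of an "always on" egocentric assistant loop.
# All interfaces below (camera, vlm, stt, tts) are hypothetical placeholders.
import collections
import time

class FrameBuffer:
    """Rolling buffer of recent egocentric frames for current and historical context."""
    def __init__(self, max_frames=3000):          # e.g. ~5 minutes at 10 fps
        self.frames = collections.deque(maxlen=max_frames)

    def add(self, frame, timestamp):
        self.frames.append((timestamp, frame))

    def recent(self, seconds=60.0):
        cutoff = time.time() - seconds
        return [f for t, f in self.frames if t >= cutoff]

def assistant_loop(camera, vlm, stt, tts):
    """Continuously observe the scene; answer spoken queries over buffered frames."""
    buffer = FrameBuffer()
    while True:
        frame = camera.read()                      # latest egocentric frame
        buffer.add(frame, time.time())
        if stt.wake_word_detected():               # user wakes the system by voice
            question = stt.transcribe_query()      # speech -> text query
            context = buffer.recent(seconds=60.0)  # sample recent observations
            answer = vlm.answer(frames=context, question=question)
            tts.speak(answer)                      # hands-free audio response
```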
Related papers
- A Human Digital Twin Architecture for Knowledge-based Interactions and Context-Aware Conversations [0.9580312063277943]
Recent developments in Artificial Intelligence (AI) and Machine Learning (ML) are creating new opportunities for Human-Autonomy Teaming (HAT).
We present a real-time Human Digital Twin (HDT) architecture that integrates Large Language Models (LLMs) for knowledge reporting, answering, and recommendation, embodied in a visual interface.
The HDT acts as a visually and behaviorally realistic team member, integrated throughout the mission lifecycle, from training to deployment to after-action review.
arXiv Detail & Related papers (2025-04-04T03:56:26Z)
- An Egocentric Vision-Language Model based Portable Real-time Smart Assistant [50.324455115241186]
We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices.
At its core, Vinci leverages EgoVideo-VL, a novel model that integrates an egocentric vision foundation model with a large language model.
Vinci is hardware-agnostic, supporting deployment across a wide range of devices, including smartphones and wearable cameras.
arXiv Detail & Related papers (2025-03-06T09:33:46Z)
- EgoLife: Towards Egocentric Life Assistant [60.51196061794498]
We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses.
We conduct a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references.
This effort resulted in the EgoLife dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation.
We introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks.
arXiv Detail & Related papers (2025-03-05T18:54:16Z)
- AI-based Wearable Vision Assistance System for the Visually Impaired: Integrating Real-Time Object Recognition and Contextual Understanding Using Large Vision-Language Models [0.0]
This paper introduces a novel wearable vision assistance system with artificial intelligence (AI) technology to deliver real-time feedback to a user through a sound beep mechanism. The system provides detailed descriptions of objects in the user's environment using a large vision language model (LVLM).
arXiv Detail & Related papers (2024-12-28T07:26:39Z)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents. Our approach leverages image-based observations and grounds natural-language instructions to visual elements. To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
- AIris: An AI-powered Wearable Assistive Device for the Visually Impaired [0.0]
We introduce AIris, an AI-powered wearable device that provides environmental awareness and interaction capabilities to visually impaired users.
We have created a functional prototype system that operates effectively in real-world conditions.
arXiv Detail & Related papers (2024-05-13T10:09:37Z)
- Training a Vision Language Model as Smartphone Assistant [1.3654846342364308]
We present a visual language model (VLM) that can fulfill diverse tasks on mobile devices.
Our model functions by interacting solely with the user interface (UI).
Unlike previous methods, our model operates not only on a single screen image but on vision-language sentences created from sequences of past screenshots.
arXiv Detail & Related papers (2024-04-12T18:28:44Z)
- DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.
Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.
We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
- Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning [62.83590925557013]
We learn a set of challenging partially-observed manipulation tasks from visual and audio inputs.
Our proposed system learns these tasks by combining offline imitation learning from tele-operated demonstrations and online finetuning.
In a set of simulated tasks, we find that our system benefits from using audio, and that by using online interventions we are able to improve the success rate of offline imitation learning by 20%.
arXiv Detail & Related papers (2022-05-30T04:52:58Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
- AEGIS: A real-time multimodal augmented reality computer vision based system to assist facial expression recognition for individuals with autism spectrum disorder [93.0013343535411]
This paper presents the development of a multimodal augmented reality (AR) system which combines the use of computer vision and deep convolutional neural networks (CNNs).
The proposed system, which we call AEGIS, is an assistive technology deployable on a variety of user devices including tablets, smartphones, video conference systems, or smartglasses.
We leverage both spatial and temporal information in order to provide an accurate expression prediction, which is then converted into its corresponding visualization and drawn on top of the original video frame.
arXiv Detail & Related papers (2020-10-22T17:20:38Z)
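As a small illustration of the overlay step described in the AEGIS summary above (a predicted expression converted into a visualization drawn on the original video frame), the sketch below draws a label onto a frame with OpenCV; the function name, label format, and bounding-box convention are assumptions, not the AEGIS implementation.

```python
import cv2  # OpenCV for drawing on video frames

def overlay_expression(frame, face_box, label, confidence):
    """Draw a predicted facial-expression label above the detected face region."""
    x, y, w, h = face_box                                        # (x, y, width, height)
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)  # face bounding box
    text = "{} ({:.0f}%)".format(label, 100 * confidence)         # e.g. "happy (92%)"
    cv2.putText(frame, text, (x, max(0, y - 10)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    return frame
```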