An Egocentric Vision-Language Model based Portable Real-time Smart Assistant
- URL: http://arxiv.org/abs/2503.04250v1
- Date: Thu, 06 Mar 2025 09:33:46 GMT
- Title: An Egocentric Vision-Language Model based Portable Real-time Smart Assistant
- Authors: Yifei Huang, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, Mingfang Zhang, Lijin Yang, Zheng Nie, Jinyao Liu, Guoshun Fan, Dechen Lin, Fang Fang, Kunpeng Li, Chang Yuan, Xinyuan Chen, Yaohui Wang, Yali Wang, Yu Qiao, Limin Wang
- Abstract summary: We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices. At its core, Vinci leverages EgoVideo-VL, a novel model that integrates an egocentric vision foundation model with a large language model. Vinci is hardware-agnostic, supporting deployment across a wide range of devices, including smartphones and wearable cameras.
- Score: 50.324455115241186
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices. At its core, Vinci leverages EgoVideo-VL, a novel model that integrates an egocentric vision foundation model with a large language model (LLM), enabling advanced functionalities such as scene understanding, temporal grounding, video summarization, and future planning. To enhance its utility, Vinci incorporates a memory module for processing long video streams in real time while retaining contextual history, a generation module for producing visual action demonstrations, and a retrieval module that bridges egocentric and third-person perspectives to provide relevant how-to videos for skill acquisition. Unlike existing systems that often depend on specialized hardware, Vinci is hardware-agnostic, supporting deployment across a wide range of devices, including smartphones and wearable cameras. In our experiments, we first demonstrate the superior performance of EgoVideo-VL on multiple public benchmarks, showcasing its vision-language reasoning and contextual understanding capabilities. We then conduct a series of user studies to evaluate the real-world effectiveness of Vinci, highlighting its adaptability and usability in diverse scenarios. We hope Vinci can establish a new framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. All code for Vinci, including the frontend, backend, and models, is available at https://github.com/OpenGVLab/vinci.
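The abstract outlines an architecture built around an EgoVideo-VL core plus memory, generation, and retrieval modules. The Python sketch below is a minimal illustration of how such a streaming loop could be wired; every class and method name (EgoVideoVLStub, VinciPipelineSketch, describe, answer, step) is a hypothetical placeholder rather than the actual Vinci API, which is available in the linked repository.

```python
# A minimal sketch of the pipeline described in the abstract. All names here
# are hypothetical stand-ins, not the real Vinci API; see
# https://github.com/OpenGVLab/vinci for the actual implementation.
from collections import deque


class EgoVideoVLStub:
    """Placeholder for the EgoVideo-VL core (egocentric vision model + LLM)."""

    def describe(self, frame) -> str:
        return f"caption for frame {frame}"

    def answer(self, query: str, context: str) -> str:
        return f"answer to '{query}' given {len(context.split())} context tokens"


class VinciPipelineSketch:
    """Illustrative loop: a bounded memory for long video streams, with the
    core model answering queries against the accumulated context. The
    generation and retrieval modules from the abstract would plug in here."""

    def __init__(self, core: EgoVideoVLStub, max_history: int = 256):
        self.core = core
        # Bounded history so a long video stream can be processed in real
        # time while still retaining recent contextual history.
        self.memory = deque(maxlen=max_history)

    def step(self, frame, user_query: str) -> str:
        self.memory.append(self.core.describe(frame))  # update memory
        context = " ".join(self.memory)                # contextual history
        return self.core.answer(user_query, context)   # grounded response


# Usage: feed a stream of frames and query the assistant at any point.
assistant = VinciPipelineSketch(EgoVideoVLStub())
for t in range(5):
    reply = assistant.step(frame=t, user_query="What should I do next?")
print(reply)
```

The bounded deque stands in for the memory module's core trade-off: keeping enough recent context for temporal grounding while capping per-frame cost for real-time operation on portable devices.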
Related papers
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model [49.90916095152366]
We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model. Vinci operates in an "always on" mode, continuously observing the environment to deliver seamless interaction and assistance. We release the complete implementation for development on the device, along with a demo web platform for testing uploaded videos.
arXiv Detail & Related papers (2024-12-30T16:57:05Z) - Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing [150.0380447353081]
We present VITRON, a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static images and dynamic videos.
Building on top of an LLM, VITRON incorporates encoders for images, videos, and pixel-level regional visuals within its modules, while employing state-of-the-art visual specialists as its backend.
arXiv Detail & Related papers (2024-10-08T08:39:04Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z) - How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES)
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z) - Video as the New Language for Real-World Decision Making [100.68643056416394]
Video data captures important information about the physical world that is difficult to express in language.
Video can serve as a unified interface that can absorb internet knowledge and represent diverse tasks.
We identify major impact opportunities in domains such as robotics, self-driving, and science.
arXiv Detail & Related papers (2024-02-27T02:05:29Z) - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes. Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes. We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z) - Towards Long-Form Video Understanding [7.962725903399016]
We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets.
A novel object-centric transformer-based video recognition architecture performs significantly better on 7 diverse tasks.
arXiv Detail & Related papers (2021-06-21T17:59:52Z)