AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding
- URL: http://arxiv.org/abs/2406.13807v2
- Date: Fri, 21 Jun 2024 09:53:41 GMT
- Title: AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding
- Authors: Alessandro Suglia, Claudio Greco, Katie Baker, Jose L. Part, Ioannis Papaioannou, Arash Eshghi, Ioannis Konstas, Oliver Lemon
- Abstract summary: Embodied AI personal assistants require embodied understanding to collaborate with humans effectively.
Current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric experience.
We introduce the Egocentric Video Understanding dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos.
We present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD.
- Score: 44.79843213164787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate AlanaVLM's capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%. Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the next generation of Embodied AI.
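The abstract describes AlanaVLM as a 7B-parameter VLM trained with parameter-efficient methods on EVUD, but the exact recipe is not given here. Below is a minimal sketch of one common parameter-efficient approach (LoRA adapters via Hugging Face PEFT); the base checkpoint name, target modules, and hyperparameters are assumptions for illustration, not the authors' actual configuration.
```python
# Minimal sketch of parameter-efficient (LoRA) fine-tuning for a ~7B model.
# Checkpoint name, target modules, and hyperparameters are illustrative
# assumptions, not the configuration used for AlanaVLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_id = "some-org/vlm-7b-base"  # hypothetical 7B video-language checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)

# Attach low-rank adapters to the attention projections; only these small
# adapter matrices are trained, while the 7B backbone stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# Training on EVUD-style (video, question, answer) examples would then use a
# standard causal language-modeling loss over the answer tokens.
```
After training, LoRA adapters can be merged back into the base weights, so a model of this kind adds no extra inference cost when deployed on robots or wearables.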
Related papers
- VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI [17.763461523794806]
VidEgoThink is a benchmark for evaluating egocentric video understanding capabilities in Embodied AI.
We design four key interrelated tasks: video question-answering, hierarchy planning, visual grounding and reward modeling.
We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs.
arXiv Detail & Related papers (2024-10-15T14:08:53Z)
- MM-Ego: Towards Building Egocentric Multimodal LLMs [72.47344411599322]
This research aims to explore building a multimodal foundation model for egocentric video understanding.
We develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data.
We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths.
arXiv Detail & Related papers (2024-10-09T17:59:59Z)
- EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation [54.32133648259802]
We present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.
Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo.
This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions.
arXiv Detail & Related papers (2024-06-26T05:01:37Z)
- GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition [48.686183248092476]
GPT4Ego is a straightforward yet remarkably potent VLM framework for zero-shot egocentric action recognition (ZS-EAR).
We show GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric video benchmarks.
arXiv Detail & Related papers (2024-01-18T15:04:46Z)
- Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions (a contrastive objective of this form is sketched after this list).
arXiv Detail & Related papers (2024-01-01T15:31:06Z)
- EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models [21.410065053609877]
Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks.
EgoThink is a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions.
arXiv Detail & Related papers (2023-11-27T07:44:25Z)
- Egocentric Video-Language Pretraining [74.04740069230692]
Video-Language Pretraining aims to learn transferable representations to advance a wide range of video-text downstream tasks.
We exploit the recently released Ego4D dataset to pioneer egocentric video-language pretraining along three directions.
We demonstrate strong performance on five egocentric downstream tasks across three datasets.
arXiv Detail & Related papers (2022-06-03T16:28:58Z)
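As a concrete illustration of the cross-view alignment idea described for EgoInstructor above, the following sketch implements a symmetric InfoNCE-style objective that aligns both egocentric and exocentric video features to shared text features. It is a rough rendering of the idea under assumed shapes and a chosen temperature, not the exact EgoExoNCE loss from that paper.
```python
# Rough sketch of an EgoExoNCE-style objective: ego and exo video features
# are both pulled toward the shared text features of matching actions.
# Shapes, temperature, and function names are assumptions for illustration.
import torch
import torch.nn.functional as F

def info_nce(video_feats: torch.Tensor, text_feats: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between a batch of paired video and text embeddings."""
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) cosine-similarity logits
    targets = torch.arange(v.size(0), device=v.device)  # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def ego_exo_text_loss(ego_feats, exo_feats, text_feats):
    # Aligning both views to the same text anchors implicitly pulls matching
    # egocentric and exocentric video features toward each other as well.
    return info_nce(ego_feats, text_feats) + info_nce(exo_feats, text_feats)

# Usage with dummy batches of 8 feature vectors of dimension 512:
loss = ego_exo_text_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
```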
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.