MM-Ego: Towards Building Egocentric Multimodal LLMs
- URL: http://arxiv.org/abs/2410.07177v1
- Date: Wed, 9 Oct 2024 17:59:59 GMT
- Title: MM-Ego: Towards Building Egocentric Multimodal LLMs
- Authors: Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, Bowen Zhang, Haoxuan You, Dan Xu, Zhe Gan, Jiasen Lu, Yinfei Yang
- Abstract summary: This research aims to explore building a multimodal foundation model for egocentric video understanding.
We develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data.
We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate models' ability to recognize and memorize visual details across videos of varying lengths.
- Score: 72.47344411599322
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This research aims to comprehensively explore building a multimodal foundation model for egocentric video understanding. To achieve this goal, we work on three fronts. First, as there is a lack of QA data for egocentric video understanding, we develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data. This is currently the largest egocentric QA dataset. Second, we contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability to recognize and memorize visual details across videos of varying lengths. We introduce a new de-biasing evaluation method to help mitigate the unavoidable language bias present in the models being evaluated. Third, we propose a specialized multimodal architecture featuring a novel "Memory Pointer Prompting" mechanism. This design includes a global glimpse step to gain an overarching understanding of the entire video and identify key visual information, followed by a fallback step that utilizes the key visual information to generate responses. This enables the model to more effectively comprehend extended video content. With the data, benchmark, and model, we successfully build MM-Ego, an egocentric multimodal LLM that shows strong performance on egocentric video understanding.
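The "Memory Pointer Prompting" mechanism is described above only at a high level. Below is a minimal, illustrative sketch of such a two-step "global glimpse, then fallback" flow; the function names, the cosine-similarity scoring of frames against the question, and the top-k pointer selection are assumptions made for this example, not MM-Ego's actual implementation.

```python
# Illustrative sketch of a "global glimpse -> fallback" flow, loosely modeled on the
# Memory Pointer Prompting idea described in the abstract. Function names, the
# cosine-similarity scoring, and the top-k selection are assumptions for this example.
import torch
import torch.nn.functional as F


def memory_pointer_prompting(frame_feats: torch.Tensor,
                             question_feat: torch.Tensor,
                             answer_model,
                             top_k: int = 16) -> str:
    """frame_feats: (num_frames, dim) visual features for the whole video.
    question_feat: (dim,) embedding of the user question.
    answer_model: any callable mapping (selected_feats, question_feat) -> answer string.
    """
    # Step 1: "global glimpse" -- score every frame against the question
    # to get an overarching view of where the relevant content lies.
    scores = F.cosine_similarity(frame_feats, question_feat.unsqueeze(0), dim=-1)

    # Step 2: keep pointers to the most relevant frames ("key visual information").
    k = min(top_k, frame_feats.shape[0])
    pointer_idx = scores.topk(k).indices.sort().values  # restore temporal order

    # Step 3: "fallback" -- answer using only the pointed-to frames,
    # which keeps long videos tractable for the language model.
    selected = frame_feats[pointer_idx]
    return answer_model(selected, question_feat)
```

Sorting the selected indices is a design choice in this sketch: it preserves the video's chronology when the pointed-to frames are handed to the answering step.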
Related papers
- VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI [17.763461523794806]
VidEgoThink is a benchmark for evaluating egocentric video understanding capabilities in Embodied AI.
We design four key interrelated tasks: video question-answering, hierarchy planning, visual grounding and reward modeling.
We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs.
arXiv Detail & Related papers (2024-10-15T14:08:53Z) - EAGLE: Egocentric AGgregated Language-video Engine [34.60423566630983]
Egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective.
We introduce the Eagle (Egocentric AGgregated Language-video Engine) model and the Eagle-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks.
arXiv Detail & Related papers (2024-09-26T04:17:27Z) - AMEGO: Active Memory from long EGOcentric videos [26.04157621755452]
We introduce AMEGO, a novel approach aimed at enhancing the comprehension of very-long egocentric videos.
Inspired by humans' ability to retain information from a single viewing, AMEGO focuses on constructing a self-contained representation from a single egocentric video.
This representation is semantic-free and facilitates multiple queries without the need to reprocess the entire visual content.
arXiv Detail & Related papers (2024-09-17T06:18:47Z) - AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding [44.79843213164787]
Embodied AI personal assistants require embodied understanding to collaborate with humans effectively.
Current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric experience.
We introduce the Egocentric Video Understanding dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos.
We present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD.
arXiv Detail & Related papers (2024-06-19T20:14:14Z) - CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z) - How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z) - WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning [49.72868038180909]
We present WorldQA, a video dataset designed to push the boundaries of multimodal world models.
We identify five essential types of world knowledge for question formulation.
We introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain.
arXiv Detail & Related papers (2024-05-06T08:42:34Z) - Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions.
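The EgoExoNCE loss is only named above. As a rough sketch under stated assumptions, an InfoNCE-style objective that aligns both egocentric and exocentric clip features to the text feature of their shared action description could look like the following; the symmetric cross-entropy formulation and the temperature value are illustrative assumptions, not the paper's exact loss.

```python
# Rough, illustrative InfoNCE-style objective in the spirit of the EgoExoNCE
# description above: ego and exo clip features are each aligned to the text
# feature of the shared action description. The symmetric formulation and the
# temperature are assumptions for this sketch, not the paper's exact loss.
import torch
import torch.nn.functional as F


def ego_exo_text_nce(ego: torch.Tensor, exo: torch.Tensor, text: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """ego, exo, text: (batch, dim); row i of each tensor describes the same action."""
    ego, exo, text = (F.normalize(x, dim=-1) for x in (ego, exo, text))
    targets = torch.arange(ego.shape[0], device=ego.device)

    def nce(video: torch.Tensor) -> torch.Tensor:
        logits = video @ text.T / temperature  # (batch, batch) similarity matrix
        # Matching video/text pairs sit on the diagonal and act as the positives.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))

    # Aligning both views to the same text anchor indirectly pulls the ego and
    # exo features toward each other.
    return 0.5 * (nce(ego) + nce(exo))
```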
arXiv Detail & Related papers (2024-01-01T15:31:06Z) - Grounded Question-Answering in Long Egocentric Videos [39.281013854331285]
Open-ended question-answering (QA) in long egocentric videos allows individuals or robots to inquire about their own past visual experiences.
This task presents unique challenges, including the complexity of temporally grounding queries within extensive video content.
Our proposed approach tackles these challenges by integrating query grounding and answering within a unified model to reduce error propagation.
arXiv Detail & Related papers (2023-12-11T16:31:55Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and gates that pass on the more relevant information (a toy gating sketch follows this entry).
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
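As a toy illustration of the question-conditioned frame gating idea mentioned in the summary above (the single linear scoring layer and the weighted pooling are assumptions for this example, not the paper's architecture):

```python
# Toy frame-selection gate: each frame feature is weighted by a learned,
# question-conditioned relevance score before pooling. Layer sizes and the
# pooling scheme are illustrative assumptions, not the paper's actual model.
import torch
import torch.nn as nn


class FrameSelectionGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # relevance of a frame to the question

    def forward(self, frame_feats: torch.Tensor, question_feat: torch.Tensor):
        # frame_feats: (num_frames, dim); question_feat: (dim,)
        q = question_feat.unsqueeze(0).expand_as(frame_feats)
        gate = torch.sigmoid(self.score(torch.cat([frame_feats, q], dim=-1)))
        # Frames judged irrelevant to the question are suppressed before pooling.
        return (gate * frame_feats).sum(dim=0) / gate.sum().clamp_min(1e-6)
```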
arXiv Detail & Related papers (2020-05-13T16:35:27Z)