MM-Ego: Towards Building Egocentric Multimodal LLMs
- URL: http://arxiv.org/abs/2410.07177v1
- Date: Wed, 9 Oct 2024 17:59:59 GMT
- Title: MM-Ego: Towards Building Egocentric Multimodal LLMs
- Authors: Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, Bowen Zhang, Haoxuan You, Dan Xu, Zhe Gan, Jiasen Lu, Yinfei Yang
- Abstract summary: This research aims to explore building a multimodal foundation model for egocentric video understanding.
We develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data.
We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate models' ability to recognize and memorize visual details across videos of varying lengths.
- Score: 72.47344411599322
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This research aims to comprehensively explore building a multimodal foundation model for egocentric video understanding. To achieve this goal, we work on three fronts. First, as there is a lack of QA data for egocentric video understanding, we develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data. This is currently the largest egocentric QA dataset. Second, we contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability to recognize and memorize visual details across videos of varying lengths. We introduce a new de-biasing evaluation method to help mitigate the unavoidable language bias present in the models being evaluated. Third, we propose a specialized multimodal architecture featuring a novel "Memory Pointer Prompting" mechanism. This design includes a global glimpse step to gain an overarching understanding of the entire video and identify key visual information, followed by a fallback step that utilizes the key visual information to generate responses. This enables the model to more effectively comprehend extended video content. With the data, benchmark, and model, we successfully build MM-Ego, an egocentric multimodal LLM that shows strong performance on egocentric video understanding.
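The "Memory Pointer Prompting" mechanism is described above only at a high level. Below is a minimal, illustrative sketch of such a two-step "global glimpse, then fallback" flow; the function names, the cosine-similarity scoring of frames against the question, and the top-k pointer selection are assumptions made for this example, not MM-Ego's actual implementation.

```python
# Illustrative sketch of a "global glimpse -> fallback" flow, loosely modeled on the
# Memory Pointer Prompting idea described in the abstract. Function names, the
# cosine-similarity scoring, and the top-k selection are assumptions for this example.
import torch
import torch.nn.functional as F


def memory_pointer_prompting(frame_feats: torch.Tensor,
                             question_feat: torch.Tensor,
                             answer_model,
                             top_k: int = 16) -> str:
    """frame_feats: (num_frames, dim) visual features for the whole video.
    question_feat: (dim,) embedding of the user question.
    answer_model: any callable mapping (selected_feats, question_feat) -> answer string.
    """
    # Step 1: "global glimpse" -- score every frame against the question
    # to get an overarching view of where the relevant content lies.
    scores = F.cosine_similarity(frame_feats, question_feat.unsqueeze(0), dim=-1)

    # Step 2: keep pointers to the most relevant frames ("key visual information").
    k = min(top_k, frame_feats.shape[0])
    pointer_idx = scores.topk(k).indices.sort().values  # restore temporal order

    # Step 3: "fallback" -- answer using only the pointed-to frames,
    # which keeps long videos tractable for the language model.
    selected = frame_feats[pointer_idx]
    return answer_model(selected, question_feat)
```

Sorting the selected indices is a design choice in this sketch: it preserves the video's chronology when the pointed-to frames are handed to the answering step.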
Related papers
- VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI [17.763461523794806]
VidEgoThink is a benchmark for evaluating egocentric video understanding capabilities in Embodied AI.
We design four key interrelated tasks: video question-answering, hierarchy planning, visual grounding and reward modeling.
We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs.
arXiv Detail & Related papers (2024-10-15T14:08:53Z) - EAGLE: Egocentric AGgregated Language-video Engine [34.60423566630983]
Egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective.
We introduce the Eagle (Egocentric AGgregated Language-video Engine) model and the Eagle-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks.
arXiv Detail & Related papers (2024-09-26T04:17:27Z) - AMEGO: Active Memory from long EGOcentric videos [26.04157621755452]
We introduce AMEGO, a novel approach aimed at enhancing the comprehension of very-long egocentric videos.
Inspired by humans' ability to retain information from a single viewing, AMEGO focuses on constructing a self-contained representation from a single egocentric video.
This representation is semantic-free and facilitates multiple queries without the need to reprocess the entire visual content.
arXiv Detail & Related papers (2024-09-17T06:18:47Z) - AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding [44.79843213164787]
Embodied AI personal assistants require embodied understanding to collaborate with humans effectively.
Current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric experience.
We introduce the Egocentric Video Understanding dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos.
We present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD.
arXiv Detail & Related papers (2024-06-19T20:14:14Z) - CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z) - How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z) - WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning [49.72868038180909]
We present WorldQA, a video dataset designed to push the boundaries of multimodal world models.
We identify five essential types of world knowledge for question formulation.
We introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain.
arXiv Detail & Related papers (2024-05-06T08:42:34Z) - Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions.
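The EgoExoNCE loss is only named above. As a rough sketch under stated assumptions, an InfoNCE-style objective that aligns both egocentric and exocentric clip features to the text feature of their shared action description could look like the following; the symmetric cross-entropy formulation and the temperature value are illustrative assumptions, not the paper's exact loss.

```python
# Rough, illustrative InfoNCE-style objective in the spirit of the EgoExoNCE
# description above: ego and exo clip features are each aligned to the text
# feature of the shared action description. The symmetric formulation and the
# temperature are assumptions for this sketch, not the paper's exact loss.
import torch
import torch.nn.functional as F


def ego_exo_text_nce(ego: torch.Tensor, exo: torch.Tensor, text: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """ego, exo, text: (batch, dim); row i of each tensor describes the same action."""
    ego, exo, text = (F.normalize(x, dim=-1) for x in (ego, exo, text))
    targets = torch.arange(ego.shape[0], device=ego.device)

    def nce(video: torch.Tensor) -> torch.Tensor:
        logits = video @ text.T / temperature  # (batch, batch) similarity matrix
        # Matching video/text pairs sit on the diagonal and act as the positives.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))

    # Aligning both views to the same text anchor indirectly pulls the ego and
    # exo features toward each other.
    return 0.5 * (nce(ego) + nce(exo))
```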
arXiv Detail & Related papers (2024-01-01T15:31:06Z) - Grounded Question-Answering in Long Egocentric Videos [39.281013854331285]
Open-ended question-answering (QA) in long egocentric videos allows individuals or robots to inquire about their own past visual experiences.
This task presents unique challenges, including the complexity of temporally grounding queries within extensive video content.
Our proposed approach tackles these challenges by integrating query grounding and answering within a unified model to reduce error propagation.
arXiv Detail & Related papers (2023-12-11T16:31:55Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and gates that pass on the more relevant information (a toy gating sketch follows this entry).
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
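As a toy illustration of the question-conditioned frame gating idea mentioned in the summary above (the single linear scoring layer and the weighted pooling are assumptions for this example, not the paper's architecture):

```python
# Toy frame-selection gate: each frame feature is weighted by a learned,
# question-conditioned relevance score before pooling. Layer sizes and the
# pooling scheme are illustrative assumptions, not the paper's actual model.
import torch
import torch.nn as nn


class FrameSelectionGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # relevance of a frame to the question

    def forward(self, frame_feats: torch.Tensor, question_feat: torch.Tensor):
        # frame_feats: (num_frames, dim); question_feat: (dim,)
        q = question_feat.unsqueeze(0).expand_as(frame_feats)
        gate = torch.sigmoid(self.score(torch.cat([frame_feats, q], dim=-1)))
        # Frames judged irrelevant to the question are suppressed before pooling.
        return (gate * frame_feats).sum(dim=0) / gate.sum().clamp_min(1e-6)
```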
arXiv Detail & Related papers (2020-05-13T16:35:27Z)