From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities
- URL: http://arxiv.org/abs/2501.05711v2
- Date: Tue, 25 Mar 2025 17:59:08 GMT
- Title: From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities
- Authors: Dominick Reilly, Manish Kumar Govind, Le Xue, Srijan Das
- Abstract summary: We aim to leverage the complementary nature of egocentric views to enhance LVLM's understanding of exocentric ADL videos. While effective, this approach requires paired ego-exo videos, which are impractical to collect at scale. To enhance the ego representation of LVLMs trained on synthetic data, we develop a domain-agnostic bootstrapped ego2exo strategy.
- Score: 7.952665773362793
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision Language Models (LVLMs) have demonstrated impressive capabilities in video understanding, yet their adoption for Activities of Daily Living (ADL) remains limited by their inability to capture fine-grained interactions and spatial relationships. To address this, we aim to leverage the complementary nature of egocentric views to enhance LVLM's understanding of exocentric ADL videos. Consequently, we propose ego2exo knowledge distillation to learn ego-augmented exo representations. While effective, this approach requires paired ego-exo videos, which are impractical to collect at scale. To address this, we propose Skeleton-guided Synthetic Ego Generation (SK-EGO), which leverages human skeleton motion to generate synthetic ego views from exocentric videos. To enhance the ego representation of LVLMs trained on synthetic data, we develop a domain-agnostic bootstrapped ego2exo strategy that effectively transfers knowledge from real ego-exo pairs to synthetic ego-exo pairs, while mitigating domain misalignment. We find that the exo representations of our ego-augmented LVLMs successfully learn to extract ego-perspective cues, demonstrated through comprehensive evaluation on six ADL benchmarks and our proposed Ego-in-Exo PerceptionMCQ benchmark designed specifically to assess egocentric understanding from exocentric videos. Code, models, and data will be open-sourced at https://github.com/dominickrei/EgoExo4ADL.
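The abstract does not spell out the exact form of the ego2exo distillation objective, so the following is a minimal sketch of one plausible formulation: a cosine-alignment term that pulls exocentric (student) clip features toward paired egocentric (teacher) features, where the ego features could come from real footage or from SK-EGO synthetic ego views. All function and variable names below are illustrative assumptions, not the paper's actual API.

```python
import torch
import torch.nn.functional as F

def ego2exo_distillation_loss(exo_feats: torch.Tensor,
                              ego_feats: torch.Tensor) -> torch.Tensor:
    """Pull exocentric (student) features toward paired egocentric (teacher) features.

    A minimal feature-level distillation term: the ego features act as fixed
    targets, and the exo branch is trained to match them. Inputs are assumed
    to be (batch, dim) clip-level embeddings from the two views.
    """
    ego_feats = ego_feats.detach()                     # teacher supplies targets only
    exo_feats = F.normalize(exo_feats, dim=-1)
    ego_feats = F.normalize(ego_feats, dim=-1)
    # Cosine distance between each exo embedding and its paired ego embedding.
    return (1.0 - (exo_feats * ego_feats).sum(dim=-1)).mean()

# Toy usage: random embeddings stand in for paired exo and (real or SK-EGO synthetic) ego features.
exo = torch.randn(8, 768, requires_grad=True)
ego = torch.randn(8, 768)
loss = ego2exo_distillation_loss(exo, ego)
loss.backward()                                        # gradients flow only into the exo branch
print(float(loss))
```

In the paper's setting, a term like this would presumably be added to the LVLM's standard training objective, with SK-EGO supplying the synthetic ego view whenever real paired ego footage is unavailable.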
Related papers
- EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos [49.24266108952835]
Given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video.
EgoExo-Gen explicitly models the hand-object dynamics for cross-view video prediction.
arXiv Detail & Related papers (2025-04-16T03:12:39Z) - Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding [69.96199605596138]
Current MLLMs primarily focus on third-person (exocentric) vision, overlooking the unique aspects of first-person (egocentric) videos.
We propose learning the mapping between exocentric and egocentric domains to enhance egocentric video understanding.
We introduce Ego-ExoClip, a pre-training dataset comprising 1.1M synchronized ego-exo clip-text pairs.
arXiv Detail & Related papers (2025-03-12T08:10:33Z) - EgoMe: A New Dataset and Challenge for Following Me via Egocentric View in Real World [12.699670048897085]
In human imitation learning, the imitator typically takes the egocentric view as a benchmark, naturally transferring behaviors observed from an exocentric view to their own.
We introduce EgoMe, which aims to follow the process of human imitation learning via the imitator's egocentric view in the real world.
Our dataset includes 7902 paired exo-ego videos spanning diverse daily behaviors in various real-world scenarios.
arXiv Detail & Related papers (2025-01-31T11:48:22Z) - Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning [80.37314291927889]
We present EMBED, a method designed to transform exocentric video-language data for egocentric video representation learning.
Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities.
By applying both vision and language style transfer, our framework creates a new egocentric dataset.
arXiv Detail & Related papers (2024-08-07T06:10:45Z) - EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions? [48.702973928321946]
We introduce a novel asymmetric contrastive objective for EgoHOI named EgoNCE++.
Our experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks.
arXiv Detail & Related papers (2024-05-28T00:27:29Z) - EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World [44.34800426136217]
We introduce EgoExoLearn, a dataset that emulates the human demonstration following process.
EgoExoLearn contains egocentric and demonstration video data spanning 120 hours.
We present benchmarks such as cross-view association, cross-view action planning, and cross-view referenced skill assessment.
arXiv Detail & Related papers (2024-03-24T15:00:44Z) - Intention-driven Ego-to-Exo Video Generation [16.942040396018736]
Ego-to-exo video generation refers to generating the corresponding exocentric video according to the egocentric video.
This paper proposes an Intention-Driven Ego-to-exo video generation framework (IDE) that leverages action description as a view-independent representation.
We conduct experiments on the relevant dataset with diverse exo-ego video pairs, and our IDE outperforms state-of-the-art models in both subjective and objective assessments.
arXiv Detail & Related papers (2024-03-14T09:07:31Z) - Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions.
arXiv Detail & Related papers (2024-01-01T15:31:06Z) - EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding [90.9111678470214]
We propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features from a sparse set of frames together with the camera-wearer's head motion.
Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models.
We demonstrate its effectiveness on the Ego4D and EPICKitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
arXiv Detail & Related papers (2023-01-05T18:39:23Z) - Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos [92.38049744463149]
We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets.
Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties.
Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models.
arXiv Detail & Related papers (2021-04-16T06:10:10Z)