Related papers: EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding

EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding

URL: http://arxiv.org/abs/2301.02217v1
Date: Thu, 5 Jan 2023 18:39:23 GMT
Title: EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding
Authors: Shuhan Tan, Tushar Nagarajan, Kristen Grauman
Abstract summary: We propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features. Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models. We demonstrate its effectiveness on the Ego4D and EPICKitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
Score: 90.9111678470214
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in egocentric video understanding models are promising, but their heavy computational expense is a barrier for many real-world applications. To address this challenge, we propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features by combining the semantics from a sparse set of video frames with the head motion from lightweight IMU readings. We further devise a novel self-supervised training strategy for IMU feature learning. Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models. We demonstrate its effectiveness on the Ego4D and EPICKitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.

Related papers

Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding [69.96199605596138]
Current MLLMs primarily focus on third-person (exocentric) vision, overlooking the unique aspects of first-person (egocentric) videos. We propose learning the mapping between exocentric and egocentric domains to enhance egocentric video understanding. We introduce Ego-ExoClip, a pre-training dataset comprising 1.1M synchronized ego-exo clip-text pairs.
arXiv Detail & Related papers (2025-03-12T08:10:33Z)
COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition [3.271109623410664]
We propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations. Our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications.
arXiv Detail & Related papers (2025-03-10T12:43:51Z)
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning [71.02843679746563]
In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. We propose EgoVideo, a model with a new lightweight motion adapter to capture fine-grained hand-object motion information.
arXiv Detail & Related papers (2025-03-02T18:49:48Z)
MM-Ego: Towards Building Egocentric Multimodal LLMs [72.47344411599322]
This research aims to explore building a multimodal foundation model for egocentric video understanding. We develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data. We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths.
arXiv Detail & Related papers (2024-10-09T17:59:59Z)
EAGLE: Egocentric AGgregated Language-video Engine [34.60423566630983]
We introduce the Eagle (Egocentric AGgregated Language-video Engine) model and the Eagle-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks. Egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective.
arXiv Detail & Related papers (2024-09-26T04:17:27Z)
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation [57.38965505987893]
Ego-VPA is a parameter-efficient adaptation for egocentric video tasks. Ego-VPA excels in lightweight adaptation with only 0.84% learnable parameters.
arXiv Detail & Related papers (2024-07-28T16:01:32Z)
DL-KDD: Dual-Light Knowledge Distillation for Action Recognition in the Dark [2.941253902145271]
We propose a teacher-student video classification framework, named Dual-Light KnowleDge Distillation for Action Recognition in the Dark (DL-KDD) This framework enables the model to learn from both original and enhanced video without introducing additional computational cost during inference. In our experiments, the proposed DL-KDD framework outperforms state-of-the-art methods on the ARID, ARID V1.5, and Dark-48 datasets.
arXiv Detail & Related papers (2024-06-04T16:38:06Z)
EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions? [48.702973928321946]
We introduce a novel asymmetric contrastive objective for EgoHOI named EgoNCE++. Our experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks.
arXiv Detail & Related papers (2024-05-28T00:27:29Z)
Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos. We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions.
arXiv Detail & Related papers (2024-01-01T15:31:06Z)
LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning [27.661804052577825]
We introduce a novel problem -- egocentric action frame generation. The goal is to synthesize an image depicting an action in the user's context (i.e., action frame) by conditioning on a user prompt and an input egocentric image.
arXiv Detail & Related papers (2023-12-06T19:02:40Z)
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer [5.7254320553764]
E-ViLM is able to learn expressive representations from Video-Language corpus and generalize well to extensive Video-Language tasks. Our model reaches $39.3$% Top-$1$ accuracy on the MSRVTT benchmark, retaining $91.4$% of the accuracy of state-of-the-art larger VL architecture.
arXiv Detail & Related papers (2023-11-28T22:57:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.