EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding
- URL: http://arxiv.org/abs/2301.02217v1
- Date: Thu, 5 Jan 2023 18:39:23 GMT
- Title: EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding
- Authors: Shuhan Tan, Tushar Nagarajan, Kristen Grauman
- Abstract summary: We propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features.
Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models.
We demonstrate its effectiveness on the Ego4D and EPIC-Kitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
- Score: 90.9111678470214
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in egocentric video understanding models are promising, but
their heavy computational expense is a barrier for many real-world
applications. To address this challenge, we propose EgoDistill, a
distillation-based approach that learns to reconstruct heavy egocentric video
clip features by combining the semantics from a sparse set of video frames with
the head motion from lightweight IMU readings. We further devise a novel
self-supervised training strategy for IMU feature learning. Our method leads to
significant improvements in efficiency, requiring 200x fewer GFLOPs than
equivalent video models. We demonstrate its effectiveness on the Ego4D and
EPIC-Kitchens datasets, where our method outperforms state-of-the-art efficient
video understanding methods.
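The abstract describes the method only at a high level. The following is a minimal, illustrative sketch of the central idea, distilling the clip feature of a heavy (frozen) video teacher into a lightweight student that fuses a sparse-frame feature with an IMU head-motion feature. All module names, dimensions, and the cosine regression loss are assumptions for illustration rather than the authors' implementation, and the paper's self-supervised IMU pre-training stage is not shown.

```python
# Minimal sketch of feature distillation from a heavy video teacher into a
# lightweight frame+IMU student (illustrative names, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameIMUStudent(nn.Module):
    """Fuses a sparse-frame feature with an IMU (head motion) feature."""
    def __init__(self, frame_dim=512, imu_channels=6, feat_dim=768):
        super().__init__()
        # Lightweight image backbone stand-in; any small 2D CNN / ViT works.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, frame_dim))
        # IMU encoder over (accelerometer + gyroscope) x T readings.
        self.imu_encoder = nn.Sequential(
            nn.Conv1d(imu_channels, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        self.fuse = nn.Linear(frame_dim + feat_dim, feat_dim)

    def forward(self, frame, imu):
        z = torch.cat([self.frame_encoder(frame), self.imu_encoder(imu)], dim=-1)
        return self.fuse(z)

def distill_loss(student_feat, teacher_feat):
    # Regress the frozen heavy video model's clip feature.
    return 1.0 - F.cosine_similarity(student_feat, teacher_feat, dim=-1).mean()

# Usage: frames (B,3,H,W) sampled sparsely, imu (B,6,T), teacher feature (B,768).
student = FrameIMUStudent()
frames, imu = torch.randn(4, 3, 224, 224), torch.randn(4, 6, 200)
teacher_feat = torch.randn(4, 768)           # stand-in for the heavy teacher output
loss = distill_loss(student(frames, imu), teacher_feat)
loss.backward()
```

At inference only the lightweight student runs, which is where the efficiency gain over running the heavy clip model comes from.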
Related papers
- MM-Ego: Towards Building Egocentric Multimodal LLMs [72.47344411599322]
This research aims to explore building a multimodal foundation model for egocentric video understanding.
We develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data.
We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate models' ability to recognize and memorize visual details across videos of varying lengths.
arXiv Detail & Related papers (2024-10-09T17:59:59Z)
- EAGLE: Egocentric AGgregated Language-video Engine [34.60423566630983]
We introduce the Eagle (Egocentric AGgregated Language-video Engine) model and the Eagle-400K dataset to provide a unified framework that integrates various egocentric video understanding tasks.
Egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective.
arXiv Detail & Related papers (2024-09-26T04:17:27Z)
- VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach that reduces vision compute by letting redundant vision tokens skip layers rather than decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
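The entry above describes the mechanism only in words. Below is a generic mixture-of-depths sketch, assuming a learned router that lets just the top-k tokens per example be processed by a transformer layer while the remaining tokens skip it unchanged; the names (MoDLayer, keep_ratio) and the routing rule are illustrative assumptions, not VideoLLM-MoD's actual implementation.

```python
# Generic mixture-of-depths sketch: a router picks the top-k vision tokens to
# run through the layer; the remaining tokens skip it unchanged.
import torch
import torch.nn as nn

class MoDLayer(nn.Module):
    def __init__(self, dim=256, keep_ratio=0.25):
        super().__init__()
        self.router = nn.Linear(dim, 1)          # per-token importance score
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, tokens):                   # tokens: (B, N, D)
        B, N, D = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.router(tokens).squeeze(-1)            # (B, N)
        topk = scores.topk(k, dim=1).indices                 # kept-token indices
        idx = topk.unsqueeze(-1).expand(-1, -1, D)
        selected = tokens.gather(1, idx)                      # (B, k, D)
        # Gate by the router score so the router receives gradients.
        gate = torch.sigmoid(scores.gather(1, topk)).unsqueeze(-1)
        processed = gate * self.block(selected) + (1 - gate) * selected
        # Scatter processed tokens back; skipped tokens pass through untouched.
        out = tokens.clone()
        out.scatter_(1, idx, processed)
        return out

# Usage on a long sequence of vision tokens.
layer = MoDLayer(dim=256, keep_ratio=0.25)
out = layer(torch.randn(2, 1024, 256))           # -> (2, 1024, 256)
```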
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
- Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation [57.38965505987893]
Ego-VPA is a parameter-efficient adaptation for egocentric video tasks.
Ego-VPA excels in lightweight adaptation with only 0.84% learnable parameters.
arXiv Detail & Related papers (2024-07-28T16:01:32Z)
- DL-KDD: Dual-Light Knowledge Distillation for Action Recognition in the Dark [2.941253902145271]
We propose a teacher-student video classification framework, named Dual-Light KnowleDge Distillation for Action Recognition in the Dark (DL-KDD).
This framework enables the model to learn from both original and enhanced video without introducing additional computational cost during inference.
In our experiments, the proposed DL-KDD framework outperforms state-of-the-art methods on the ARID, ARID V1.5, and Dark-48 datasets.
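The summary gives only the high-level idea of DL-KDD. Below is one plausible, heavily simplified instantiation of dual-input distillation, assuming a frozen teacher that sees an enhanced (brightened) copy of the clip and a student that sees the original dark clip; the function name, loss weighting, temperature, and toy enhancement step are assumptions, not the authors' design.

```python
# Sketch of dual-input knowledge distillation for dark-video action recognition
# (assumed instantiation): the frozen teacher sees an enhanced clip, the student
# sees the original dark clip, and only the student is used at inference.
import torch
import torch.nn.functional as F

def dual_light_kd_step(student, teacher, dark_clip, enhanced_clip, labels,
                       T=4.0, alpha=0.5):
    with torch.no_grad():
        teacher_logits = teacher(enhanced_clip)
    student_logits = student(dark_clip)
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1.0 - alpha) * kd

# Toy usage with linear classifiers over flattened clips (stand-ins only).
student = torch.nn.Linear(3 * 8 * 32 * 32, 10)
teacher = torch.nn.Linear(3 * 8 * 32 * 32, 10)
dark = torch.rand(4, 3 * 8 * 32 * 32) * 0.2      # low-light clip, flattened
enhanced = (dark * 4.0).clamp(0, 1)              # crude brightness boost as a stand-in
labels = torch.randint(0, 10, (4,))
loss = dual_light_kd_step(student, teacher, dark, enhanced, labels)
loss.backward()
```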
arXiv Detail & Related papers (2024-06-04T16:38:06Z)
- EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions? [48.702973928321946]
We introduce a novel asymmetric contrastive objective for EgoHOI named EgoNCE++.
Our experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks.
arXiv Detail & Related papers (2024-05-28T00:27:29Z)
- Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions.
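The EgoExoNCE loss is only described in words here; the sketch below shows one way such an objective could look, assuming a standard InfoNCE form in which egocentric and exocentric video embeddings are both contrasted against the same shared text embeddings. The function name and temperature are illustrative, and the paper's exact formulation may differ.

```python
# Sketch of an InfoNCE-style objective in which ego and exo video embeddings
# are both pulled toward shared text embeddings of the same action (assumed form).
import torch
import torch.nn.functional as F

def shared_text_nce(ego_feat, exo_feat, text_feat, temperature=0.07):
    # All inputs are (B, D); row i of each tensor describes the same action.
    ego = F.normalize(ego_feat, dim=-1)
    exo = F.normalize(exo_feat, dim=-1)
    txt = F.normalize(text_feat, dim=-1)
    targets = torch.arange(ego.size(0), device=ego.device)
    loss_ego = F.cross_entropy(ego @ txt.t() / temperature, targets)
    loss_exo = F.cross_entropy(exo @ txt.t() / temperature, targets)
    # Anchoring both views to the same text features draws ego and exo
    # representations toward each other indirectly.
    return 0.5 * (loss_ego + loss_exo)

# Usage with random embeddings.
loss = shared_text_nce(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
```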
arXiv Detail & Related papers (2024-01-01T15:31:06Z)
- LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning [27.661804052577825]
We introduce a novel problem -- egocentric action frame generation.
The goal is to synthesize an image depicting an action in the user's context (i.e., action frame) by conditioning on a user prompt and an input egocentric image.
arXiv Detail & Related papers (2023-12-06T19:02:40Z)
- E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer [5.7254320553764]
E-ViLM is able to learn expressive representations from a Video-Language corpus and generalize well to extensive Video-Language tasks.
Our model reaches 39.3% Top-1 accuracy on the MSRVTT benchmark, retaining 91.4% of the accuracy of the state-of-the-art larger VL architecture.
arXiv Detail & Related papers (2023-11-28T22:57:17Z)