Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation
- URL: http://arxiv.org/abs/2407.19520v1
- Date: Sun, 28 Jul 2024 16:01:32 GMT
- Title: Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation
- Authors: Tz-Ying Wu, Kyle Min, Subarna Tripathi, Nuno Vasconcelos
- Abstract summary: Ego-VPA is a parameter-efficient adaptation method for egocentric video tasks.
Ego-VPA excels in lightweight adaptation with only 0.84% learnable parameters.
- Score: 57.38965505987893
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video understanding typically requires fine-tuning a large backbone when adapting to new domains. In this paper, we leverage egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose a parameter-efficient adaptation for egocentric video tasks, namely Ego-VPA. It employs a local sparse approximation of each video frame/text feature using basis prompts, and the selected basis prompts are used to synthesize video/text prompts. Since the basis prompts are shared across frames and modalities, it models context fusion and cross-modal transfer efficiently. Experiments show that Ego-VPA excels in lightweight adaptation (with only 0.84% learnable parameters), substantially improving over baselines and reaching the performance of full fine-tuning.
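To make the mechanism concrete, here is a minimal, hypothetical sketch of prompt synthesis via local sparse approximation over a shared basis-prompt bank, as described in the abstract. The module name, dimensions, and top-k selection policy are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasisPromptBank(nn.Module):
    """Hypothetical sketch: a small bank of learnable basis prompts shared
    across video frames and text tokens, used to synthesize prompts via a
    local sparse (top-k) approximation of each feature."""

    def __init__(self, n_basis: int = 32, dim: int = 512, top_k: int = 4):
        super().__init__()
        # Basis prompts shared across frames and modalities; sharing is what
        # keeps the number of learnable parameters small.
        self.basis = nn.Parameter(torch.randn(n_basis, dim) * 0.02)
        self.top_k = top_k

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (batch, seq, dim) frame or text features from a frozen backbone.
        Returns synthesized prompts of the same shape."""
        # Similarity between each feature and every basis prompt.
        sim = F.normalize(feats, dim=-1) @ F.normalize(self.basis, dim=-1).t()
        # Local sparse approximation: keep only the top-k basis prompts per
        # feature and renormalize their weights.
        weights, idx = sim.topk(self.top_k, dim=-1)            # (B, S, k)
        weights = weights.softmax(dim=-1)
        selected = self.basis[idx]                              # (B, S, k, dim)
        return (weights.unsqueeze(-1) * selected).sum(-2)       # (B, S, dim)

# Usage sketch: synthesize video prompts from frozen frame features; only the
# prompt bank would be trained during adaptation.
bank = BasisPromptBank()
frame_feats = torch.randn(2, 16, 512)   # e.g., 16 frames per clip
video_prompts = bank(frame_feats)
```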
Related papers
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images.
We propose to recognize human attributes from video frames so as to fully exploit temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
- X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization [56.75782714530429]
We propose a cross-modal adaptation framework, which we call X-MIC.
Our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space.
This results in an enhanced alignment of text embeddings to each egocentric video, leading to a significant improvement in cross-dataset generalization.
arXiv Detail & Related papers (2024-03-28T19:45:35Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (see the sketch after this list).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone [67.13773226242242]
Video-language pre-training can generalize to various vision and language tasks.
Previous video-language pre-training frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning.
The new generation of egocentric video-language pre-training incorporates cross-modal fusion directly into the video and language backbones.
arXiv Detail & Related papers (2023-07-11T17:50:15Z)
- CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z)
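Below is a small, hypothetical sketch of the CLIP-score-guided frame selection summarized in the VaQuitA entry above. It assumes frame and text embeddings have already been produced by a CLIP-style encoder; the function name and the top-N policy are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def select_frames_by_clip_score(frame_embs: torch.Tensor,
                                text_emb: torch.Tensor,
                                n_keep: int = 8) -> torch.Tensor:
    """frame_embs: (T, D) per-frame CLIP image embeddings.
    text_emb: (D,) CLIP text embedding of the query/instruction.
    Returns indices of the n_keep highest-scoring frames, in temporal order."""
    # Cosine similarity between every frame and the text query.
    scores = F.normalize(frame_embs, dim=-1) @ F.normalize(text_emb, dim=-1)
    top = scores.topk(min(n_keep, frame_embs.size(0))).indices
    # Re-sort the selected indices so the kept frames stay in temporal order.
    return top.sort().values

# Usage with dummy embeddings: rank 64 candidate frames and keep the 8 most
# query-relevant ones instead of sampling uniformly.
frames = torch.randn(64, 512)
query = torch.randn(512)
kept = select_frames_by_clip_score(frames, query, n_keep=8)
```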