X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization
- URL: http://arxiv.org/abs/2403.19811v1
- Date: Thu, 28 Mar 2024 19:45:35 GMT
- Title: X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization
- Authors: Anna Kukleva, Fadime Sener, Edoardo Remelli, Bugra Tekin, Eric Sauser, Bernt Schiele, Shugao Ma
- Abstract summary: We propose a cross-modal adaptation framework, which we call X-MIC.
Our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space.
This results in an enhanced alignment of text embeddings to each egocentric video, leading to a significant improvement in cross-dataset generalization.
- Score: 56.75782714530429
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lately, there has been growing interest in adapting vision-language models (VLMs) to image and third-person video classification due to their success in zero-shot recognition. However, the adaptation of these models to egocentric videos has been largely unexplored. To address this gap, we propose a simple yet effective cross-modal adaptation framework, which we call X-MIC. Using a video adapter, our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space. Our novel adapter architecture retains and improves generalization of the pre-trained VLMs by disentangling learnable temporal modeling and frozen visual encoder. This results in an enhanced alignment of text embeddings to each egocentric video, leading to a significant improvement in cross-dataset generalization. We evaluate our approach on the Epic-Kitchens, Ego4D, and EGTEA datasets for fine-grained cross-dataset action generalization, demonstrating the effectiveness of our method. Code is available at https://github.com/annusha/xmic
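To make the adaptation scheme in the abstract concrete, below is a minimal PyTorch-style sketch under stated assumptions (CLIP-like frozen image and text encoders, additive conditioning of text embeddings on the video, illustrative module names and dimensions). It is not the authors' implementation, only one plausible reading of "aligning frozen text embeddings to each egocentric video" with temporal modeling kept separate from the frozen visual encoder.

```python
# Minimal sketch of cross-modal instance conditioning (assumptions: CLIP-like frozen
# encoders, additive conditioning, illustrative names/dims; not the X-MIC code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class XMICStyleAdapter(nn.Module):
    def __init__(self, frozen_visual_encoder, frozen_text_embeds, dim=512):
        super().__init__()
        self.visual = frozen_visual_encoder                 # frozen per-frame encoder
        for p in self.visual.parameters():
            p.requires_grad = False
        self.register_buffer("text_embeds", frozen_text_embeds)  # (num_classes, dim), frozen
        # Learnable temporal modeling, disentangled from the frozen visual backbone.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=1,
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, frames):                              # frames: (B, T, C, H, W)
        B, T = frames.shape[:2]
        with torch.no_grad():                               # visual encoder stays frozen
            feats = self.visual(frames.flatten(0, 1)).view(B, T, -1)
        video_token = self.proj(self.temporal(feats).mean(dim=1))        # (B, dim)
        # Condition each frozen text embedding on this particular video instance.
        adapted_text = F.normalize(self.text_embeds.unsqueeze(0)
                                   + video_token.unsqueeze(1), dim=-1)   # (B, num_classes, dim)
        video_embed = F.normalize(feats.mean(dim=1), dim=-1)             # (B, dim)
        return torch.einsum("bcd,bd->bc", adapted_text, video_embed)     # class logits
```

Training would then apply a standard cross-entropy over these similarity logits, with gradients reaching only the temporal adapter and projection.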
Related papers
- Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation [57.38965505987893]
Ego-VPA is a parameter-efficient adaptation for egocentric video tasks.
Ego-VPA excels in lightweight adaptation with only 0.84% learnable parameters.
arXiv Detail & Related papers (2024-07-28T16:01:32Z)
- Object Aware Egocentric Online Action Detection [23.504280692701272]
We introduce an Object-Aware Module that integrates egocentric-specific priors into existing Online Action Detection frameworks.
Our work can be seamlessly integrated into existing models with minimal overhead and bring consistent performance enhancements.
arXiv Detail & Related papers (2024-06-03T07:58:40Z)
- EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions? [48.702973928321946]
We introduce a novel asymmetric contrastive objective for EgoHOI named EgoNCE++.
Our experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks.
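The summary above mentions an asymmetric contrastive objective. Below is only a generic illustration of how a video-text contrastive loss can be made asymmetric (the function name, tensor shapes, and the use of extra text negatives in just one direction are assumptions); the actual EgoNCE++ formulation is described in the paper.

```python
# Generic asymmetric video-text InfoNCE sketch (illustrative, not EgoNCE++ itself).
import torch
import torch.nn.functional as F

def asymmetric_video_text_loss(video_emb, text_emb, extra_text_negs, tau=0.07):
    """video_emb, text_emb: (B, D) matched pairs; extra_text_negs: (B, N, D)
    additional text negatives used only in the video-to-text direction."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    n = F.normalize(extra_text_negs, dim=-1)
    targets = torch.arange(v.size(0), device=v.device)

    # video -> text: in-batch texts plus per-sample extra negatives (the asymmetry).
    logits_v2t = torch.cat([v @ t.T, torch.einsum("bd,bnd->bn", v, n)], dim=1) / tau
    # text -> video: plain in-batch contrastive term.
    logits_t2v = (t @ v.T) / tau
    return 0.5 * (F.cross_entropy(logits_v2t, targets) + F.cross_entropy(logits_t2v, targets))
```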
arXiv Detail & Related papers (2024-05-28T00:27:29Z)
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on static images.
We propose to recognize human attributes from video frames, which makes full use of temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
- Object-based (yet Class-agnostic) Video Domain Adaptation [78.34712426922519]
We present Object-based (yet Class-agnostic) Video Domain Adaptation (ODAPT).
ODAPT is a simple yet effective framework for adapting the existing action recognition systems to new domains.
Our model achieves a +6.5 increase when adapting across kitchens in Epic-Kitchens and a +3.1 increase when adapting between Epic-Kitchens and the EGTEA dataset.
arXiv Detail & Related papers (2023-11-29T01:17:38Z)
- Text-image Alignment for Diffusion-based Perception [12.98777134700767]
Diffusion models are generative models with impressive text-to-image synthesis capabilities.
It is unclear how to use the prompting interface when applying diffusion backbones to vision tasks.
We find that automatically generated captions can improve text-image alignment and significantly enhance a model's cross-attention maps.
arXiv Detail & Related papers (2023-09-29T05:16:41Z)
- Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models [67.31684040281465]
We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification.
In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram.
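As a rough sketch of the idea summarized above (hypothetical interfaces; average fusion and treating each modality as a single image-like tensor are simplifications, not the MOV implementation), one frozen vision encoder can score all modalities against class-name text embeddings:

```python
# Illustrative sketch: one frozen CLIP-style vision encoder reused for RGB frames,
# optical flow, and audio spectrograms; open-vocabulary scores via text embeddings.
import torch
import torch.nn.functional as F

@torch.no_grad()
def open_vocab_scores(vision_encoder, text_embeds, rgb, flow, spec):
    """rgb, flow, spec: image-like tensors (B, C, H, W); text_embeds: (num_classes, D)."""
    embeds = [F.normalize(vision_encoder(x), dim=-1) for x in (rgb, flow, spec)]
    fused = F.normalize(torch.stack(embeds).mean(dim=0), dim=-1)   # simple average fusion
    return fused @ F.normalize(text_embeds, dim=-1).T              # (B, num_classes)
```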
arXiv Detail & Related papers (2022-07-15T17:59:11Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose STswinCL, a novel framework that explores complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)