Egocentric Video-Language Pretraining
- URL: http://arxiv.org/abs/2206.01670v1
- Date: Fri, 3 Jun 2022 16:28:58 GMT
- Title: Egocentric Video-Language Pretraining
- Authors: Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray,
Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie
Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike
Zheng Shou
- Abstract summary: Video-Language Pretraining aims to learn transferable representations that advance a wide range of video-text downstream tasks.
We exploit the recently released Ego4D dataset to pioneer Egocentric VLP along three directions.
We demonstrate strong performance on five egocentric downstream tasks across three datasets.
- Score: 74.04740069230692
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video-Language Pretraining (VLP), which aims to learn transferable
representations to advance a wide range of video-text downstream tasks, has
recently received increasing attention. Dominant works that achieve strong performance rely on
large-scale, 3rd-person video-text datasets, such as HowTo100M. In this work,
we exploit the recently released Ego4D dataset to pioneer Egocentric VLP along
three directions. (i) We create EgoClip, a 1st-person video-text pretraining
dataset comprising 3.8M clip-text pairs well-chosen from Ego4D, covering a
large variety of human daily activities. (ii) We propose a novel pretraining
objective, dubbed EgoNCE, which adapts video-text contrastive learning to the
egocentric domain by mining egocentric-aware positive and negative samples.
(iii) We introduce EgoMCQ, a development benchmark that is close to EgoClip and
hence can support effective validation and fast exploration of our design
decisions regarding EgoClip and EgoNCE. Furthermore, we demonstrate strong
performance on five egocentric downstream tasks across three datasets:
video-text retrieval on EPIC-KITCHENS-100; action recognition on Charades-Ego;
and natural language query, moment query, and object state change
classification on Ego4D challenge benchmarks. The dataset and code will be
available at https://github.com/showlab/EgoVLP.
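The EgoNCE objective extends the standard symmetric InfoNCE video-text loss with egocentric-aware positives and negatives. The snippet below is a minimal PyTorch sketch of the multi-positive part of such a loss, not the authors' released implementation: the function name, the temperature value, and the construction of `pos_mask` (e.g. narrations sharing a verb and a noun with the anchor clip) are illustrative assumptions, and the scene-aware hard-negative term described in the paper is omitted.

```python
# Minimal sketch (not the authors' code) of an EgoNCE-style video-text
# contrastive loss with multiple positives per clip.
import torch
import torch.nn.functional as F

def egonce_style_loss(video_emb, text_emb, pos_mask, temperature=0.05):
    """video_emb, text_emb: (B, D) L2-normalized clip / narration embeddings.
    pos_mask: (B, B) bool; pos_mask[i, j] is True if narration j counts as a
    positive for clip i (the diagonal must always be True)."""
    sim = video_emb @ text_emb.t() / temperature                  # (B, B) logits
    # Video-to-text: -log( sum over positives of exp(sim) / sum over all of exp(sim) )
    pos = sim.masked_fill(~pos_mask, float("-inf"))
    v2t = torch.logsumexp(sim, dim=1) - torch.logsumexp(pos, dim=1)
    # Symmetric text-to-video term.
    pos_t = sim.t().masked_fill(~pos_mask.t(), float("-inf"))
    t2v = torch.logsumexp(sim.t(), dim=1) - torch.logsumexp(pos_t, dim=1)
    return 0.5 * (v2t.mean() + t2v.mean())

# Toy usage: with only the diagonal marked positive, this reduces to the
# standard symmetric InfoNCE loss over a batch of clip-narration pairs.
if __name__ == "__main__":
    B, D = 4, 256
    v = F.normalize(torch.randn(B, D), dim=-1)
    t = F.normalize(torch.randn(B, D), dim=-1)
    mask = torch.eye(B, dtype=torch.bool)
    print(egonce_style_loss(v, t, mask).item())
```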
Related papers
- EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation [30.350824860817536]
EgoVid-5M is the first high-quality dataset curated for egocentric video generation.
We introduce EgoDreamer, which is capable of generating egocentric videos driven simultaneously by action descriptions and kinematic control signals.
arXiv Detail & Related papers (2024-11-13T07:05:40Z)
- EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation [54.32133648259802]
We present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.
Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo.
This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions.
arXiv Detail & Related papers (2024-06-26T05:01:37Z)
- EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions? [48.702973928321946]
We introduce a novel asymmetric contrastive objective for EgoHOI named EgoNCE++.
Our experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks.
arXiv Detail & Related papers (2024-05-28T00:27:29Z)
- Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions.
arXiv Detail & Related papers (2024-01-01T15:31:06Z)
- Egocentric Video-Language Pretraining @ Ego4D Challenge 2022 [74.04740069230692]
We propose a video-language pretraining solution (EgoVLP) for four Ego4D challenge tasks.
Building on the three designs described above (EgoClip, EgoNCE, and EgoMCQ), we develop a pretrained video-language model that transfers its egocentric video-text representation to several downstream video tasks.
arXiv Detail & Related papers (2022-07-04T12:47:16Z)
- Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos [92.38049744463149]
We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets.
Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties.
Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models.
arXiv Detail & Related papers (2021-04-16T06:10:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.