Egocentric Video-Language Pretraining @ Ego4D Challenge 2022
- URL: http://arxiv.org/abs/2207.01622v1
- Date: Mon, 4 Jul 2022 12:47:16 GMT
- Title: Egocentric Video-Language Pretraining @ Ego4D Challenge 2022
- Authors: Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray,
Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie
Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike
Zheng Shou
- Abstract summary: We propose a video-language pretraining solution \cite{kevin2022egovlp} for four Ego4D challenge tasks.
Based on the above three designs, we develop a pretrained video-language model that is able to transfer its egocentric video-text representation to several video downstream tasks.
- Score: 74.04740069230692
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we propose a video-language pretraining (VLP) based solution
\cite{kevin2022egovlp} for four Ego4D challenge tasks: Natural Language Query (NLQ),
Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization
(PNR). In particular, we exploit the recently released Ego4D dataset
\cite{grauman2021ego4d} to pioneer Egocentric VLP along three directions: the
pretraining dataset, the pretraining objective, and the development set. Based on
these three designs, we develop a pretrained video-language model that is able to
transfer its egocentric video-text representation or video-only representation to
several video downstream tasks. Our Egocentric VLP achieves 10.46 R@1 at IoU=0.3 on
NLQ, 10.33 mAP on MQ, 74% accuracy on OSCC, and a 0.67 s error on PNR.
The code is available at https://github.com/showlab/EgoVLP.
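The abstract does not spell out the pretraining objective, so the following is only a hedged, minimal PyTorch sketch of the generic symmetric InfoNCE-style video-text contrastive loss that two-tower VLP models of this kind typically optimize; the function name, tensor shapes, and temperature value are illustrative assumptions rather than the authors' implementation.

  # Minimal sketch (assumption, not the authors' code): a symmetric
  # InfoNCE-style video-text contrastive loss for a two-tower VLP model.
  import torch
  import torch.nn.functional as F

  def video_text_contrastive_loss(video_emb, text_emb, temperature=0.05):
      # video_emb, text_emb: (batch, dim) outputs of separate video/text encoders.
      v = F.normalize(video_emb, dim=-1)  # map to cosine-similarity space
      t = F.normalize(text_emb, dim=-1)
      logits = v @ t.T / temperature      # (batch, batch) similarity matrix
      targets = torch.arange(v.size(0), device=v.device)  # matched pairs lie on the diagonal
      loss_v2t = F.cross_entropy(logits, targets)    # video-to-text direction
      loss_t2v = F.cross_entropy(logits.T, targets)  # text-to-video direction
      return 0.5 * (loss_v2t + loss_t2v)

  # Toy usage with random features standing in for encoder outputs.
  video_feats = torch.randn(8, 256)
  text_feats = torch.randn(8, 256)
  print(video_text_contrastive_loss(video_feats, text_feats).item())

EgoVLP's actual objective, EgoNCE, extends this symmetric formulation with egocentric-specific positive and negative sampling; the exact implementation is available in the repository linked above.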
Related papers
- EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation [54.32133648259802]
We present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.
Building upon a two-tower video-language model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo.
This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions.
arXiv Detail & Related papers (2024-06-26T05:01:37Z)
- ObjectNLQ @ Ego4D Episodic Memory Challenge 2024 [51.57555556405898]
We present our approach for the Natural Language Query track and Goal Step track of the Ego4D Episodic Memory Benchmark at CVPR 2024.
Both challenges require the localization of actions within long video sequences using textual queries.
We introduce a novel approach, termed ObjectNLQ, which incorporates an object branch to augment the video representation with detailed object information.
arXiv Detail & Related papers (2024-06-22T07:57:58Z)
- HCQA @ Ego4D EgoSchema Challenge 2024 [51.57555556405898]
We propose a novel scheme for egocentric video Question Answering, named HCQA.
It consists of three stages: Fine-grained Caption Generation, Context-driven Summarization, and Inference-guided Answering.
On a blind test set, HCQA achieves 75% accuracy in answering over 5,000 multiple-choice questions.
arXiv Detail & Related papers (2024-06-22T07:20:39Z)
- GroundNLQ @ Ego4D Natural Language Queries Challenge 2023 [73.12670280220992]
To accurately ground natural-language queries in a video, an effective egocentric feature extractor and a powerful grounding model are required.
We leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations.
In addition, we introduce a novel grounding model GroundNLQ, which employs a multi-modal multi-scale grounding module.
arXiv Detail & Related papers (2023-06-27T07:27:52Z)
- Exploring adaptation of VideoMAE for Audio-Visual Diarization & Social @ Ego4d Looking at me Challenge [5.429147779652134]
VideoMAE is a data-efficient model for self-supervised video pretraining.
We show that the representation transferred from VideoMAE provides strong spatio-temporal modeling.
arXiv Detail & Related papers (2022-11-17T06:49:57Z)
- A Simple Transformer-Based Model for Ego4D Natural Language Queries Challenge [8.674624972031387]
This report describes our submission to the Ego4D Natural Language Queries (NLQ) Challenge.
Our solution inherits the point-based event representation from our prior work on temporal action localization, and develops a Transformer-based model for video grounding.
Without bells and whistles, our submission based on a single model achieves 12.64% Mean R@1 and is ranked 2nd on the public leaderboard.
arXiv Detail & Related papers (2022-11-16T06:33:37Z)
- Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022 [22.299810960572348]
We propose a video-language pretraining solution \cite{kevin2022egovlp} for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge.
Our best single model obtains strong performance on the challenge test set with 47.39% mAP and 61.44% nDCG.
arXiv Detail & Related papers (2022-07-04T11:32:48Z)
- Egocentric Video-Language Pretraining [74.04740069230692]
Video-Language Pretraining aims to learn transferable representations that advance a wide range of video-text downstream tasks.
We exploit the recently released Ego4D dataset to pioneer Egocentric training along three directions.
We demonstrate strong performance on five egocentric downstream tasks across three datasets.
arXiv Detail & Related papers (2022-06-03T16:28:58Z)