InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges
- URL: http://arxiv.org/abs/2211.09529v1
- Date: Thu, 17 Nov 2022 13:45:06 GMT
- Title: InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges
- Authors: Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu,
Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei
Huang, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang,
Limin Wang, Yu Qiao
- Abstract summary: We present our champion solutions to five tracks at the Ego4D challenge.
We leverage InternVideo, a video foundation model we developed, for five Ego4D tasks.
InternVideo-Ego4D is an effective paradigm for adapting a strong foundation model to downstream egocentric video understanding tasks.
- Score: 66.62885923201543
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this report, we present our champion solutions to five tracks of the Ego4D
challenge. We leverage InternVideo, a video foundation model we developed, for
five Ego4D tasks: Moment Queries, Natural Language Queries, Future
Hand Prediction, State Change Object Detection, and Short-term Object
Interaction Anticipation. InternVideo-Ego4D is an effective paradigm for adapting
a strong foundation model to downstream egocentric video understanding
tasks with simple head designs. On these five tasks, InternVideo-Ego4D
comprehensively surpasses both the baseline methods and the champion solutions of
CVPR 2022, demonstrating the powerful representation ability of
InternVideo as a video foundation model. Our code will be released at
https://github.com/OpenGVLab/ego4d-eccv2022-solutions
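To make the "foundation model with simple head designs" paradigm from the abstract concrete, below is a minimal sketch of attaching a lightweight task head to a pretrained video backbone. All names, the pooling-based placeholder encoder, the feature dimension, and the class count are illustrative assumptions, not the actual InternVideo-Ego4D architecture or its released code; whether the backbone is frozen or fine-tuned per track is not specified here, and freezing it simply keeps the sketch small.

```python
# Minimal sketch of the "foundation model + simple head" adaptation pattern.
# Everything here (class names, the pooling placeholder encoder, feat_dim,
# num_classes) is an illustrative assumption, not the paper's actual design.
import torch
import torch.nn as nn


class FrozenVideoBackbone(nn.Module):
    """Stand-in for a pretrained video foundation model that yields clip features."""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        # Placeholder encoder: global average pooling over space/time plus a
        # linear projection. A real system would load pretrained weights instead.
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.proj = nn.Linear(3, feat_dim)
        for p in self.parameters():
            p.requires_grad = False  # keep the foundation model frozen

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, channels=3, frames, height, width) -> (batch, feat_dim)
        pooled = self.pool(clips).flatten(1)  # (batch, 3)
        return self.proj(pooled)


class SimpleTaskHead(nn.Module):
    """Lightweight task-specific head trained on top of the clip features."""

    def __init__(self, feat_dim: int = 768, num_classes: int = 10):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.LayerNorm(feat_dim),
            nn.Linear(feat_dim, num_classes),  # num_classes is task-dependent; 10 is illustrative
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.classifier(feats)


if __name__ == "__main__":
    backbone, head = FrozenVideoBackbone(), SimpleTaskHead()
    clips = torch.randn(2, 3, 16, 224, 224)   # two dummy 16-frame RGB clips
    with torch.no_grad():
        feats = backbone(clips)               # feature extraction with the backbone
    logits = head(feats)                      # only the head would be trained here
    print(logits.shape)                       # torch.Size([2, 10])
```

The report's five tracks would naturally plug in different heads; the sketch only illustrates the adaptation pattern the abstract describes, namely reusing foundation-model clip features under a small task-specific head.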
Related papers
- EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation [54.32133648259802]
We present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.
Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo.
This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions.
arXiv Detail & Related papers (2024-06-26T05:01:37Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- VideoPrism: A Foundational Visual Encoder for Video Understanding [90.01845485201746]
VideoPrism is a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text.
We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.
arXiv Detail & Related papers (2024-02-20T18:29:49Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- Exploring adaptation of VideoMAE for Audio-Visual Diarization & Social @ Ego4d Looking at me Challenge [5.429147779652134]
VideoMAE is a data-efficient model for self-supervised video pretraining.
We show that the representation transferred from VideoMAE exhibits strong spatio-temporal modeling ability.
arXiv Detail & Related papers (2022-11-17T06:49:57Z)
- Egocentric Video-Language Pretraining [74.04740069230692]
Video-Language Pretraining aims to learn transferable representations to advance a wide range of video-text downstream tasks.
We exploit the recently released Ego4D dataset to pioneer egocentric video-language pretraining along three directions.
We demonstrate strong performance on five egocentric downstream tasks across three datasets.
arXiv Detail & Related papers (2022-06-03T16:28:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.