InternVideo: General Video Foundation Models via Generative and
Discriminative Learning
- URL: http://arxiv.org/abs/2212.03191v2
- Date: Wed, 7 Dec 2022 12:20:55 GMT
- Title: InternVideo: General Video Foundation Models via Generative and
Discriminative Learning
- Authors: Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao,
Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan,
Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao
- Abstract summary: We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets spanning a wide range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications.
- Score: 52.69422763715118
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation models have recently shown excellent performance on a
variety of downstream tasks in computer vision. However, most existing vision
foundation models focus only on image-level pretraining and adaptation, which
limits them on dynamic and complex video-level understanding tasks. To fill
the gap, we present general video foundation models, InternVideo, by taking
advantage of both generative and discriminative self-supervised video learning.
Specifically, InternVideo efficiently explores masked video modeling and
video-language contrastive learning as the pretraining objectives, and
selectively coordinates video representations of these two complementary
frameworks in a learnable manner to boost various video applications. Without
bells and whistles, InternVideo achieves state-of-the-art performance on 39
video datasets spanning a wide range of tasks, including video action
recognition/detection, video-language alignment, and open-world video
applications. In particular, our method obtains 91.1% and 77.2% top-1 accuracy
on the challenging Kinetics-400 and Something-Something V2 benchmarks,
respectively. All of these results demonstrate the generality of our
InternVideo for video understanding. The code will be released at
https://github.com/OpenGVLab/InternVideo .
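As a rough illustration of the paper's central idea, the sketch below fuses features from a masked-video-modeling branch and a video-language contrastive branch through a learnable weight. The single-gate fusion, module names, and feature dimensions are assumptions made for illustration only; InternVideo's actual coordination mechanism is more involved.

```python
import torch
import torch.nn as nn

class LearnableCoordination(nn.Module):
    """Toy fusion of features from two video encoders: one trained with
    masked video modeling (generative), one with video-language contrastive
    learning (discriminative). The single learnable gate is an illustrative
    assumption, not InternVideo's actual cross-model interaction."""

    def __init__(self, mvm_dim: int, vlc_dim: int, out_dim: int):
        super().__init__()
        self.proj_mvm = nn.Linear(mvm_dim, out_dim)
        self.proj_vlc = nn.Linear(vlc_dim, out_dim)
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable mixing weight

    def forward(self, feat_mvm: torch.Tensor, feat_vlc: torch.Tensor) -> torch.Tensor:
        w = torch.sigmoid(self.alpha)  # keep the mixing weight in (0, 1)
        return w * self.proj_mvm(feat_mvm) + (1 - w) * self.proj_vlc(feat_vlc)

# Usage: fuse clip-level features produced by the two pretraining branches.
fuse = LearnableCoordination(mvm_dim=768, vlc_dim=512, out_dim=512)
video_repr = fuse(torch.randn(4, 768), torch.randn(4, 512))  # shape: (4, 512)
```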
Related papers
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple: it randomly drops input video patches and masks out input text during the post-pretraining procedure.
Our method achieves state-of-the-art performance comparable to heavily pretrained video foundation models.
arXiv Detail & Related papers (2023-10-30T14:06:16Z)
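The post-pretraining recipe described in the entry above, randomly dropping video patches and masking text tokens, can be sketched as follows. The keep ratio, masking probability, and tensor shapes are assumed values for illustration, not the paper's settings.

```python
import torch

def drop_video_patches(patches: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Randomly keep a subset of video patch tokens.
    patches: (batch, num_patches, dim); the keep ratio is an assumed hyper-parameter."""
    b, n, d = patches.shape
    n_keep = max(1, int(n * keep_ratio))
    idx = torch.rand(b, n).argsort(dim=1)[:, :n_keep]  # random subset per sample
    return patches.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

def mask_text_tokens(tokens: torch.Tensor, mask_id: int, mask_prob: float = 0.15) -> torch.Tensor:
    """BERT-style random masking of text token ids; the probability is assumed."""
    mask = torch.rand(tokens.shape) < mask_prob
    return tokens.masked_fill(mask, mask_id)

# Example: 8 clips x 196 patches x 768-dim features, with 32-token captions.
video_tokens = drop_video_patches(torch.randn(8, 196, 768), keep_ratio=0.5)
text_tokens = mask_text_tokens(torch.randint(5, 30000, (8, 32)), mask_id=103)
```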
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z)
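As context for how CLIP representations can be transferred to video, the sketch below encodes frames independently with a caller-supplied image encoder and mean-pools them over time. The pooling is only an illustrative stand-in; VLAB's actual feature adapting and blending modules are more sophisticated.

```python
import torch
import torch.nn as nn

def clip_features_to_video(frame_encoder: nn.Module, video: torch.Tensor) -> torch.Tensor:
    """Encode each frame with a frozen image encoder (e.g. CLIP's visual tower,
    supplied by the caller) and mean-pool over time. Mean pooling is an
    illustrative stand-in for learned temporal adapters.
    video: (batch, time, channels, height, width)."""
    b, t = video.shape[:2]
    frames = video.flatten(0, 1)           # (batch*time, C, H, W)
    with torch.no_grad():
        feats = frame_encoder(frames)      # (batch*time, dim)
    return feats.view(b, t, -1).mean(dim=1)  # (batch, dim)

# Usage with a stand-in encoder; swap in a real CLIP visual backbone.
dummy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))
video_feat = clip_features_to_video(dummy_encoder, torch.randn(2, 8, 3, 224, 224))
```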
- VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation [43.90887811621963]
We propose a new two-stage pre-training framework for video-to-text generation tasks such as video captioning and question answering.
A generative encoder-decoder model is first jointly pre-trained on massive image-language data to learn fundamental concepts.
As a result, our VideoOFA model achieves new state-of-the-art performance on four video captioning benchmarks.
arXiv Detail & Related papers (2023-05-04T23:27:21Z)
- Broaden Your Views for Self-Supervised Video Learning [97.52216510672251]
We introduce BraVe, a self-supervised learning framework for video.
In BraVe, one of the views has access to a narrow temporal window of the video while the other view has broad access to the video content.
We demonstrate that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks.
arXiv Detail & Related papers (2021-03-30T17:58:46Z)
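The narrow/broad view construction described in the BraVe entry above can be approximated with the sampling sketch below; the window length and the strided broad view are assumptions made for illustration, not BraVe's exact augmentation pipeline.

```python
import torch

def narrow_and_broad_views(video: torch.Tensor, narrow_len: int = 16):
    """Sample two BraVe-style views from a clip laid out as
    (time, channels, height, width). The broad view here is a strided
    sample over the whole clip; window sizes are assumed."""
    t = video.shape[0]
    start = torch.randint(0, t - narrow_len + 1, (1,)).item()
    narrow = video[start:start + narrow_len]   # short, contiguous window
    stride = max(1, t // narrow_len)
    broad = video[::stride][:narrow_len]       # sparse frames spanning the clip
    return narrow, broad

# Example: a 64-frame clip yields a 16-frame narrow view and a 16-frame broad view.
narrow_view, broad_view = narrow_and_broad_views(torch.randn(64, 3, 112, 112))
```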