Harvest Video Foundation Models via Efficient Post-Pretraining
- URL: http://arxiv.org/abs/2310.19554v1
- Date: Mon, 30 Oct 2023 14:06:16 GMT
- Title: Harvest Video Foundation Models via Efficient Post-Pretraining
- Authors: Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, Limin Wang, Yu Qiao, Ping Luo
- Abstract summary: We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple: randomly drop input video patches and mask out input text during the post-pretraining procedure.
Our method achieves state-of-the-art performance, comparable to some heavily pretrained video foundation models.
- Score: 67.30842563833185
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building video-language foundation models is costly and difficult due to the
redundant nature of video data and the lack of high-quality video-language
datasets. In this paper, we propose an efficient framework to harvest video
foundation models from image ones. Our method is intuitively simple: we randomly
drop input video patches and mask out input text during the post-pretraining
procedure. Patch dropping significantly boosts training efficiency, while text
masking enforces the learning of cross-modal fusion. We
conduct extensive experiments to validate the effectiveness of our method on a
wide range of video-language downstream tasks including various zero-shot
tasks, video question answering, and video-text retrieval. Despite its
simplicity, our method achieves state-of-the-art performance, comparable to
some heavily pretrained video foundation models. Our method is
extremely efficient and can be trained in less than one day on 8 GPUs,
requiring only WebVid-10M as pretraining data. We hope our method can serve as
a simple yet strong counterpart for prevalent video foundation models, provide
useful insights when building them, and make large pretrained models more
accessible and sustainable. This is part of the InternVideo project
(https://github.com/OpenGVLab/InternVideo).
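Below is a minimal PyTorch sketch of the two operations the abstract describes: randomly dropping video patch tokens to cut compute, and masking text tokens so the model must rely on cross-modal fusion to recover them. It is not the authors' released implementation; the keep ratio, mask probability, tensor shapes, and function names are illustrative assumptions.

    # Minimal sketch of patch dropping and text masking (illustrative, not the paper's code).
    import torch

    def drop_video_patches(patch_tokens: torch.Tensor, keep_ratio: float = 0.3) -> torch.Tensor:
        """Keep a random subset of video patch tokens per sample.
        patch_tokens: (batch, num_patches, dim) embeddings from a frozen image backbone."""
        b, n, d = patch_tokens.shape
        n_keep = max(1, int(n * keep_ratio))
        scores = torch.rand(b, n, device=patch_tokens.device)    # random per-patch scores
        keep_idx = scores.topk(n_keep, dim=1).indices            # indices of kept patches
        keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, d)
        return patch_tokens.gather(1, keep_idx)                  # (batch, n_keep, dim)

    def mask_text_tokens(token_ids: torch.Tensor, mask_token_id: int,
                         mask_prob: float = 0.5, pad_token_id: int = 0):
        """Replace a random subset of non-padding text tokens with the mask token.
        Returns corrupted ids and reconstruction targets (-100 at unmasked positions)."""
        maskable = token_ids != pad_token_id
        mask = (torch.rand_like(token_ids, dtype=torch.float) < mask_prob) & maskable
        corrupted = token_ids.masked_fill(mask, mask_token_id)
        targets = token_ids.masked_fill(~mask, -100)
        return corrupted, targets

    if __name__ == "__main__":
        video_patches = torch.randn(2, 8 * 196, 768)   # e.g. 8 frames x 196 patches, dim 768
        text_ids = torch.randint(5, 1000, (2, 32))
        kept = drop_video_patches(video_patches, keep_ratio=0.3)
        corrupted, targets = mask_text_tokens(text_ids, mask_token_id=103, mask_prob=0.5)
        print(kept.shape, corrupted.shape, int((targets != -100).sum()))

Because the fusion layers only ever see the kept patches, the per-step cost scales roughly with the keep ratio, which is where the training-efficiency gain described in the abstract comes from.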
Related papers
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025] (2024-10-14T12:35:12Z)
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
- Unlearning Concepts from Text-to-Video Diffusion Models [4.640452023364898] (2024-07-19T11:15:02Z)
We propose a novel concept-unlearning method by transferring the unlearning capability of the text encoder of text-to-image diffusion models to text-to-video diffusion models.
Our method can unlearn copyrighted cartoon characters, artist styles, objects and people's facial characteristics.
- Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data [19.210471935816273] (2024-07-18T01:55:48Z)
We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD) and a new Feint6K dataset.
To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning.
Our approach successfully learns more discriminative action embeddings and improves results on Feint6K when applied to multiple video-text models.
- Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks [6.925770576386087] (2023-10-07T20:57:54Z)
We propose a detailed study on the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting.
Our experiments show that image-text models exhibit impressive performance on video action recognition (video AR), video retrieval (video RT), and video multiple choice (video MC).
These findings shed light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811] (2023-03-28T15:39:28Z)
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118] (2022-12-06T18:09:49Z)
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as its pretraining objectives (a generic sketch of these two objective families appears after this list).
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
- Revealing Single Frame Bias for Video-and-Language Learning [115.01000652123882] (2022-06-07T16:28:30Z)
We show that a single-frame trained model can achieve better performance than existing methods that use multiple frames for training.
This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets.
We propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling.
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339] (2021-12-17T15:55:53Z)
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
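As a companion to the InternVideo entry above, here is a generic sketch of the two objective families it names, masked video modeling and video-language contrastive learning. It is an illustrative combination under assumed tensor shapes, masking ratio, and equal loss weighting, not InternVideo's actual implementation.

    # Generic sketch of masked video modeling + video-text contrastive objectives
    # (illustrative assumptions; not InternVideo's released code).
    import torch
    import torch.nn.functional as F

    def masked_video_modeling_loss(pred, target, mask):
        """Reconstruction loss computed on masked patches only.
        pred, target: (batch, num_patches, dim); mask: (batch, num_patches) bool."""
        per_patch = F.mse_loss(pred, target, reduction="none").mean(-1)
        return (per_patch * mask).sum() / mask.sum().clamp(min=1)

    def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
        """Symmetric InfoNCE between pooled video and text embeddings, (batch, dim) each."""
        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = v @ t.T / temperature
        labels = torch.arange(v.size(0), device=v.device)
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

    if __name__ == "__main__":
        b, n, d = 4, 196, 768
        mask = torch.rand(b, n) < 0.75                      # assumed masking ratio
        mvm = masked_video_modeling_loss(torch.randn(b, n, d), torch.randn(b, n, d), mask)
        vtc = video_text_contrastive_loss(torch.randn(b, d), torch.randn(b, d))
        total = mvm + vtc                                    # assumed equal weighting
        print(float(mvm), float(vtc), float(total))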