VindLU: A Recipe for Effective Video-and-Language Pretraining
- URL: http://arxiv.org/abs/2212.05051v2
- Date: Wed, 5 Apr 2023 17:56:15 GMT
- Title: VindLU: A Recipe for Effective Video-and-Language Pretraining
- Authors: Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas
Bertasius
- Abstract summary: This paper conducts an empirical study demystifying the most important factors in the VidL model design.
Using these empirical insights, we then develop a step-by-step recipe, dubbed VindLU, for effective VidL pretraining.
Our model, trained using this recipe, achieves results comparable to or better than the state of the art on several VidL tasks.
- Score: 83.49216853881595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The last several years have witnessed remarkable progress in
video-and-language (VidL) understanding. However, most modern VidL approaches
use complex and specialized model architectures and sophisticated pretraining
protocols, making the reproducibility, analysis and comparisons of these
frameworks difficult. Hence, instead of proposing yet another new VidL model,
this paper conducts a thorough empirical study demystifying the most important
factors in the VidL model design. Among the factors that we investigate are (i)
the spatiotemporal architecture design, (ii) the multimodal fusion schemes,
(iii) the pretraining objectives, (iv) the choice of pretraining data, (v)
pretraining and finetuning protocols, and (vi) dataset and model scaling. Our
empirical study reveals that the most important design factors include:
temporal modeling, video-to-text multimodal fusion, masked modeling objectives,
and joint training on images and videos. Using these empirical insights, we
then develop a step-by-step recipe, dubbed VindLU, for effective VidL
pretraining. Our final model trained using our recipe achieves results
comparable to or better than the state of the art on several VidL tasks without relying on
external CLIP pretraining. In particular, on the text-to-video retrieval task,
our approach obtains 61.2% on DiDeMo, and 55.0% on ActivityNet, outperforming
current SOTA by 7.8% and 6.1% respectively. Furthermore, our model also obtains
state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA,
MSRVTT-MC and TVQA. Our code and pretrained models are publicly available at:
https://github.com/klauscc/VindLU.
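
To make the reported design factors more concrete, the following is a minimal, illustrative PyTorch sketch of two of them: temporal modeling over frame tokens and video-to-text multimodal fusion via cross-attention. It is not the authors' implementation (that is available at the GitHub link above); all module names, tensor shapes, and hyperparameters here are hypothetical.

```python
# Illustrative sketch only (not VindLU's code): temporal attention over frame
# tokens plus video-to-text cross-attention fusion, two design factors the
# paper highlights. Shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Self-attention across the time axis of per-frame patch tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, frames, patches, dim)
        b, t, p, d = video_tokens.shape
        # Attend over frames independently at each spatial patch position.
        x = video_tokens.permute(0, 2, 1, 3).reshape(b * p, t, d)
        xn = self.norm(x)
        attn_out, _ = self.attn(xn, xn, xn)
        x = x + attn_out  # residual connection
        return x.reshape(b, p, t, d).permute(0, 2, 1, 3)


class VideoToTextFusion(nn.Module):
    """Text tokens attend to flattened video tokens via cross-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (batch, text_len, dim); video_tokens: (batch, frames, patches, dim)
        b, t, p, d = video_tokens.shape
        kv = self.norm_kv(video_tokens.reshape(b, t * p, d))
        q = self.norm_q(text_tokens)
        fused, _ = self.cross_attn(q, kv, kv)
        return text_tokens + fused  # residual connection


if __name__ == "__main__":
    dim = 768
    video = torch.randn(2, 4, 196, dim)  # 2 clips, 4 frames, 14x14 patches each
    text = torch.randn(2, 32, dim)       # 2 captions, 32 text tokens each
    video = TemporalAttention(dim)(video)
    fused = VideoToTextFusion(dim)(text, video)
    print(fused.shape)  # torch.Size([2, 32, 768])
```

In this sketch, temporal attention is factored out from spatial attention (attending over frames at each patch position), and fusion injects video context into the text stream rather than the reverse, mirroring the paper's finding that video-to-text fusion is the more important direction.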
Related papers
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z) - PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning [78.23573511641548]
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications.
Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources.
This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for video understanding.
arXiv Detail & Related papers (2024-04-25T19:29:55Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - VLAB: Enhancing Video Language Pre-training by Feature Adapting and
Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z) - VALUE: A Multi-Task Benchmark for Video-and-Language Understanding
Evaluation [124.02278735049235]
VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future work on more advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)