VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
- URL: http://arxiv.org/abs/2305.03204v1
- Date: Thu, 4 May 2023 23:27:21 GMT
- Title: VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
- Authors: Xilun Chen, Lili Yu, Wenhan Xiong, Barlas Oğuz, Yashar Mehdad,
Wen-tau Yih
- Abstract summary: We propose a new two-stage pre-training framework for video-to-text generation tasks such as video captioning and video question answering.
A generative encoder-decoder model is first jointly pre-trained on massive image-text data to learn fundamental vision-language concepts, and then adapted to video data in an intermediate video-text pre-training stage.
As a result, our VideoOFA model achieves new state-of-the-art performance on four Video Captioning benchmarks.
- Score: 43.90887811621963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new two-stage pre-training framework for video-to-text
generation tasks such as video captioning and video question answering: A
generative encoder-decoder model is first jointly pre-trained on massive
image-text data to learn fundamental vision-language concepts, and then adapted
to video data in an intermediate video-text pre-training stage to learn
video-specific skills such as spatio-temporal reasoning. As a result, our
VideoOFA model achieves new state-of-the-art performance on four Video
Captioning benchmarks, beating prior art by an average of 9.7 points in CIDEr
score. It also outperforms existing models on two open-ended Video Question
Answering datasets, showcasing its generalization capability as a universal
video-to-text model.
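To make the two-stage recipe concrete, below is a minimal, hypothetical PyTorch sketch of the training schedule described in the abstract: the same generative encoder-decoder is first optimized on image-text pairs (treated as single-frame inputs) and then adapted on multi-frame video-text pairs. Module sizes, the frame-pooling choice, and the synthetic data are illustrative placeholders, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code) of a VideoOFA-style two-stage schedule:
# stage 1 pre-trains a generative encoder-decoder on image-text pairs; stage 2 adapts
# the same weights on video-text pairs by encoding sampled frames and pooling them.
import torch
import torch.nn as nn

VOCAB, DIM, PATCHES = 1000, 256, 49   # toy vocabulary, hidden size, patches per frame


class ToyEncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_proj = nn.Linear(768, DIM)               # stand-in for a visual backbone
        self.decoder = nn.GRU(DIM, DIM, batch_first=True)   # stand-in for a text decoder
        self.embed = nn.Embedding(VOCAB, DIM)
        self.lm_head = nn.Linear(DIM, VOCAB)

    def forward(self, frames, captions):
        # frames: (batch, num_frames, PATCHES, 768); captions: (batch, seq_len)
        vis = self.patch_proj(frames).mean(dim=(1, 2))      # pool frames and patches -> (batch, DIM)
        tok = self.embed(captions)
        out, _ = self.decoder(tok, vis.unsqueeze(0))        # condition the decoder on the visual state
        return self.lm_head(out)


def train_stage(model, num_frames, steps, lr):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        frames = torch.randn(4, num_frames, PATCHES, 768)   # synthetic visual features
        captions = torch.randint(0, VOCAB, (4, 12))         # synthetic caption tokens
        logits = model(frames, captions[:, :-1])            # teacher-forced caption generation
        loss = loss_fn(logits.reshape(-1, VOCAB), captions[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()


model = ToyEncoderDecoder()
train_stage(model, num_frames=1, steps=5, lr=1e-4)   # stage 1: image-text (single "frame")
train_stage(model, num_frames=8, steps=5, lr=5e-5)   # stage 2: intermediate video-text adaptation
```

The key point the sketch illustrates is that both stages share one set of generative encoder-decoder weights; only the input (single image vs. sampled video frames) and the data mixture change between stages.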
Related papers
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - Analyzing Zero-Shot Abilities of Vision-Language Models on Video
Understanding Tasks [6.925770576386087]
We propose a detailed study on the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting.
Our experiments show that image-text models exhibit impressive performance on video AR, video RT, and video MC.
These findings shed light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
arXiv Detail & Related papers (2023-10-07T20:57:54Z) - VLAB: Enhancing Video Language Pre-training by Feature Adapting and
Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z) - InternVideo: General Video Foundation Models via Generative and
Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as its pretraining objectives (a generic sketch of such a contrastive objective follows this list).
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z) - End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z) - CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [13.270902407320005]
We present the CLIP2Video network to transfer an image-language pre-training model to video-text retrieval in an end-to-end manner.
We conduct thorough ablation studies, and achieve state-of-the-art performance on text-to-video and video-to-text retrieval benchmarks.
arXiv Detail & Related papers (2021-06-21T13:30:33Z) - Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [80.7397409377659]
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image and video text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
arXiv Detail & Related papers (2021-04-01T17:48:27Z)
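Several of the entries above (InternVideo, CLIP2Video, Frozen in Time) build on a CLIP-style video-text contrastive objective. The snippet below is a minimal, generic sketch of that symmetric InfoNCE loss using placeholder embeddings; it is not taken from any of the listed papers, and the encoders producing the embeddings are assumed rather than shown.

```python
# Hypothetical sketch of a symmetric video-text contrastive (InfoNCE) loss of the kind
# used by CLIP-derived video models. Embeddings below are random placeholders standing
# in for pooled clip and caption encoder outputs.
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim) pooled clip and caption embeddings
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0))       # matched pairs lie on the diagonal
    # symmetric cross-entropy over video->text and text->video retrieval directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# toy usage with random embeddings in place of real encoder outputs
video_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(clip_style_loss(video_emb, text_emb).item())
```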