End-to-end Generative Pretraining for Multimodal Video Captioning
- URL: http://arxiv.org/abs/2201.08264v1
- Date: Thu, 20 Jan 2022 16:16:21 GMT
- Title: End-to-end Generative Pretraining for Multimodal Video Captioning
- Authors: Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid
- Abstract summary: We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
- Score: 82.79187814057313
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent video and language pretraining frameworks lack the ability to generate
sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new
pretraining framework for learning from unlabelled videos which can be
effectively used for generative tasks such as multimodal video captioning.
Unlike recent video-language pretraining frameworks, our framework trains both
a multimodal video encoder and a sentence decoder jointly. To overcome the lack
of captions in unlabelled videos, we leverage the future utterance as an
additional text source and propose a bidirectional generation objective -- we
generate future utterances given the present multimodal context, and also the
present utterance given future observations. With this objective, we train an
encoder-decoder model end-to-end to generate a caption from raw pixels and
transcribed speech directly. Our model achieves state-of-the-art performance
for multimodal video captioning on four standard benchmarks, as well as for
other video understanding tasks such as VideoQA, video retrieval and action
classification.
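As a rough illustration of the bidirectional generation objective described above, the sketch below pairs a forward loss (future utterance generated from the present frames and transcript) with a backward loss (present utterance generated from the future observations). It is a minimal sketch assuming a generic PyTorch Transformer encoder-decoder; the module names, feature dimensions, and layer counts are illustrative assumptions, not the released MV-GPT architecture.

```python
# Minimal sketch of a bidirectional generation objective, assuming a generic
# PyTorch Transformer encoder-decoder. Dimensions, layer counts, and the frame
# feature size are illustrative assumptions, not the MV-GPT implementation.
import torch
import torch.nn as nn

class BidirectionalGenPretrainer(nn.Module):
    def __init__(self, vocab_size=30522, d_model=512, frame_dim=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.frame_proj = nn.Linear(frame_dim, d_model)   # project visual features to model dim
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.ce = nn.CrossEntropyLoss()

    def _generation_loss(self, frames, transcript_ids, target_ids):
        """Encode (frame features + transcript tokens), then decode the target utterance."""
        context = torch.cat([self.frame_proj(frames), self.token_emb(transcript_ids)], dim=1)
        memory = self.encoder(context)
        dec_in = self.token_emb(target_ids[:, :-1])        # teacher forcing: shift target right
        L = dec_in.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"), device=dec_in.device), diagonal=1)
        logits = self.lm_head(self.decoder(dec_in, memory, tgt_mask=causal))
        return self.ce(logits.reshape(-1, logits.size(-1)), target_ids[:, 1:].reshape(-1))

    def forward(self, present_frames, present_utt, future_frames, future_utt):
        # Forward direction: generate the future utterance from the present context.
        fwd = self._generation_loss(present_frames, present_utt, future_utt)
        # Backward direction: generate the present utterance from the future observations.
        bwd = self._generation_loss(future_frames, future_utt, present_utt)
        return fwd + bwd
```

Because the same encoder and decoder handle both directions, a decoder pretrained this way can be reused directly when fine-tuning for captioning.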
Related papers
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and other video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its spatiotemporal dynamics.
In this paper, we address this limitation in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user.
Our method does not require a large-scale video dataset, since it uses a pre-trained text-to-video generative model without fine-tuning.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z) - Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use image-language models to translate the video content into frame captions and object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content; a minimal sketch of this prompt-composition step appears after the related-paper entries below.
arXiv Detail & Related papers (2022-05-22T05:18:27Z) - Masking Modalities for Cross-modal Video Retrieval [93.10669981708878]
A common strategy for pre-training video encoders is to use the accompanying speech as weak supervision.
We propose to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.
We show the superior performance of our "modality masking" pre-training approach for video retrieval on the How2R, YouCook2 and Condensed Movies datasets.
arXiv Detail & Related papers (2021-11-01T23:55:04Z) - Open-book Video Captioning with Retrieve-Copy-Generate Network [42.374461018847114]
In this paper, we convert the traditional video captioning task into a new paradigm, i.e., Open-book Video Captioning.
We propose a novel Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to retrieve sentences as hints from the training corpus effectively.
Our framework coordinates conventional retrieval-based methods with the orthodox encoder-decoder paradigm, so it can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate descriptions of the video.
arXiv Detail & Related papers (2021-03-09T08:17:17Z) - UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone.
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training process of UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
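The VidIL entry above describes a prompting recipe: image-language models turn the video into frame captions and object, attribute, and event phrases, and a language model is then prompted with a few solved examples. The sketch below illustrates only that prompt-composition step; the field names and example texts are hypothetical, and the final call to a language model is left out.

```python
# Hedged sketch of VidIL-style prompt composition: frame captions and phrases
# are flattened into a few-shot text prompt for a language model. Field names
# and example strings are hypothetical, not the authors' implementation.
from dataclasses import dataclass
from typing import List

@dataclass
class VideoDescription:
    frame_captions: List[str]
    objects: List[str]
    attributes: List[str]
    events: List[str]
    target: str = ""          # filled in for the in-context examples only

def render(desc: VideoDescription) -> str:
    """Turn one video's image-level descriptions into a text block."""
    return (
        "Frames: " + " | ".join(desc.frame_captions) + "\n"
        "Objects: " + ", ".join(desc.objects) + "\n"
        "Attributes: " + ", ".join(desc.attributes) + "\n"
        "Events: " + ", ".join(desc.events) + "\n"
        "Caption: " + desc.target
    )

def compose_prompt(examples: List[VideoDescription], query: VideoDescription) -> str:
    """Few-shot prompt: solved examples first, then the query with an empty caption."""
    blocks = [render(e) for e in examples] + [render(query)]
    return "\n\n".join(blocks)

if __name__ == "__main__":
    shot = VideoDescription(["a man holds a pan", "eggs sizzle in a pan"],
                            ["man", "pan", "eggs"], ["hot", "yellow"],
                            ["cracking eggs"], target="a man fries eggs in a pan")
    query = VideoDescription(["a woman kneads dough", "dough rises in a bowl"],
                             ["woman", "dough", "bowl"], ["soft"], ["kneading dough"])
    print(compose_prompt([shot], query))  # send this text to any instruction-following LM
```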