Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
- URL: http://arxiv.org/abs/2212.04979v1
- Date: Fri, 9 Dec 2022 16:39:09 GMT
- Title: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
- Authors: Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh,
Yonghui Wu, Jiahui Yu
- Abstract summary: We present VideoCoCa, which reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks.
VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results on zero-shot video classification.
Our approach establishes a simple and effective video-text baseline for future research.
- Score: 47.59597017035785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work explores an efficient approach to establish a foundational
video-text model for tasks including open-vocabulary video classification,
text-to-video retrieval, video captioning and video question-answering. We
present VideoCoCa that reuses a pretrained image-text contrastive captioner
(CoCa) model and adapts it to video-text tasks with minimal extra training.
While previous works adapt image-text models with various cross-frame fusion
modules (for example, cross-frame attention layer or perceiver resampler) and
finetune the modified architecture on video-text data, we surprisingly find
that the generative attentional pooling and contrastive attentional pooling
layers in the image-text CoCa design are instantly adaptable to "flattened
frame embeddings", yielding a strong zero-shot transfer baseline for many
video-text tasks. Specifically, the frozen image encoder of a pretrained
image-text CoCa takes each video frame as input and generates \(N\) token
embeddings per frame, for a total of \(T\) video frames. We flatten the \(N \times T\)
token embeddings as a long sequence of frozen video representation and apply
CoCa's generative attentional pooling and contrastive attentional pooling on
top. All model weights including pooling layers are directly loaded from an
image-text CoCa pretrained model. Without any video or video-text data,
VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art
results on zero-shot video classification on Kinetics 400/600/700, UCF101,
HMDB51, and Charades, as well as zero-shot text-to-video retrieval on MSR-VTT
and ActivityNet Captions. We also explore lightweight finetuning on top of
VideoCoCa, and achieve strong results on video question-answering (iVQA,
MSRVTT-QA, MSVD-QA) and video captioning (MSR-VTT, ActivityNet, Youcook2). Our
approach establishes a simple and effective video-text baseline for future
research.
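The zero-shot transfer recipe above amounts to running the frozen CoCa image encoder on each frame, flattening the per-frame tokens into one long sequence, and reusing CoCa's two attentional poolers unchanged. The following is a minimal PyTorch-style sketch of that data flow; module names, query counts, and dimensions (`image_encoder`, `generative_pooler`, `contrastive_pooler`, `d_model`, etc.) are illustrative placeholders, not the released implementation.

```python
# Minimal sketch of VideoCoCa's zero-shot adaptation of CoCa to video.
# All modules are placeholders; in VideoCoCa the encoder and poolers are
# loaded directly from an image-text CoCa checkpoint without retraining.
import torch
import torch.nn as nn


class AttentionalPooler(nn.Module):
    """Cross-attention from a small set of learned queries to a token sequence."""

    def __init__(self, d_model: int, n_queries: int, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model) -> (batch, n_queries, d_model)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled


class VideoCoCaSketch(nn.Module):
    def __init__(self, image_encoder: nn.Module, d_model: int = 768):
        super().__init__()
        self.image_encoder = image_encoder            # frozen CoCa ViT: (B, N, d_model) per image
        self.generative_pooler = AttentionalPooler(d_model, n_queries=256)
        self.contrastive_pooler = AttentionalPooler(d_model, n_queries=1)

    @torch.no_grad()
    def forward(self, video: torch.Tensor):
        # video: (B, T, C, H, W); encode each frame independently with the frozen encoder.
        b, t = video.shape[:2]
        frame_tokens = self.image_encoder(video.flatten(0, 1))       # (B*T, N, d)
        n, d = frame_tokens.shape[1:]
        # Flatten the N tokens per frame across T frames into one long sequence.
        video_tokens = frame_tokens.reshape(b, t * n, d)              # (B, N*T, d)
        captioner_tokens = self.generative_pooler(video_tokens)       # fed to CoCa's text decoder
        video_embedding = self.contrastive_pooler(video_tokens)[:, 0] # aligned with text embeddings
        return captioner_tokens, video_embedding
```

For zero-shot classification or retrieval, `video_embedding` would be compared by cosine similarity against CoCa text-encoder embeddings of class names or captions, exactly as in the image-text setting; the generative pooler output feeds the captioning/QA decoder.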
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
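A hedged sketch of that protocol as summarized above: caption sampled frames with an off-the-shelf image captioner, then use the resulting (video, pseudo caption) pairs as supervision for contrastive text-to-video retrieval training. The helper names (`image_captioner`, `video_encoder`, `text_encoder`) are assumed stand-ins, not the paper's code.

```python
# Sketch: turn unlabeled videos into retrieval training pairs via frame captioning.
import torch
import torch.nn.functional as F


def make_pseudo_pairs(videos, image_captioner, frames_per_video=4):
    """Caption a few uniformly sampled frames per video; the captions become pseudo labels."""
    pairs = []
    for video in videos:                               # video: (T, C, H, W)
        idx = torch.linspace(0, video.shape[0] - 1, frames_per_video).long()
        caption = " ".join(image_captioner(video[i]) for i in idx)
        pairs.append((video, caption))
    return pairs


def contrastive_step(video_batch, caption_batch, video_encoder, text_encoder, temperature=0.07):
    """Symmetric InfoNCE between videos and their pseudo captions."""
    v = F.normalize(video_encoder(video_batch), dim=-1)    # (B, d)
    t = F.normalize(text_encoder(caption_batch), dim=-1)   # (B, d)
    logits = v @ t.T / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```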
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks [6.925770576386087]
We propose a detailed study on the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting.
Our experiments show that image-text models exhibit impressive performance on video action recognition (AR), video retrieval (RT), and video multiple choice (MC).
These findings shed light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
arXiv Detail & Related papers (2023-10-07T20:57:54Z)
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
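As summarized, the transformation operates purely at the data level: several captioned images are stitched into one pseudo video clip whose "paragraph" concatenates their captions. A minimal sketch under that reading, with hypothetical field names rather than COSA's actual schema:

```python
# Sketch of COSA-style sample concatenation: image-text pairs -> pseudo
# video-paragraph samples.
import random
import torch


def concatenate_samples(image_text_pairs, clip_len=5):
    """Group `clip_len` (image, caption) pairs into one pseudo video-paragraph sample."""
    pairs = list(image_text_pairs)
    random.shuffle(pairs)
    pseudo_videos = []
    for i in range(0, len(pairs) - clip_len + 1, clip_len):
        group = pairs[i:i + clip_len]
        frames = torch.stack([img for img, _ in group])       # (clip_len, C, H, W) "video"
        paragraph = " ".join(cap for _, cap in group)          # matching "paragraph"
        pseudo_videos.append((frames, paragraph))
    return pseudo_videos
```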
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- TeViS: Translating Text Synopses to Video Storyboards [30.388090248346504]
We propose a new task called Text synopsis to Video Storyboard (TeViS).
It aims to retrieve an ordered sequence of images as the video storyboard to visualize the text synopsis.
Their model, VQ-Trans, first encodes the text synopsis and images into a joint embedding space and uses vector quantization (VQ) to improve the visual representation.
arXiv Detail & Related papers (2022-12-31T06:32:36Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
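The frame-selection idea in this summary can be sketched as scoring each frame against the paired text and keeping only the top-scoring frames before the contrastive loss; the names and the top-k rule below are assumptions, not FineCo's exact procedure.

```python
# Sketch of fine-grained frame selection: keep the frames most similar to the text.
# `frame_embs` and `text_emb` are assumed to come from pretrained encoders.
import torch
import torch.nn.functional as F


def select_relevant_frames(frame_embs: torch.Tensor, text_emb: torch.Tensor, k: int = 8):
    """frame_embs: (T, d), text_emb: (d,) -> indices of the k frames closest to the text."""
    sims = F.normalize(frame_embs, dim=-1) @ F.normalize(text_emb, dim=-1)   # (T,)
    topk = sims.topk(min(k, frame_embs.shape[0])).indices
    return topk.sort().values          # keep temporal order of the selected frames
```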
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
- VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding [13.640902299569008]
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding.
VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval.
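A rough sketch of the two ingredients named in this summary: positives are clips and captions whose timestamps overlap, and hard negatives are mined by retrieving each video's nearest neighbors in embedding space. The helper names are placeholders, not VideoCLIP's API.

```python
# Sketch of VideoCLIP-style pairing: temporally overlapping positives plus
# nearest-neighbor hard negatives.
import torch
import torch.nn.functional as F


def overlapping_positives(clip_spans, caption_spans):
    """Pair every clip with each caption whose (start, end) interval overlaps it."""
    pairs = []
    for ci, (cs, ce) in enumerate(clip_spans):
        for ti, (ts, te) in enumerate(caption_spans):
            if max(cs, ts) < min(ce, te):        # non-empty temporal overlap
                pairs.append((ci, ti))
    return pairs


def hard_negative_indices(video_embs: torch.Tensor, k: int = 4):
    """For each video, retrieve its k nearest other videos as hard negatives."""
    normed = F.normalize(video_embs, dim=-1)
    sims = normed @ normed.T
    sims.fill_diagonal_(-1.0)                    # exclude the video itself
    return sims.topk(min(k, sims.shape[0] - 1), dim=-1).indices
```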
arXiv Detail & Related papers (2021-09-28T23:01:51Z)
- Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [80.7397409377659]
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image and video text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
arXiv Detail & Related papers (2021-04-01T17:48:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.