Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
- URL: http://arxiv.org/abs/2104.00650v1
- Date: Thu, 1 Apr 2021 17:48:27 GMT
- Title: Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
- Authors: Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
- Abstract summary: We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image and video text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
- Score: 80.7397409377659
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Our objective in this work is video-text retrieval - in particular a joint
embedding that enables efficient text-to-video retrieval. The challenges in
this area include the design of the visual architecture and the nature of the
training data, in that the available large scale video-text training datasets,
such as HowTo100M, are noisy and hence competitive performance is achieved only
at scale through large amounts of compute. We address both these challenges in
this paper. We propose an end-to-end trainable model that is designed to take
advantage of both large-scale image and video captioning datasets. Our model is
an adaptation and extension of the recent ViT and Timesformer architectures,
and consists of attention in both space and time. The model is flexible and can
be trained on both image and video text datasets, either independently or in
conjunction. It is trained with a curriculum learning schedule that begins by
treating images as 'frozen' snapshots of video, and then gradually learns to
attend to increasing temporal context when trained on video datasets. We also
provide a new video-text pretraining dataset WebVid-2M, comprised of over two
million videos with weak captions scraped from the internet. Despite training
on datasets that are an order of magnitude smaller, we show that this approach
yields state-of-the-art results on standard downstream video-retrieval
benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.
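As a rough illustration of the idea described in the abstract above (not the authors' released code), the sketch below shows how a single space-time transformer can treat an image as a one-frame 'frozen' clip and be trained with a symmetric contrastive loss against text embeddings, with the number of sampled frames grown over a curriculum. All module names, dimensions, and the frame schedule are illustrative assumptions, and the per-frame spatial attention followed by temporal attention is a simplification rather than the paper's exact divided space-time attention.

```python
# Minimal sketch (assumptions, not the paper's implementation) of training a
# joint space-time encoder on both image-text and video-text pairs: an image is
# passed in as a single-frame "frozen" clip, so one encoder handles both.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpaceTimeEncoder(nn.Module):
    """ViT-style patch embedding + spatial attention per frame, then temporal attention."""

    def __init__(self, dim=256, patch=16, img_size=224, depth=4, heads=8, max_frames=32):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.spatial_pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.temporal_pos = nn.Parameter(torch.zeros(1, max_frames, dim))
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)

    def forward(self, frames):                 # frames: (B, T, 3, H, W); T == 1 for images
        b, t = frames.shape[:2]
        x = self.patch_embed(frames.flatten(0, 1))           # (B*T, dim, h, w)
        x = x.flatten(2).transpose(1, 2) + self.spatial_pos  # (B*T, patches, dim)
        x = self.spatial(x).mean(dim=1).view(b, t, -1)       # one embedding per frame
        x = x + self.temporal_pos[:, :t]
        return F.normalize(self.temporal(x).mean(dim=1), dim=-1)  # (B, dim)


def contrastive_loss(video_emb, text_emb, temperature=0.05):
    """Symmetric InfoNCE over a batch of matched video/text (or image/caption) pairs."""
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


# Curriculum (illustrative schedule): start from single-frame "frozen" clips,
# i.e. plain images, then sample progressively more frames per video clip.
frames_per_clip_schedule = [1, 1, 2, 4, 8]
```

Because an image enters the encoder as a clip with T = 1, image-text and video-text batches can share the same weights; sampling more frames later in training mirrors the curriculum schedule described in the abstract.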
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of image-text pairs.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its temporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated and unpaired data, which during training uses only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-12T09:08:27Z)
- iBoot: Image-bootstrapped Self-Supervised Video Representation Learning [45.845595749486215]
Video self-supervised learning (SSL) suffers from added challenges: video datasets are typically not as large as image datasets.
We propose to utilize a strong image-based model, pre-trained with self- or language supervision, in a video representation learning framework.
The proposed algorithm is shown to learn much more efficiently, in fewer epochs and with smaller batch sizes.
arXiv Detail & Related papers (2022-06-16T17:42:48Z)
- Learning Audio-Video Modalities from Image Captions [62.772232865072745]
A major challenge in text-video and text-audio retrieval is the lack of large-scale training data.
We propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.
We show that training a multimodal transformer-based model on this data achieves competitive performance on video retrieval and video captioning, matching or even outperforming HowTo100M pretraining with 20x fewer clips.
arXiv Detail & Related papers (2022-04-01T19:48:18Z)
- CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language, enforcing the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z)
- CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [13.270902407320005]
We present the CLIP2Video network to transfer an image-language pre-training model to video-text retrieval in an end-to-end manner.
We conduct thorough ablation studies, and achieve state-of-the-art performance on text-to-video and video-to-text retrieval benchmarks.
arXiv Detail & Related papers (2021-06-21T13:30:33Z)
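Both the joint embedding of the main paper and the CLIP-based approaches listed above (CLIP4Caption, CLIP2Video) reduce text-to-video retrieval at inference time to a nearest-neighbour search in a shared embedding space. The short sketch below illustrates that ranking step only, assuming per-frame and text embeddings have already been produced by some dual encoder; the mean pooling and cosine ranking here are common simplifications, not any specific paper's recipe.

```python
# Illustrative ranking step for text-to-video retrieval in a joint embedding
# space; embeddings are assumed to come from any compatible dual encoder.
import torch
import torch.nn.functional as F


def pool_video(frame_embs: torch.Tensor) -> torch.Tensor:
    """(T, dim) per-frame embeddings -> single L2-normalised (dim,) video embedding."""
    return F.normalize(F.normalize(frame_embs, dim=-1).mean(dim=0), dim=-1)


def rank_videos(text_emb: torch.Tensor, video_embs: torch.Tensor) -> torch.Tensor:
    """text_emb: (dim,), video_embs: (N, dim); returns video indices, most similar first."""
    sims = F.normalize(video_embs, dim=-1) @ F.normalize(text_emb, dim=-1)
    return torch.argsort(sims, descending=True)
```

Because videos can be embedded once offline, a text query only needs a single matrix-vector product at query time, which is what makes the joint-embedding formulation efficient for retrieval.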
This list is automatically generated from the titles and abstracts of the papers in this site.