Scalable and Accurate Self-supervised Multimodal Representation Learning
without Aligned Video and Text Data
- URL: http://arxiv.org/abs/2304.02080v1
- Date: Tue, 4 Apr 2023 19:11:05 GMT
- Title: Scalable and Accurate Self-supervised Multimodal Representation Learning
without Aligned Video and Text Data
- Authors: Vladislav Lialin, Stephen Rawls, David Chan, Shalini Ghosh, Anna
Rumshisky, Wael Hamza
- Abstract summary: Recent advances in image captioning allow us to pre-train high-quality video models without parallel video-text data.
We show that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions.
- Score: 18.479220305684837
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling up weakly-supervised datasets has been shown to be highly
effective in the image-text domain and has contributed to most of the recent
state-of-the-art computer vision and multimodal neural networks. However,
existing large-scale video-text datasets and mining techniques suffer from
several limitations, such as the scarcity of aligned data, its lack of
diversity, and the difficulty of collecting it. The currently popular
video-text data mining approach via automatic speech recognition (ASR), used
in HowTo100M, provides low-quality captions that often do not refer to the
video content. Other mining
approaches do not provide proper language descriptions (video tags) and are
biased toward short clips (alt text). In this work, we show how recent advances
in image captioning allow us to pre-train high-quality video models without any
parallel video-text data. We pre-train several video captioning models that are
based on an OPT language model and a TimeSformer visual backbone. We fine-tune
these networks on several video captioning datasets. First, we demonstrate that
image captioning pseudolabels work better for pre-training than the existing
HowTo100M ASR captions. Second, we show that pre-training on both images and
videos produces a significantly better network (+4 CIDEr on MSR-VTT) than
pre-training on a single modality. Our methods are complementary to the
existing pre-training or data mining approaches and can be used in a variety of
settings. Given the efficacy of the pseudolabeling method, we are planning to
publicly release the generated captions.
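To make the pseudolabeling idea concrete, below is a minimal sketch of how
frame-level image-captioning pseudolabels could be generated for unlabeled
clips. This is not the authors' exact pipeline: the off-the-shelf BLIP
captioner, uniform frame sampling via OpenCV, and the number of sampled frames
are illustrative assumptions; the paper trains its own captioning models built
on an OPT language model and a TimeSformer visual backbone.

    # Minimal pseudolabeling sketch (illustrative assumptions, not the paper's pipeline):
    # sample a few frames per unlabeled clip, caption each frame with an off-the-shelf
    # image captioner, and keep the (clip, captions) pairs as weak video-text supervision.
    import cv2
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    MODEL = "Salesforce/blip-image-captioning-base"  # stand-in captioner (assumption)
    processor = BlipProcessor.from_pretrained(MODEL)
    captioner = BlipForConditionalGeneration.from_pretrained(MODEL)

    def sample_frames(video_path, num_frames=4):
        """Uniformly sample num_frames RGB frames from a video file."""
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for i in range(num_frames):
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
            ok, frame = cap.read()
            if ok:
                frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        cap.release()
        return frames

    def pseudolabel_clip(video_path):
        """Return one generated caption per sampled frame; these act as pseudolabels."""
        captions = []
        for frame in sample_frames(video_path):
            inputs = processor(images=frame, return_tensors="pt")
            out = captioner.generate(**inputs, max_new_tokens=30)
            captions.append(processor.decode(out[0], skip_special_tokens=True))
        return captions

    # The resulting (video, caption) pairs could then serve as pre-training targets for
    # a video captioner such as the TimeSformer encoder + OPT decoder used in the paper.
    pairs = {"clip_0001.mp4": pseudolabel_clip("clip_0001.mp4")}

In a real setting such captions would be generated at scale over an unlabeled
video corpus and could be filtered before pre-training; those steps are
omitted here.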
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to images labeled with text captions.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that takes into account longer subtitle text, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z) - Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z) - MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action
Recognition with Language Knowledge [35.45809761628721]
Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities.
We propose an unsupervised approach to fine-tuning such VL models on video data for best zero-shot action recognition performance.
Our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks.
arXiv Detail & Related papers (2023-03-15T20:17:41Z) - Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense
Video Captioning [93.6842670770983]
Vid2Seq is a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.
We show that it is possible to leverage unlabeled narrated videos for dense video captioning, by reformulating sentence boundaries of transcribed speech as pseudo event boundaries.
The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks.
arXiv Detail & Related papers (2023-02-27T19:53:49Z) - Learning Audio-Video Modalities from Image Captions [62.772232865072745]
A major challenge in text-video and text-audio retrieval is the lack of large-scale training data.
We propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.
We show that training a multimodal transformer-based model on this data achieves competitive performance on video retrieval and video captioning, matching or even outperforming HowTo100M pretraining with 20x fewer clips.
arXiv Detail & Related papers (2022-04-01T19:48:18Z) - CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language and forces the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z)