iBoot: Image-bootstrapped Self-Supervised Video Representation Learning
- URL: http://arxiv.org/abs/2206.08339v1
- Date: Thu, 16 Jun 2022 17:42:48 GMT
- Title: iBoot: Image-bootstrapped Self-Supervised Video Representation Learning
- Authors: Fatemeh Saleh, Fuwen Tan, Adrian Bulat, Georgios Tzimiropoulos, and
Brais Martinez
- Abstract summary: Video self-supervised learning (SSL) suffers from added challenges: video datasets are typically not as large as image datasets.
We propose to utilize a strong image-based model, pre-trained with self- or language supervision, in a video representation learning framework.
The proposed algorithm is shown to learn much more efficiently, i.e., in fewer epochs and with a smaller batch size.
- Score: 45.845595749486215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning visual representations through self-supervision is an extremely
challenging task as the network needs to sieve relevant patterns from spurious
distractors without the active guidance provided by supervision. This is
achieved through heavy data augmentation, large-scale datasets and prohibitive
amounts of compute. Video self-supervised learning (SSL) suffers from added
challenges: video datasets are typically not as large as image datasets,
compute is an order of magnitude larger, and the amount of spurious patterns
the optimizer has to sieve through is multiplied several fold. Thus, directly
learning self-supervised representations from video data might result in
sub-optimal performance. To address this, we propose to utilize a strong
image-based model, pre-trained with self- or language supervision, in a video
representation learning framework, enabling the model to learn strong spatial
and temporal information without relying on labeled video data. To this
end, we modify the typical video-based SSL design and objective to encourage
the video encoder to *subsume* the semantic content of an image-based
model trained on a general domain. The proposed algorithm is shown to learn
much more efficiently (i.e., in fewer epochs and with a smaller batch size) and
results in a new state-of-the-art performance on standard downstream tasks
among single-modality SSL methods.
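As a rough illustration of the objective described in the abstract, the following minimal PyTorch sketch trains a video encoder so that its clip embedding regresses toward the features of a frozen, image-pretrained teacher. The module names, projection head, temporal averaging, and cosine loss are illustrative assumptions, not the paper's exact architecture or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBootStyleObjective(nn.Module):
    """Hypothetical distillation-style objective; names and loss are assumptions."""
    def __init__(self, video_encoder: nn.Module, image_teacher: nn.Module, dim: int = 768):
        super().__init__()
        self.video_encoder = video_encoder         # trainable student
        self.image_teacher = image_teacher.eval()  # frozen image-pretrained model
        for p in self.image_teacher.parameters():
            p.requires_grad_(False)
        self.proj = nn.Linear(dim, dim)            # map student into teacher space

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, C, H, W); the teacher embeds individual frames.
        B, T = clip.shape[:2]
        with torch.no_grad():
            frame_feats = self.image_teacher(clip.flatten(0, 1))  # (B*T, D)
            target = frame_feats.view(B, T, -1).mean(dim=1)       # average over time
        student = self.proj(self.video_encoder(clip))             # (B, D)
        # Pull the clip embedding toward the teacher's frame-level semantics.
        return 1.0 - F.cosine_similarity(student, target, dim=-1).mean()
```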
Related papers
- Time Does Tell: Self-Supervised Time-Tuning of Dense Image
Representations [79.87044240860466]
We propose a novel approach that incorporates temporal consistency in dense self-supervised learning.
Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos.
Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images.
arXiv Detail & Related papers (2023-08-22T21:28:58Z)
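A rough sketch of the time-tuning loss summarized above, assuming a prototype-based clustering formulation: dense features from one frame produce soft cluster assignments that serve as targets for the corresponding features of a later frame. The prototypes and cross-entropy alignment are assumptions, not the paper's exact clustering objective.

```python
import torch
import torch.nn.functional as F

def temporal_alignment_loss(feats_t: torch.Tensor, feats_t1: torch.Tensor,
                            prototypes: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """feats_t, feats_t1: (N, D) dense features from two frames of one video,
    assumed spatially corresponding; prototypes: (K, D) learnable centers."""
    protos = F.normalize(prototypes, dim=-1)
    logits_t = F.normalize(feats_t, dim=-1) @ protos.t() / tau    # (N, K)
    logits_t1 = F.normalize(feats_t1, dim=-1) @ protos.t() / tau
    # Frame t's soft cluster assignments act as targets for frame t+1.
    targets = F.softmax(logits_t, dim=-1).detach()
    return -(targets * F.log_softmax(logits_t1, dim=-1)).sum(dim=-1).mean()
```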
- GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models such as CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
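A hedged sketch of the paradigm summarized above: detailed image descriptions pass through a frozen text encoder, and the vision model being trained is nudged to align its features with the resulting embeddings. The projection head and cosine alignment loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def description_alignment_loss(image_feats: torch.Tensor, descriptions: list[str],
                               text_encoder, proj: nn.Module) -> torch.Tensor:
    """image_feats: (B, D_img) from the CNN/ViT being trained;
    text_encoder: frozen model mapping strings to (B, D_txt) embeddings."""
    with torch.no_grad():
        text_emb = text_encoder(descriptions)      # rich semantic targets
    pred = proj(image_feats)                       # project into text space
    return 1.0 - F.cosine_similarity(pred, text_emb, dim=-1).mean()
```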
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training open a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
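A minimal sketch of the frozen-backbone recipe this entry describes: per-frame features come from a frozen CLIP image encoder, and only a lightweight temporal head is trained. The single transformer layer and mean pooling below are assumptions, not the paper's exact head.

```python
import torch
import torch.nn as nn

class FrozenClipVideoHead(nn.Module):
    def __init__(self, clip_image_encoder: nn.Module, dim: int = 512,
                 num_classes: int = 400):
        super().__init__()
        self.backbone = clip_image_encoder.eval()  # frozen CLIP image tower
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.temporal = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                   batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, C, H, W); gradients flow only through the small head.
        B, T = clip.shape[:2]
        with torch.no_grad():
            feats = self.backbone(clip.flatten(0, 1)).view(B, T, -1)
        feats = self.temporal(feats)               # fuse information across time
        return self.classifier(feats.mean(dim=1))  # clip-level logits
```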
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [80.7397409377659]
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image and video text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
arXiv Detail & Related papers (2021-04-01T17:48:27Z)
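To make the joint image/video training in the entry above concrete, here is a hedged sketch under the common interpretation that an image is treated as a single-frame video, so image-text and video-text batches share one encoder and one symmetric contrastive loss. Function names and the loss details are assumptions.

```python
import torch
import torch.nn.functional as F

def embed_visual(video_encoder, x: torch.Tensor) -> torch.Tensor:
    """Accepts images (B, C, H, W) or clips (B, T, C, H, W)."""
    if x.dim() == 4:
        x = x.unsqueeze(1)            # treat an image as a one-frame video
    return F.normalize(video_encoder(x), dim=-1)

def retrieval_loss(vis_emb: torch.Tensor, txt_emb: torch.Tensor,
                   tau: float = 0.05) -> torch.Tensor:
    logits = vis_emb @ txt_emb.t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE over visual-to-text and text-to-visual directions.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```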
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence, and show that self-supervised models trained on our data, despite it being automatically constructed, achieve downstream performance similar to that of models trained on existing video datasets of comparable scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
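The paper above formulates curation as subset optimization; as a loose stand-in, the sketch below simply keeps the clips whose audio-visual correspondence scores (from some pretrained scorer, assumed here) are highest.

```python
import heapq

def curate_subset(clip_scores: dict[str, float], budget: int) -> list[str]:
    """clip_scores: clip id -> audio-visual correspondence score.
    Returns the ids of the `budget` highest-scoring clips."""
    return heapq.nlargest(budget, clip_scores, key=clip_scores.get)
```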
- Representation Learning with Video Deep InfoMax [26.692717942430185]
We extend DeepInfoMax to the video domain by leveraging similar structure in spatio-temporal networks.
We find that drawing views from both natural-rate sequences and temporally-downsampled sequences yields strong results on Kinetics-pretrained action recognition tasks.
arXiv Detail & Related papers (2020-07-27T02:28:47Z)
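A small sketch of the two-view construction mentioned above: one view keeps the natural frame rate, the other is temporally downsampled, and a standard InfoNCE loss pulls their embeddings together. Only the two-view idea comes from the summary; the loss and stride are assumptions.

```python
import torch
import torch.nn.functional as F

def two_temporal_views(clip: torch.Tensor, stride: int = 2):
    """clip: (B, T, C, H, W) -> (natural-rate view, temporally-downsampled view)."""
    return clip, clip[:, ::stride]

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                     # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)         # matched pairs are positives
```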
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.