Space-Time Crop & Attend: Improving Cross-modal Video Representation
Learning
- URL: http://arxiv.org/abs/2103.10211v1
- Date: Thu, 18 Mar 2021 12:32:24 GMT
- Title: Space-Time Crop & Attend: Improving Cross-modal Video Representation
Learning
- Authors: Mandela Patrick, Yuki M. Asano, Bernie Huang, Ishan Misra, Florian
Metze, Joao Henriques, Andrea Vedaldi
- Abstract summary: We show that spatial augmentations such as cropping work well for videos too, but that previous implementations could not do this at a scale sufficient for it to work well.
To address this issue, we first introduce Feature Crop, a method to simulate such augmentations much more efficiently directly in feature space.
Second, we show that, as opposed to naive average pooling, the use of transformer-based attention improves performance significantly.
- Score: 88.71867887257274
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quality of the image representations obtained from self-supervised
learning depends strongly on the type of data augmentations used in the
learning formulation. Recent papers have ported these methods from still images
to videos and found that leveraging both audio and video signals yields strong
gains; however, they did not find that spatial augmentations such as cropping,
which are very important for still images, work as well for videos. In this
paper, we improve these formulations in two ways unique to the spatio-temporal
aspect of videos. First, for space, we show that spatial augmentations such as
cropping do work well for videos too, but that previous implementations, due to
the high processing and memory cost, could not do this at a scale sufficient
for it to work well. To address this issue, we first introduce Feature Crop, a
method to simulate such augmentations much more efficiently directly in feature
space. Second, we show that as opposed to naive average pooling, the use of
transformer-based attention improves performance significantly, and is well
suited for processing feature crops. Combining both of our discoveries into a
new method, Space-time Crop & Attend (STiCA) we achieve state-of-the-art
performance across multiple video-representation learning benchmarks. In
particular, we achieve new state-of-the-art accuracies of 67.0% on HMDB-51 and
93.1% on UCF-101 when pre-training on Kinetics-400.
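  As a rough illustration of the two ideas described in the abstract (cropping directly in feature space and attention-based pooling instead of averaging), the following minimal PyTorch-style sketch shows how such a pipeline might look. The tensor shapes, the feature_crop helper, and the AttentionPool module are illustrative assumptions, not the authors' STiCA implementation.
```python
# Hypothetical sketch of the two ideas summarized above:
# (1) a "feature crop": take spatial crops from the feature map produced by a
#     video backbone, instead of cropping input pixels;
# (2) pool the cropped features with transformer-style attention rather than
#     naive average pooling.
# Shapes, names, and hyper-parameters are assumptions made for illustration.
import torch
import torch.nn as nn


def feature_crop(feats, crop_h, crop_w):
    """Randomly crop a (B, C, T, H, W) feature map along its spatial dims."""
    _, _, _, H, W = feats.shape
    top = torch.randint(0, H - crop_h + 1, (1,)).item()
    left = torch.randint(0, W - crop_w + 1, (1,)).item()
    return feats[:, :, :, top:top + crop_h, left:left + crop_w]


class AttentionPool(nn.Module):
    """Pool a set of feature tokens with a learned query and multi-head attention."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                 # tokens: (B, N, dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled.squeeze(1)               # (B, dim)


if __name__ == "__main__":
    B, C, T, H, W = 2, 512, 8, 14, 14          # e.g. output of a 3D-conv video backbone
    feats = torch.randn(B, C, T, H, W)

    crop = feature_crop(feats, crop_h=7, crop_w=7)    # cheap "crop" in feature space
    tokens = crop.flatten(2).transpose(1, 2)          # (B, T*7*7, C)

    avg_embedding = tokens.mean(dim=1)                # naive average-pooling baseline
    attn_embedding = AttentionPool(C)(tokens)         # attention-pooling alternative
    print(avg_embedding.shape, attn_embedding.shape)  # both torch.Size([2, 512])
```
  Cropping the backbone's feature map rather than the input pixels means many crops can be drawn from a single forward pass, which is the efficiency argument the abstract makes for Feature Crop.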
Related papers
- Time Does Tell: Self-Supervised Time-Tuning of Dense Image
Representations [79.87044240860466]
We propose a novel approach that incorporates temporal consistency in dense self-supervised learning.
Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos.
Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images.
arXiv Detail & Related papers (2023-08-22T21:28:58Z)
- Extending Temporal Data Augmentation for Video Action Recognition [1.3807859854345832]
We propose novel techniques to strengthen the relationship between the spatial and temporal domains.
The video action recognition results of our techniques outperform their respective variants in Top-1 and Top-5 settings on the UCF-101 and the HMDB-51 datasets.
arXiv Detail & Related papers (2022-11-09T13:49:38Z)
- Expanding Language-Image Pretrained Models for General Video Recognition [136.0948049010682]
Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data.
We present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly.
Our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols.
arXiv Detail & Related papers (2022-08-04T17:59:54Z)
- Learn2Augment: Learning to Composite Videos for Data Augmentation in Action Recognition [47.470845728457135]
We learn what makes a good video for action recognition and select only high-quality samples for augmentation.
We learn which pairs of videos to augment without having to actually composite them.
We see improvements of up to 8.6% in the semi-supervised setting.
arXiv Detail & Related papers (2022-06-09T23:04:52Z)
- BEVT: BERT Pretraining of Video Transformers [89.08460834954161]
We introduce BEVT which decouples video representation learning into spatial representation learning and temporal dynamics learning.
We conduct extensive experiments on three challenging video benchmarks where BEVT achieves very promising results.
arXiv Detail & Related papers (2021-12-02T18:59:59Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task to well model both motion and appearance features.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.