PreViTS: Contrastive Pretraining with Video Tracking Supervision
- URL: http://arxiv.org/abs/2112.00804v1
- Date: Wed, 1 Dec 2021 19:49:57 GMT
- Title: PreViTS: Contrastive Pretraining with Video Tracking Supervision
- Authors: Brian Chen, Ramprasaath R. Selvaraju, Shih-Fu Chang, Juan Carlos
Niebles, and Nikhil Naik
- Abstract summary: PreViTS is an unsupervised SSL framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
- Score: 53.73237606312024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Videos are a rich source for self-supervised learning (SSL) of visual
representations due to the presence of natural temporal transformations of
objects. However, current methods typically randomly sample video clips for
learning, which results in a poor supervisory signal. In this work, we propose
PreViTS, an SSL framework that utilizes an unsupervised tracking signal for
selecting clips containing the same object, which helps better utilize temporal
transformations of objects. PreViTS further uses the tracking signal to
spatially constrain the frame regions to learn from and trains the model to
locate meaningful objects by providing supervision on Grad-CAM attention maps.
To evaluate our approach, we train a momentum contrastive (MoCo) encoder on
VGG-Sound and Kinetics-400 datasets with PreViTS. Training with PreViTS
outperforms representations learnt by MoCo alone on both image recognition and
video classification downstream tasks, obtaining state-of-the-art performance
on action classification. PreViTS helps learn feature representations that are
more robust to changes in background and context, as seen by experiments on
image and video datasets with background changes. Learning from large-scale
uncurated videos with PreViTS could lead to more accurate and robust visual
feature representations.
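As a rough illustration of the two training signals described in the abstract (a MoCo-style contrastive objective over clips of the same tracked object, plus supervision on Grad-CAM attention maps), below is a minimal PyTorch sketch. The function names, the exact form of the attention penalty, and how the tracking mask is obtained are assumptions for illustration, not the paper's implementation.
```python
# Minimal sketch of two PreViTS-style training signals, assuming the caller
# already has (a) embeddings of two clips cropped around the same tracked
# object and (b) a binary tracking mask resized to the conv feature resolution.
# The exact loss formulation here is illustrative, not the paper's.

import torch
import torch.nn.functional as F


def info_nce(q, k_pos, queue, temperature=0.07):
    """MoCo-style InfoNCE: q, k_pos are (B, D) embeddings; queue is (K, D) negatives."""
    q = F.normalize(q, dim=1)
    k_pos = F.normalize(k_pos, dim=1)
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)          # (B, 1) positive logits
    l_neg = q @ F.normalize(queue, dim=1).t()             # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)


def grad_cam_attention_loss(feat_maps, similarity, object_mask):
    """Supervise a Grad-CAM-style attention map with the unsupervised tracking mask.

    feat_maps:   (B, C, H, W) query-encoder conv features; must be on the
                 autograd graph that produced `similarity`.
    similarity:  (B,) per-sample query/key similarity used as the Grad-CAM target.
    object_mask: (B, H, W) binary tracked-object mask resized to (H, W).
    """
    grads = torch.autograd.grad(similarity.sum(), feat_maps, create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)         # per-channel importance
    cam = F.relu((weights * feat_maps).sum(dim=1))         # (B, H, W) attention map
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-6)
    # One simple choice of supervision: penalize attention mass that falls
    # outside the tracked-object region (a proxy for "attend to the object").
    return (cam * (1.0 - object_mask)).mean()
```
A full training step would combine the two terms, e.g. `info_nce(...) + lam * grad_cam_attention_loss(...)`, with the key encoder updated by momentum as in MoCo; the weighting `lam` is a hypothetical hyperparameter, not a value from the paper.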
Related papers
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning [0.0]
We study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess the quality of the learned representations.
Our results show that all methods are effective in learning useful representations and avoiding representational collapse.
The encoder pretrained with the temporal order verification task shows the best results across all experiments.
arXiv Detail & Related papers (2022-09-22T10:18:59Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos [23.64091569954785]
We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation.
By using sparse depth signals obtained from LiDAR, SAVi++ is able to learn emergent object segmentation and tracking from videos in the real-world Waymo Open dataset.
arXiv Detail & Related papers (2022-06-15T18:57:07Z)
- Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer [3.158346511479111]
We propose a simple but still effective self-supervised learning (SSL) strategy to train Vision Transformers (ViTs).
We define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly during the downstream training.
Our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step.
arXiv Detail & Related papers (2022-06-01T13:25:32Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
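The Long-Short Temporal Contrastive Learning entry above pairs each short clip with a longer clip from the same video as its positive. A minimal sketch of such a loss, assuming embeddings for both views are already computed and other videos in the batch serve as negatives (the paper's actual objective and sampling may differ):
```python
import torch
import torch.nn.functional as F


def long_short_contrastive_loss(short_emb, long_emb, temperature=0.1):
    """Contrast short-clip embeddings against long-clip embeddings.

    short_emb, long_emb: (B, D) embeddings where row i of each tensor comes
    from the same video; all other rows act as in-batch negatives.
    """
    short_emb = F.normalize(short_emb, dim=1)
    long_emb = F.normalize(long_emb, dim=1)
    logits = short_emb @ long_emb.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(short_emb.size(0), device=short_emb.device)
    return F.cross_entropy(logits, labels)            # diagonal entries are positives
```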