Contextualized Spatio-Temporal Contrastive Learning with
Self-Supervision
- URL: http://arxiv.org/abs/2112.05181v1
- Date: Thu, 9 Dec 2021 19:13:41 GMT
- Title: Contextualized Spatio-Temporal Contrastive Learning with
Self-Supervision
- Authors: Liangzhe Yuan, Rui Qian, Yin Cui, Boqing Gong, Florian Schroff,
Ming-Hsuan Yang, Hartwig Adam, Ting Liu
- Abstract summary: We present the ConST-CL framework to effectively learn spatio-temporally fine-grained representations.
We first design a region-based self-supervised task which requires the model to learn to transform instance representations from one view to another guided by context features.
We then introduce a simple design that effectively reconciles the simultaneous learning of both holistic and local representations.
- Score: 106.77639982059014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A modern self-supervised learning algorithm typically enforces persistency of
the representations of an instance across views. While being very effective on
learning holistic image and video representations, such an approach becomes
sub-optimal for learning spatio-temporally fine-grained features in videos,
where scenes and instances evolve through space and time. In this paper, we
present the Contextualized Spatio-Temporal Contrastive Learning (ConST-CL)
framework to effectively learn spatio-temporally fine-grained representations
using self-supervision. We first design a region-based self-supervised pretext
task which requires the model to learn to transform instance representations
from one view to another guided by context features. Further, we introduce a
simple network design that effectively reconciles the simultaneous learning
process of both holistic and local representations. We evaluate our learned
representations on a variety of downstream tasks and ConST-CL achieves
state-of-the-art results on four datasets. For spatio-temporal action
localization, ConST-CL achieves 39.4% mAP with ground-truth boxes and 30.5% mAP
with detected boxes on the AVA-Kinetics validation set. For object tracking,
ConST-CL achieves 78.1% precision and 55.2% success scores on OTB2015.
Furthermore, ConST-CL achieves 94.8% and 71.9% top-1 fine-tuning accuracy on
video action recognition datasets, UCF101 and HMDB51 respectively. We plan to
release our code and models to the public.
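The abstract describes two ingredients: a holistic cross-view contrastive objective, and a region-based pretext task in which instance representations from one view are transformed into the other view, guided by that view's context features. Below is a minimal, illustrative PyTorch sketch of this idea; it is not the authors' released implementation, and the module names, the cross-attention used for the context-guided transformation, and the InfoNCE formulation are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(q, k, temperature=0.1):
    """Standard InfoNCE loss: q[i] should match k[i] against all other k's."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature                       # (N, N) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

class ContextGuidedRegionHead(nn.Module):
    """Hypothetical head: transforms region features from view A into predictions
    of the corresponding region features in view B, conditioned on view B's
    context tokens (e.g., pooled spatio-temporal feature-map entries)."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, regions_a, context_b):
        # regions_a: (B, R, D) region/instance features from view A
        # context_b: (B, T, D) context tokens from view B
        transformed, _ = self.cross_attn(regions_a, context_b, context_b)
        return self.proj(transformed)                      # predicted view-B region features

def const_cl_style_loss(holistic_a, holistic_b, regions_a, regions_b,
                        context_a, context_b, head, lam=1.0):
    """Holistic contrastive term plus a local cross-view prediction term
    (an illustration of the idea, not the paper's exact objective)."""
    # 1) Holistic (clip-level) contrastive term across the two views.
    loss_holistic = info_nce(holistic_a, holistic_b)

    # 2) Local term: predict view-B regions from view-A regions and view-B context,
    #    then contrast against the actual view-B region features (and vice versa).
    pred_b = head(regions_a, context_b)
    pred_a = head(regions_b, context_a)
    loss_local = info_nce(pred_b.flatten(0, 1), regions_b.flatten(0, 1)) + \
                 info_nce(pred_a.flatten(0, 1), regions_a.flatten(0, 1))

    return loss_holistic + lam * loss_local

# Toy usage with random features standing in for backbone outputs.
B, R, T, D = 4, 6, 8, 256
head = ContextGuidedRegionHead(dim=D)
loss = const_cl_style_loss(torch.randn(B, D), torch.randn(B, D),
                           torch.randn(B, R, D), torch.randn(B, R, D),
                           torch.randn(B, T, D), torch.randn(B, T, D), head)
loss.backward()
```

In a full pipeline the holistic, region, and context features would all come from a shared video backbone applied to two augmented views of the same clip.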
Related papers
- Debiasing, calibrating, and improving Semi-supervised Learning
performance via simple Ensemble Projector [0.0]
We propose a simple method named Ensemble Projectors Aided for Semi-supervised Learning (EPASS).
Unlike standard methods, EPASS stores the ensemble embeddings from multiple projectors in memory banks.
EPASS improves generalization, strengthens feature representation, and boosts performance.
arXiv Detail & Related papers (2023-10-24T12:11:19Z)
- OCTScenes: A Versatile Real-World Dataset of Tabletop Scenes for Object-Centric Learning [41.09407455527254]
We propose a versatile real-world dataset of tabletop scenes for object-centric learning called OCTScenes.
OCTScenes contains 5000 tabletop scenes with a total of 15 objects.
It is meticulously designed to serve as a benchmark for comparing, evaluating, and analyzing object-centric learning methods.
arXiv Detail & Related papers (2023-06-16T08:26:57Z)
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
- Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization [82.75718846187685]
We introduce Open-VCLIP, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier.
We show that training an Open-VCLIP is equivalent to continual learning with zero historical data.
In particular, we achieve 87.9%, 58.3%, 81.1% zero-shot accuracy on UCF, HMDB and Kinetics-600 datasets.
arXiv Detail & Related papers (2023-02-01T17:44:17Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are available only on the source dataset and are unavailable on the target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- MST: Masked Self-Supervised Transformer for Visual Representation [52.099722121603506]
Transformers have been widely used for self-supervised pre-training in Natural Language Processing (NLP).
We present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image.
MST achieves Top-1 accuracy of 76.9% with DeiT-S only using 300-epoch pre-training by linear evaluation.
arXiv Detail & Related papers (2021-06-10T11:05:18Z)
- Pairwise Similarity Knowledge Transfer for Weakly Supervised Object Localization [53.99850033746663]
We study the problem of learning localization model on target classes with weakly supervised image labels.
In this work, we argue that learning only an objectness function is a weak form of knowledge transfer.
Experiments on the COCO and ILSVRC 2013 detection datasets show that the performance of the localization model improves significantly with the inclusion of pairwise similarity function.
arXiv Detail & Related papers (2020-03-18T17:53:33Z)
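The preceding entry argues that an objectness function alone is a weak form of knowledge transfer, and that adding a pairwise similarity function between proposals improves localization. The toy sketch below, assuming a simple MLP similarity head and a greedy objectness-plus-similarity scoring rule (neither taken from that paper), illustrates how such a combination might select one box per image for a target class.

```python
import torch
import torch.nn as nn

class PairwiseSimilarity(nn.Module):
    """Hypothetical pairwise head: scores how likely two proposal features
    depict the same object (a function transferred from source classes)."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, a, b):
        return self.mlp(torch.cat([a, b], dim=-1)).squeeze(-1)

def select_boxes(proposal_feats, objectness, pairwise, alpha=0.5):
    """Greedy illustration: for each image, pick the proposal whose combined
    objectness + mean cross-image pairwise-similarity score is highest.
    proposal_feats: list of (P_i, D) tensors, one per image of the same class.
    objectness:     list of (P_i,) objectness scores per image."""
    # Use the top-objectness proposal of every other image as a reference set.
    refs = torch.stack([f[o.argmax()] for f, o in zip(proposal_feats, objectness)])
    picks = []
    for i, (feats, obj) in enumerate(zip(proposal_feats, objectness)):
        others = torch.cat([refs[:i], refs[i + 1:]])                 # (N-1, D)
        sim = pairwise(feats.unsqueeze(1).expand(-1, others.size(0), -1),
                       others.unsqueeze(0).expand(feats.size(0), -1, -1)).mean(dim=1)
        picks.append(int((alpha * obj + (1 - alpha) * sim).argmax()))
    return picks

# Toy usage: random proposal features for 3 images of one target class
# (the similarity head is untrained here, so the output is for flow/shape only).
pairwise = PairwiseSimilarity(dim=128)
feats = [torch.randn(5, 128) for _ in range(3)]
obj = [torch.rand(5) for _ in range(3)]
print(select_boxes(feats, obj, pairwise))
```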
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all of the above) and is not responsible for any consequences arising from its use.