Unsupervised Visual Representation Learning by Tracking Patches in Video
- URL: http://arxiv.org/abs/2105.02545v1
- Date: Thu, 6 May 2021 09:46:42 GMT
- Title: Unsupervised Visual Representation Learning by Tracking Patches in Video
- Authors: Guangting Wang, Yizhou Zhou, Chong Luo, Wenxuan Xie, Wenjun Zeng, and
Zhiwei Xiong
- Abstract summary: We propose to use tracking as a proxy task for a computer vision system to learn visual representations.
Modelled on the Catch game played by children, we design a Catch-the-Patch (CtP) game for a 3D-CNN model to learn visual representations.
- Score: 88.56860674483752
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Inspired by the fact that human eyes continue to develop tracking ability in
early and middle childhood, we propose to use tracking as a proxy task for a
computer vision system to learn visual representations. Modelled on the
Catch game played by children, we design a Catch-the-Patch (CtP) game for a
3D-CNN model to learn visual representations that would help with video-related
tasks. In the proposed pretraining framework, we cut an image patch from a
given video and let it scale and move according to a pre-set trajectory. The
proxy task is to estimate the position and size of the image patch in a
sequence of video frames, given only the target bounding box in the first
frame. We discover that using multiple image patches simultaneously brings
clear benefits. We further increase the difficulty of the game by randomly
making patches invisible. Extensive experiments on mainstream benchmarks
demonstrate the superior performance of CtP against other video pretraining
methods. In addition, CtP-pretrained features are less sensitive to domain gaps
than those trained by a supervised action recognition task. When both are trained
on Kinetics-400, we are pleasantly surprised to find that the CtP-pretrained
representation achieves much higher action classification accuracy than its
fully supervised counterpart on the Something-Something dataset. Code is available
online: github.com/microsoft/CtP.
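To make the pretext task concrete, below is a minimal sketch of how one such training sample could be generated with NumPy. It is an illustration only, not the released implementation linked above; the function and parameter names (make_ctp_sample, hide_prob, resize_nearest) are invented for this sketch, the trajectory is a simple random walk chosen for brevity, and the actual method additionally places multiple patches per clip.
```python
# Minimal sketch of CtP-style sample generation (illustration only; not the
# released code at github.com/microsoft/CtP). All names are invented here, and
# the paper additionally uses multiple patches per clip.
import numpy as np


def resize_nearest(img, size):
    """Nearest-neighbour resize of an (h, w, c) array to size = (new_h, new_w)."""
    new_h, new_w = size
    rows = np.arange(new_h) * img.shape[0] // new_h
    cols = np.arange(new_w) * img.shape[1] // new_w
    return img[rows][:, cols]


def make_ctp_sample(clip, patch_size=32, hide_prob=0.2, rng=None):
    """clip: float array (T, H, W, C). Returns the clip with a pasted moving patch,
    plus the regression targets: per-frame boxes (cx, cy, w, h) and visibility flags."""
    rng = np.random.default_rng() if rng is None else rng
    T, H, W, _ = clip.shape

    # 1. Cut a square patch from the first frame.
    y0 = int(rng.integers(0, H - patch_size))
    x0 = int(rng.integers(0, W - patch_size))
    patch = clip[0, y0:y0 + patch_size, x0:x0 + patch_size].copy()

    # 2. Pre-set trajectory: a smooth random walk of the centre plus a drifting scale.
    cx = np.clip(x0 + patch_size / 2 + np.cumsum(rng.normal(0, 5, T)), patch_size, W - patch_size)
    cy = np.clip(y0 + patch_size / 2 + np.cumsum(rng.normal(0, 5, T)), patch_size, H - patch_size)
    scale = np.clip(1.0 + np.cumsum(rng.normal(0, 0.03, T)), 0.5, 2.0)

    out, boxes, visible = clip.copy(), np.zeros((T, 4), np.float32), np.ones(T, bool)
    for t in range(T):
        w = h = int(patch_size * scale[t])
        boxes[t] = (cx[t], cy[t], w, h)
        # 3. Randomly hide the patch in some frames (frame 0 stays visible, since only
        #    the first-frame bounding box is given to the model).
        if t > 0 and rng.random() < hide_prob:
            visible[t] = False
            continue
        x1, y1 = int(cx[t] - w / 2), int(cy[t] - h / 2)
        out[t, y1:y1 + h, x1:x1 + w] = resize_nearest(patch, (h, w))
    return out, boxes, visible
```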
Related papers
- Self-supervised and Weakly Supervised Contrastive Learning for
Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state-of-the-art by a large margin on downstream fine-grained action classification, while also offering faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z) - Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z) - PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is a self-supervised learning (SSL) framework that selects clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z) - RSPNet: Relative Speed Perception for Unsupervised Video Representation
Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task to well model both motion and appearance features.
We propose a new way to perceive playback speed, exploiting the relative speed between two video clips as labels (a toy sketch of this labelling appears after this list).
arXiv Detail & Related papers (2020-10-27T16:42:50Z) - SeCo: Exploring Sequence Supervision for Unsupervised Representation
Learning [114.58986229852489]
In this paper, we explore the basic and generic supervision in the sequence from spatial, sequential and temporal perspectives.
We derive a particular form of sequence supervision named Sequence Contrastive Learning (SeCo).
SeCo shows superior results under the linear protocol on action recognition, untrimmed activity recognition and object tracking.
arXiv Detail & Related papers (2020-08-03T15:51:35Z) - VirTex: Learning Visual Representations from Textual Annotations [25.104705278771895]
VirTex is a pretraining approach using semantically dense captions to learn visual representations.
We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks.
On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised.
arXiv Detail & Related papers (2020-06-11T17:58:48Z) - Disentangling Controllable Object through Video Prediction Improves
Visual Reinforcement Learning [82.25034245150582]
In many vision-based reinforcement learning problems, the agent controls a movable object in its visual field.
We propose an end-to-end learning framework to disentangle the controllable object from the observation signal.
The disentangled representation is shown to be useful for RL as additional observation channels to the agent.
arXiv Detail & Related papers (2020-02-21T05:43:34Z)
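As referenced in the RSPNet entry above, here is a toy illustration of relative-playback-speed labelling: two clips are sampled from the same video at possibly different frame strides, and the label encodes which clip plays back faster. This is a reading of the one-line summary, not the paper's exact recipe; all names below are invented.
```python
# Toy illustration of a relative-playback-speed pretext label (based on the RSPNet
# summary above, not on the paper's released code; names are invented).
import numpy as np


def sample_relative_speed_pair(video, clip_len=16, strides=(1, 2, 4), rng=None):
    """video: array (T, H, W, C), assumed long enough for the largest stride.
    Returns two clips sampled at possibly different frame strides (playback speeds)
    and a label: 0 = same speed, 1 = first clip faster, 2 = second clip faster."""
    rng = np.random.default_rng() if rng is None else rng
    s1, s2 = (int(s) for s in rng.choice(strides, size=2))
    clips = []
    for s in (s1, s2):
        start = int(rng.integers(0, len(video) - clip_len * s))
        clips.append(video[start:start + clip_len * s:s])
    label = 0 if s1 == s2 else (1 if s1 > s2 else 2)
    return clips[0], clips[1], label
```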