Self-supervised Video Representation Learning with Cross-Stream
Prototypical Contrasting
- URL: http://arxiv.org/abs/2106.10137v2
- Date: Mon, 21 Jun 2021 09:41:01 GMT
- Title: Self-supervised Video Representation Learning with Cross-Stream
Prototypical Contrasting
- Authors: Martine Toering, Ioannis Gatopoulos, Maarten Stol, Vincent Tao Hu
- Abstract summary: "Video Cross-Stream Prototypical Contrasting" is a novel method which predicts consistent prototype assignments from both RGB and optical flow views.
We obtain state-of-the-art results on nearest neighbour video retrieval and action recognition.
- Score: 2.2530496464901106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instance-level contrastive learning techniques, which rely on data
augmentation and a contrastive loss function, have found great success in the
domain of visual representation learning. They are not suitable for exploiting
the rich dynamical structure of video, however, as operations are done on many
augmented instances. In this paper we propose "Video Cross-Stream Prototypical
Contrasting", a novel method which predicts consistent prototype assignments
from both RGB and optical flow views, operating on sets of samples.
Specifically, we alternate the optimization process; while optimizing one of
the streams, all views are mapped to one set of stream prototype vectors. Each
of the assignments is predicted with all views except the one matching the
prediction, pushing representations closer to their assigned prototypes. As a
result, more efficient video embeddings with ingrained motion information are
learned, without the explicit need for optical flow computation during
inference. We obtain state-of-the-art results on nearest neighbour video
retrieval and action recognition, outperforming previous best by +3.2% on
UCF101 using the S3D backbone (90.5% Top-1 acc), and by +7.2% on UCF101 and
+15.1% on HMDB51 using the R(2+1)D backbone.
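As a rough illustration of the swapped prototype-assignment idea described in the abstract, the following is a minimal PyTorch-style sketch, simplified to one RGB view and one optical-flow view per clip. The names (sinkhorn, cross_stream_swav_loss, prototypes, temp) are illustrative assumptions and this is not the authors' released implementation, which alternates optimization between the two streams and operates on multiple views.

```python
# Minimal sketch (not the authors' code): swapped prototype-assignment
# prediction between an RGB view and an optical-flow view of the same clip.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """Balanced soft assignments of a batch of views to prototypes (SwAV-style)."""
    q = torch.exp(scores / eps).t()               # (K, B)
    q /= q.sum()
    K, B = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True); q /= K   # normalize over prototypes
        q /= q.sum(dim=0, keepdim=True); q /= B   # normalize over samples
    return (q * B).t()                            # (B, K), each row sums to 1

def cross_stream_swav_loss(z_rgb, z_flow, prototypes, temp=0.1):
    """z_rgb, z_flow: L2-normalized clip embeddings (B, D) from the two streams.
    prototypes: (K, D) prototype vectors of the stream currently being optimized."""
    p = F.normalize(prototypes, dim=1)
    s_rgb, s_flow = z_rgb @ p.t(), z_flow @ p.t()      # similarities to prototypes
    q_rgb, q_flow = sinkhorn(s_rgb), sinkhorn(s_flow)  # soft assignments (codes)
    # Swapped prediction: each assignment is predicted from the *other* view,
    # pushing both representations toward their assigned prototypes.
    loss = -0.5 * ((q_rgb * F.log_softmax(s_flow / temp, dim=1)).sum(1)
                   + (q_flow * F.log_softmax(s_rgb / temp, dim=1)).sum(1)).mean()
    return loss
```

In a fuller setup the prototypes would be a learnable layer of the stream being optimized, and the assignment/prediction roles would rotate over all available views as the abstract describes.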
Related papers
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates features from all the clips in an online fashion for the final class prediction.
arXiv Detail & Related papers (2023-12-11T18:31:13Z) - Building an Open-Vocabulary Video CLIP Model with Better Architectures,
Optimization and Data [102.0069667710562]
This paper presents Open-VCLIP++, a framework that adapts CLIP to a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
arXiv Detail & Related papers (2023-10-08T04:46:43Z) - It Takes Two: Masked Appearance-Motion Modeling for Self-supervised
Video Transformer Pre-training [76.69480467101143]
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
arXiv Detail & Related papers (2022-10-11T08:05:18Z) - Motion Sensitive Contrastive Learning for Self-supervised Video
Representation [34.854431881562576]
Motion Sensitive Contrastive Learning (MSCL) injects the motion information captured by optical flows into RGB frames to strengthen feature learning.
It further introduces Local Motion Contrastive Learning (LMCL) with frame-level contrastive objectives across the two modalities (a generic sketch of such an objective appears after this list),
along with Flow Rotation Augmentation (FRA) to generate extra motion-shuffled negative samples and Motion Differential Sampling (MDS) to accurately screen training samples.
arXiv Detail & Related papers (2022-08-12T04:06:56Z) - Frame-wise Action Representations for Long Videos via Sequence
Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z) - Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency.
Our method is trained only on a pair of original and processed videos rather than on a large dataset.
We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z) - ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as the instances and learn visual representations by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learn robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - TCLR: Temporal Contrastive Learning for Video Representation [49.6637562402604]
We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods.
With the commonly used 3D-ResNet-18 architecture, we achieve 82.4% (+5.1% increase over the previous best) top-1 accuracy on UCF101 and 52.9% (+5.4% increase) on HMDB51 action classification.
arXiv Detail & Related papers (2021-01-20T05:38:16Z)
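For context on the frame-level contrastive objectives across two modalities mentioned for MSCL above, here is a generic, hypothetical sketch of a symmetric InfoNCE loss between corresponding RGB and optical-flow frame features. It is an assumption-level illustration of the general technique, not the exact LMCL formulation used in that paper.

```python
# Hypothetical sketch: symmetric frame-level cross-modal InfoNCE loss.
import torch
import torch.nn.functional as F

def frame_infonce(f_rgb, f_flow, temp=0.07):
    """f_rgb, f_flow: (N, D) L2-normalized features for N corresponding frames.
    Matching frames across the two modalities are positives; all others negatives."""
    logits = f_rgb @ f_flow.t() / temp                     # (N, N) similarity matrix
    targets = torch.arange(f_rgb.size(0), device=f_rgb.device)
    # Symmetric cross-entropy over both directions: RGB->flow and flow->RGB.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```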