TransRank: Self-supervised Video Representation Learning via
Ranking-based Transformation Recognition
- URL: http://arxiv.org/abs/2205.02028v1
- Date: Wed, 4 May 2022 12:39:25 GMT
- Title: TransRank: Self-supervised Video Representation Learning via
Ranking-based Transformation Recognition
- Authors: Haodong Duan, Nanxuan Zhao, Kai Chen, Dahua Lin
- Abstract summary: We observe the great potential of RecogTrans on both semantic-related and temporal-related downstream tasks.
Because they rely on hard-label classification, existing RecogTrans approaches suffer from noisy supervision signals in pre-training.
To mitigate this problem, we develop TransRank, a unified framework for recognizing Transformations in a Ranking formulation.
- Score: 73.7566539108205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing the transformation types applied to a video clip (RecogTrans) is a
long-established paradigm for self-supervised video representation learning, yet it
performs much worse than instance discrimination approaches (InstDisc) in recent works.
However, based on a thorough comparison of representative RecogTrans and InstDisc
methods, we observe great potential for RecogTrans on both semantic-related and
temporal-related downstream tasks. Because they rely on hard-label classification,
existing RecogTrans approaches suffer from noisy supervision signals in pre-training.
To mitigate this problem, we develop TransRank, a unified framework for recognizing
Transformations in a Ranking formulation. TransRank provides accurate
supervision signals by recognizing transformations relatively, consistently
outperforming the classification-based formulation. Meanwhile, the unified
framework can be instantiated with an arbitrary set of temporal or spatial
transformations, demonstrating good generality. With a ranking-based
formulation and several empirical practices, we achieve competitive performance
on video retrieval and action recognition. Under the same setting, TransRank
surpasses the previous state-of-the-art method by 6.4% on UCF101 and 8.3% on
HMDB51 for action recognition (Top1 Acc); improves video retrieval on UCF101 by
20.4% (R@1). The promising results validate that RecogTrans is still a paradigm worth
exploring for video self-supervised learning. Code will be released
at https://github.com/kennymckormick/TransRank.
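
To make the ranking formulation more concrete, below is a minimal, hypothetical PyTorch sketch of how recognizing a transformation "relatively" can be written as a pairwise margin-ranking objective over transformation logits, next to the hard-label cross-entropy baseline the abstract criticizes. The function names, margin value, and candidate count are illustrative assumptions and are not taken from the released TransRank code.

```python
# Hypothetical sketch (not the authors' code): transformation recognition as
# pairwise ranking instead of hard-label classification.
import torch
import torch.nn.functional as F


def ranking_recogtrans_loss(logits: torch.Tensor,
                            applied: torch.Tensor,
                            margin: float = 1.0) -> torch.Tensor:
    """logits: (B, K) scores for K candidate transformations of each clip.
    applied: (B,) index of the transformation that was actually applied.
    The applied transformation only has to out-score each other candidate
    by `margin`, i.e. it is recognized *relative* to the alternatives."""
    B, K = logits.shape
    pos = logits.gather(1, applied.unsqueeze(1))      # (B, 1) applied score
    violations = F.relu(margin - (pos - logits))      # (B, K) hinge terms
    mask = F.one_hot(applied, K).bool()               # ignore the self-pair
    violations = violations.masked_fill(mask, 0.0)
    return violations.sum(dim=1).mean() / (K - 1)


def hard_label_loss(logits: torch.Tensor, applied: torch.Tensor) -> torch.Tensor:
    """Classification-based baseline: a one-hot target for every clip."""
    return F.cross_entropy(logits, applied)


if __name__ == "__main__":
    B, K = 4, 4                                 # e.g. 4 playback-speed candidates
    logits = torch.randn(B, K, requires_grad=True)
    applied = torch.randint(0, K, (B,))
    print(ranking_recogtrans_loss(logits, applied).item())
    print(hard_label_loss(logits, applied).item())
```

The ranking form only asks the applied transformation to out-score the alternatives by a margin, a softer target than forcing a one-hot label on clips for which the transformation is ambiguous, which is where the abstract locates the noisy-supervision problem of classification-based RecogTrans.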
Related papers
- kTrans: Knowledge-Aware Transformer for Binary Code Embedding [15.361622199889263]
We propose a novel Transformer-based approach, namely kTrans, to generate knowledge-aware binary code embedding.
We inspect the generated embeddings with outlier detection and visualization, and also apply kTrans to 3 downstream tasks: Binary Code Similarity Detection (BCSD), Function Type Recovery (FTR), and Indirect Call Recognition (ICR).
Evaluation results show that kTrans can generate high-quality binary code embeddings, and outperforms state-of-the-art (SOTA) approaches on downstream tasks by 5.2%, 6.8%, and 12.6% respectively.
arXiv Detail & Related papers (2023-08-24T09:07:11Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network dubbed DOAD to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- SVFormer: Semi-supervised Video Transformer for Action Recognition [88.52042032347173]
We introduce SVFormer, which adopts a steady pseudo-labeling framework to cope with unlabeled video samples.
In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos.
In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400.
arXiv Detail & Related papers (2022-11-23T18:58:42Z)
- Transfer of Representations to Video Label Propagation: Implementation Factors Matter [31.030799003595522]
We study the impact of important implementation factors in feature extraction and label propagation.
We show that augmenting video-based correspondence cues with still-image-based ones can further improve performance.
We hope that this study will help to improve evaluation practices and better inform future research directions in temporal correspondence.
arXiv Detail & Related papers (2022-03-10T18:58:22Z)
- Time-Equivariant Contrastive Video Representation Learning [47.50766781135863]
We introduce a novel self-supervised contrastive learning method to learn representations from unlabelled videos.
Our experiments show that time-equivariant representations achieve state-of-the-art results in video retrieval and action recognition benchmarks.
arXiv Detail & Related papers (2021-12-07T10:45:43Z)
- Joint Inductive and Transductive Learning for Video Object Segmentation [107.32760625159301]
Semi-supervised object segmentation is a task of segmenting the target object in a video sequence given only a mask in the first frame.
Most previous best-performing methods adopt matching-based transductive reasoning or online inductive learning.
We propose to integrate transductive and inductive learning into a unified framework to exploit the complementarity between them for accurate and robust video object segmentation.
arXiv Detail & Related papers (2021-08-08T16:25:48Z)
- Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot.
It remedies the domain gap with enhanced transferable features by exploiting temporal cues in videos and the inherent correlations in multi-modal data for gesture recognition.
Results show that our approach recovers performance with large gains, up to 12.91% in accuracy and 20.16% in F1 score, without using any annotations from the real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z)
- Self-Supervised Learning via multi-Transformation Classification for Action Recognition [10.676377556393527]
We introduce a self-supervised video representation learning method based on multi-transformation classification to efficiently classify human actions.
The representation of the video is learned in a self-supervised manner by classifying seven different transformations (a minimal sketch of this style of pretext task follows this list).
We have conducted the experiments on UCF101 and HMDB51 datasets together with C3D and 3D Resnet-18 as backbone networks.
arXiv Detail & Related papers (2021-02-20T16:11:26Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Self-supervised learning using consistency regularization of spatio-temporal data augmentation for action recognition [15.701647552427708]
We present a novel way to obtain the surrogate supervision signal based on high-level feature maps under consistency regularization.
Our method achieves substantial improvements compared with state-of-the-art self-supervised learning methods for action recognition.
arXiv Detail & Related papers (2020-08-05T12:41:59Z)
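
For contrast with the ranking objective sketched after the TransRank abstract, the "Self-Supervised Learning via multi-Transformation Classification" entry above describes the classification-style pretext task that TransRank argues against. Below is a minimal, hypothetical sketch of such a task; the four candidate transformations are placeholders for illustration and are not the seven transformations used in that paper.

```python
# Hypothetical sketch of a transformation-classification pretext sample:
# one transformation from a fixed candidate set is applied to a clip and
# the network must predict its index with cross-entropy.
# The candidate set here is a placeholder, not the cited paper's seven.
import random
import torch


def make_pretext_sample(clip: torch.Tensor):
    """clip: (C, T, H, W) video tensor. Returns (transformed_clip, label)."""
    candidates = [
        ("identity", lambda c: c),
        ("hflip",    lambda c: torch.flip(c, dims=[3])),   # flip the width axis
        ("reverse",  lambda c: torch.flip(c, dims=[1])),   # reverse time
        # subsample every 2nd frame, then duplicate frames to keep length T
        ("speed_2x", lambda c: c[:, ::2].repeat_interleave(2, dim=1)[:, :c.shape[1]]),
    ]
    label = random.randrange(len(candidates))
    return candidates[label][1](clip), label


if __name__ == "__main__":
    clip = torch.randn(3, 16, 112, 112)   # 16-frame clip, C3D-style input size
    x, y = make_pretext_sample(clip)
    print(x.shape, y)                     # same shape, plus the pretext label
```

The hard one-hot label produced here is exactly what the TransRank abstract identifies as a source of noisy supervision, since some clips are nearly invariant to a given transformation.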