Self-Supervised Learning via multi-Transformation Classification for
Action Recognition
- URL: http://arxiv.org/abs/2102.10378v1
- Date: Sat, 20 Feb 2021 16:11:26 GMT
- Title: Self-Supervised Learning via multi-Transformation Classification for
Action Recognition
- Authors: Duc Quang Vu, Ngan T.H. Le and Jia-Ching Wang
- Abstract summary: We introduce a self-supervised video representation learning method based on multi-transformation classification to efficiently classify human actions.
The representation of the video is learned in a self-supervised manner by classifying seven different transformations.
We have conducted the experiments on UCF101 and HMDB51 datasets together with C3D and 3D Resnet-18 as backbone networks.
- Score: 10.676377556393527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised tasks have been utilized to build useful representations that
can be used in downstream tasks when annotations are unavailable. In this
paper, we introduce a self-supervised video representation learning method
based on multi-transformation classification to efficiently classify human
actions. Self-supervised learning on various transformations not only provides
richer contextual information but also makes the visual representation more
robust to those transformations. The spatio-temporal representation of the
video is learned in a self-supervised manner by classifying seven different
transformations, i.e., rotation, clip inversion, permutation, split-and-join
transformation, color switch, frame replacement, and noise addition. First, the seven
different video transformations are applied to video clips. Then the 3D
convolutional neural networks are utilized to extract features for clips and
these features are processed to classify the pseudo-labels. We use the learned
models in pretext tasks as the pre-trained models and fine-tune them to
recognize human actions in the downstream task. We have conducted the
experiments on UCF101 and HMDB51 datasets together with C3D and 3D Resnet-18 as
backbone networks. The experimental results have shown that our proposed
framework outperforms other SOTA self-supervised action recognition
approaches. The code will be made publicly available.
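As a concrete illustration of the pipeline described in the abstract, here is a minimal sketch of multi-transformation classification pretraining: sample a clip, apply one of the transformations, and train a 3D CNN to predict which one was applied. Only four of the seven transformations are implemented, and the Tiny3DCNN backbone, transform details, and hyperparameters are illustrative assumptions rather than the authors' released code.

```python
# Minimal sketch of multi-transformation classification pretraining (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_transformation(clip: torch.Tensor, label: int) -> torch.Tensor:
    """clip: (C, T, H, W); returns a transformed copy, `label` is the pseudo-label."""
    if label == 0:                                   # rotation (90 degrees, spatial)
        return torch.rot90(clip, k=1, dims=(2, 3))
    if label == 1:                                   # clip inversion (reverse time)
        return clip.flip(dims=(1,))
    if label == 2:                                   # frame permutation
        return clip[:, torch.randperm(clip.shape[1])]
    if label == 3:                                   # noise addition
        return clip + 0.1 * torch.randn_like(clip)
    raise ValueError("only 4 of the 7 transformations are sketched here")

class Tiny3DCNN(nn.Module):
    """Stand-in for the paper's C3D / 3D ResNet-18 backbones."""
    def __init__(self, num_pseudo_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(16, num_pseudo_classes)  # classifies the pseudo-label

    def forward(self, x):                              # x: (B, C, T, H, W)
        return self.head(self.features(x).flatten(1))

# One pretraining step on a random batch of clips.
model = Tiny3DCNN()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
clips = torch.randn(8, 3, 16, 112, 112)                # 8 clips of 16 frames
labels = torch.randint(0, 4, (8,))
batch = torch.stack([apply_transformation(c, int(l)) for c, l in zip(clips, labels)])
loss = F.cross_entropy(model(batch), labels)
loss.backward()
opt.step()
```

After pretraining, the classification head is discarded and the backbone weights (C3D or 3D ResNet-18 in the paper) initialize the action-recognition model that is fine-tuned on UCF101 or HMDB51.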
Related papers
- ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos [4.736059095502584]
This work proposes a novel approach using Cross-Architecture Pseudo-Labeling with contrastive learning for semi-supervised action recognition (a minimal pseudo-labeling step is sketched after this entry).
We introduce a novel cross-architecture approach where 3D Convolutional Neural Networks (3D CNNs) and video transformers (ViT) are utilised to capture different aspects of action representations.
arXiv Detail & Related papers (2024-04-09T12:09:56Z)
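To make the cross-architecture pseudo-labeling idea above concrete, the following sketch lets one architecture's confident predictions on unlabeled clips supervise the other. The confidence threshold, the direction of supervision, and the function signature are assumptions for illustration, not the paper's actual training recipe.

```python
# Hypothetical cross-architecture pseudo-labeling step (illustrative only).
import torch
import torch.nn.functional as F

def pseudo_label_step(cnn3d, vit, clips, optimizer, threshold=0.95):
    """Supervise the transformer with the 3D CNN's confident predictions."""
    with torch.no_grad():
        probs = F.softmax(cnn3d(clips), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = conf >= threshold                 # keep only confident samples
    if mask.any():
        loss = F.cross_entropy(vit(clips[mask]), pseudo[mask])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # a symmetric step would let the transformer supervise the 3D CNN
```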
- Video Action Recognition Collaborative Learning with Dynamics via PSO-ConvNet Transformer [1.876462046907555]
We propose a novel PSO-ConvNet model for learning actions in videos.
Our experimental results on the UCF-101 dataset demonstrate substantial improvements of up to 9% in accuracy.
Overall, our dynamic PSO-ConvNet model provides a promising direction for improving Human Action Recognition.
arXiv Detail & Related papers (2023-02-17T23:39:34Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules (a generic version is sketched after this entry).
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
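The backbone modernization mentioned for SSMTL++ can be pictured as inserting standard multi-head self-attention over the flattened spatio-temporal positions of a 3D feature map. The sketch below shows that generic pattern; the channel width, head count, and placement in the network are assumptions, not the paper's exact design.

```python
# Hypothetical multi-head self-attention over a 3D CNN feature map (illustrative).
import torch
import torch.nn as nn

class SpatioTemporalMHSA(nn.Module):
    def __init__(self, channels: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, T*H*W, C): one token per position
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, t, h, w) + x  # residual connection

feat = torch.randn(2, 64, 4, 7, 7)              # e.g. a mid-level 3D feature map
refined = SpatioTemporalMHSA()(feat)            # same shape, attention-refined
```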
- Cross-Architecture Self-supervised Video Representation Learning [42.267775859095664]
We present a new cross-architecture contrastive learning framework for self-supervised video representation learning.
We introduce a temporal self-supervised learning module able to explicitly predict the edit distance between two video sequences (a dynamic-programming example follows this entry).
We evaluate our method on the tasks of video retrieval and action recognition on UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2022-05-26T12:41:19Z)
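For readers unfamiliar with the target quantity in the Cross-Architecture paper's temporal module: edit distance counts the minimum insertions, deletions, and substitutions turning one sequence into another. A standard dynamic-programming implementation might look like the following; applying it to frame-index sequences is an assumption about how such a target could be computed.

```python
# Standard Levenshtein edit distance between two sequences (illustrative).
def edit_distance(a, b) -> int:
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                    # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                    # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

# e.g. a clip vs. a temporally shuffled version of itself, by frame index:
print(edit_distance([0, 1, 2, 3], [0, 2, 1, 3]))    # -> 2
```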
- PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is an unsupervised SSL framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on the VGG-Sound and Kinetics-400 datasets with PreViTS (the momentum update at MoCo's core is sketched after this entry).
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
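MoCo, mentioned in the PreViTS entry, maintains a slowly moving copy of the query encoder for encoding keys. Its core is an exponential moving average of parameters; everything below besides that update rule (the stand-in encoder, the momentum value) is an illustrative assumption.

```python
# Momentum (EMA) update of the key encoder, the core of MoCo (illustrative).
import copy
import torch
import torch.nn as nn

query_encoder = nn.Linear(128, 64)              # stand-in for a video encoder
key_encoder = copy.deepcopy(query_encoder)      # starts as an exact copy
for p in key_encoder.parameters():
    p.requires_grad = False                     # keys are never backpropagated

@torch.no_grad()
def momentum_update(m: float = 0.999):
    for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
        k.mul_(m).add_(q, alpha=1.0 - m)        # k <- m*k + (1-m)*q

momentum_update()                               # called once per training step
```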
- Cloud based Scalable Object Recognition from Video Streams using Orientation Fusion and Convolutional Neural Networks [11.44782606621054]
Convolutional neural networks (CNNs) have been widely used to perform intelligent visual object recognition.
However, CNNs still suffer from severe accuracy degradation, particularly on illumination-variant datasets.
We propose a new CNN method based on orientation fusion for visual object recognition.
arXiv Detail & Related papers (2021-06-19T07:15:15Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Recognizing Actions in Videos from Unseen Viewpoints [80.6338404141284]
We show that current convolutional neural network models are unable to recognize actions from camera viewpoints not present in training data.
We introduce a new dataset for unseen view recognition and show the approach's ability to learn viewpoint-invariant representations.
arXiv Detail & Related papers (2021-03-30T17:17:54Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers (token extraction is sketched after this entry).
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
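Token extraction in the ViViT entry can be approximated by a single 3D convolution whose kernel and stride equal the patch size, turning a clip into a grid of spatio-temporal token embeddings. The patch size and embedding width below are illustrative assumptions, not ViViT's reported configuration.

```python
# Hypothetical spatio-temporal ("tubelet") tokenization, ViViT-style (illustrative).
import torch
import torch.nn as nn

patch = (2, 16, 16)                              # tubelet size: 2 frames x 16 x 16 pixels
embed = nn.Conv3d(3, 192, kernel_size=patch, stride=patch)

clip = torch.randn(1, 3, 16, 224, 224)           # (B, C, T, H, W)
tokens = embed(clip).flatten(2).transpose(1, 2)  # (B, 8*14*14 tokens, 192 dims)
print(tokens.shape)                              # torch.Size([1, 1568, 192])
# These tokens would then be fed to a standard transformer encoder.
```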
- Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised learning representation framework that can be directly deployed on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z)
- Video Representation Learning by Recognizing Temporal Transformations [37.59322456034611]
We introduce a novel self-supervised learning approach to learn representations of videos responsive to changes in the motion dynamics.
We promote an accurate learning of motion without human annotation by training a neural network to discriminate a video sequence from its temporally transformed versions.
Our experiments show that networks trained with the proposed method yield representations with improved transfer performance for action recognition.
arXiv Detail & Related papers (2020-07-21T11:43:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.