Video Action Recognition Collaborative Learning with Dynamics via
PSO-ConvNet Transformer
- URL: http://arxiv.org/abs/2302.09187v3
- Date: Thu, 21 Sep 2023 08:05:15 GMT
- Title: Video Action Recognition Collaborative Learning with Dynamics via
PSO-ConvNet Transformer
- Authors: Nguyen Huu Phong, Bernardete Ribeiro
- Abstract summary: We propose a novel PSO-ConvNet model for learning actions in videos.
Our experimental results on the UCF-101 dataset demonstrate substantial improvements of up to 9% in accuracy.
Overall, our dynamic PSO-ConvNet model provides a promising direction for improving Human Action Recognition.
- Score: 1.876462046907555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recognizing human actions in video sequences, known as Human Action
Recognition (HAR), is a challenging task in pattern recognition. While
Convolutional Neural Networks (ConvNets) have shown remarkable success in image
recognition, they are not always directly applicable to HAR, as temporal
features are critical for accurate classification. In this paper, we propose a
novel dynamic PSO-ConvNet model for learning actions in videos, building on our
recent work in image recognition. Our approach leverages a framework where the
weight vector of each neural network represents the position of a particle in
phase space, and particles share their current weight vectors and gradient
estimates of the loss function. To extend our approach to video, we integrate
ConvNets with state-of-the-art temporal methods such as Transformer and
Recurrent Neural Networks. Our experimental results on the UCF-101 dataset
demonstrate substantial improvements of up to 9% in accuracy, which confirms
the effectiveness of our proposed method. In addition, we conducted experiments
on larger and more varied datasets, including Kinetics-400 and HMDB-51, and
found Collaborative Learning preferable to Non-Collaborative Learning
(Individual Learning). Overall, our dynamic
PSO-ConvNet model provides a promising direction for improving HAR by better
capturing the spatio-temporal dynamics of human actions in videos. The code is
available at
https://github.com/leonlha/Video-Action-Recognition-Collaborative-Learning-with-Dynamics-via-PSO-ConvNet-Transformer.
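The particle analogy above translates into a simple update rule: each network is a particle whose position is its weight vector, and every step blends the particle's own gradient with the positions and gradient estimates shared by the rest of the swarm. The following is a minimal PyTorch sketch of that reading, not the authors' released implementation; the coefficients and the all-to-all sharing topology are assumptions.

```python
import torch

def collaborative_step(models, losses, lr=1e-3, c_grad=0.1, c_pos=0.01):
    """One illustrative 'dynamic PSO' step. Each network (particle) first
    computes its own gradient; particles then share positions (weights) and
    gradient estimates, and every particle is nudged toward the swarm.
    Coefficients are hypothetical, not taken from the paper."""
    grads, positions = [], []
    for model, loss in zip(models, losses):
        model.zero_grad()
        loss.backward()
        grads.append([p.grad.detach().clone() for p in model.parameters()])
        positions.append([p.detach().clone() for p in model.parameters()])

    n = len(models)
    with torch.no_grad():
        for i, model in enumerate(models):
            for k, p in enumerate(model.parameters()):
                shared_grad = sum(grads[j][k] for j in range(n) if j != i) / (n - 1)
                shared_pos = sum(positions[j][k] for j in range(n) if j != i) / (n - 1)
                p -= lr * (grads[i][k] + c_grad * shared_grad)  # own + shared gradient
                p += c_pos * (shared_pos - p)                   # attraction toward the swarm
```

In the paper the ConvNet backbones are paired with Transformer or RNN heads for temporal modelling; a collaborative update of this kind is agnostic to that choice.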
Related papers
- An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video [11.293897932762809]
Action recognition, an essential component of computer vision, plays a pivotal role in multiple applications.
CNNs suffer performance declines when trained with discontinuous video frames, which is a frequent scenario in real-world settings.
To overcome this issue, we introduce the 4A pipeline, which employs a series of sophisticated techniques.
arXiv Detail & Related papers (2024-04-10T04:59:51Z)
- ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos [4.736059095502584]
This work proposes a novel approach using Cross-Architecture Pseudo-Labeling with contrastive learning for semi-supervised action recognition.
We introduce a cross-architecture approach where 3D Convolutional Neural Networks (3D CNNs) and video transformers (ViTs) are utilised to capture different aspects of action representations.
arXiv Detail & Related papers (2024-04-09T12:09:56Z)
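As a rough sketch of the cross-architecture pseudo-labelling above, each backbone can supervise the other on unlabelled clips, keeping only confident predictions; the threshold and the symmetric form below are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def cross_pseudo_label_loss(cnn3d, vit, clips, threshold=0.9):
    """Illustrative cross-architecture pseudo-labelling: each backbone labels
    the unlabelled clips for the *other* backbone, keeping only predictions
    above a confidence threshold."""
    with torch.no_grad():
        conf_c, labels_c = F.softmax(cnn3d(clips), dim=1).max(dim=1)
        conf_v, labels_v = F.softmax(vit(clips), dim=1).max(dim=1)

    loss = clips.new_zeros(())          # scalar accumulator on the right device
    keep_v = conf_c > threshold         # CNN-confident clips supervise the ViT
    if keep_v.any():
        loss = loss + F.cross_entropy(vit(clips[keep_v]), labels_c[keep_v])
    keep_c = conf_v > threshold         # ViT-confident clips supervise the CNN
    if keep_c.any():
        loss = loss + F.cross_entropy(cnn3d(clips[keep_c]), labels_v[keep_c])
    return loss
```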
- Deep Learning Approaches for Human Action Recognition in Video Data [0.8080830346931087]
This study conducts an in-depth analysis of various deep learning models to address this challenge.
We focus on Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Two-Stream ConvNets.
The results of this study underscore the potential of composite models in achieving robust human action recognition.
arXiv Detail & Related papers (2024-03-11T15:31:25Z)
- Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training open a new route to visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
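A minimal sketch of the frozen-backbone idea above, assuming a pretrained image encoder (e.g. CLIP's visual tower) that maps a frame to a feature vector: the encoder stays frozen and only a light temporal head is trained. The head design is illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FrozenBackboneVideoHead(nn.Module):
    """Frozen image encoder + small trainable temporal head (illustrative).
    feat_dim must be divisible by nhead."""
    def __init__(self, image_encoder, feat_dim, num_classes):
        super().__init__()
        self.encoder = image_encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)       # frozen, per the paper's premise
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames):            # frames: (B, T, C, H, W)
        b, t = frames.shape[:2]
        with torch.no_grad():
            feats = self.encoder(frames.flatten(0, 1))  # (B*T, D) assumed
        feats = feats.view(b, t, -1)
        return self.classifier(self.temporal(feats).mean(dim=1))
```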
- PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is an unsupervised SSL framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
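Since PreViTS trains a MoCo encoder, a minimal sketch of the momentum update and InfoNCE loss is given below; the tracking-based choice of positive regions that PreViTS adds is omitted, and queue bookkeeping is simplified.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """EMA update of the key encoder, as in MoCo."""
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.mul_(m).add_(pq, alpha=1 - m)

def info_nce_loss(q, k, queue, temperature=0.07):
    """InfoNCE over a queue of negatives; q, k are (N, D) embeddings of two
    views of the same clip, queue is (K, D) and assumed L2-normalised."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)               # (N, 1) positives
    l_neg = q @ queue.t()                                   # (N, K) negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)                  # positives at index 0
```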
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate spatio-temporal kernels of dynamic scale to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters, with high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
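To make the segmented-attention idea above concrete, here is a simplified sketch in which softmax attention runs only inside non-overlapping temporal segments; STAR's actual segmented linear attention is more elaborate, so treat this purely as an illustration.

```python
import torch

def segmented_temporal_attention(x, segment_len=8):
    """Simplified segmented attention: attention is computed only inside
    non-overlapping temporal segments, so cost is linear in sequence length.
    x: (batch, time, dim); time must be a multiple of segment_len here."""
    b, t, d = x.shape
    segs = x.view(b, t // segment_len, segment_len, d)            # (B, S, L, D)
    scores = torch.einsum('bsld,bsmd->bslm', segs, segs) / d ** 0.5
    out = torch.einsum('bslm,bsmd->bsld', scores.softmax(dim=-1), segs)
    return out.reshape(b, t, d)
```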
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Self-Supervised Learning via multi-Transformation Classification for Action Recognition [10.676377556393527]
We introduce a self-supervised video representation learning method based on the multi-transformation classification to efficiently classify human actions.
The representation of the video is learned in a self-supervised manner by classifying seven different transformations.
We conducted experiments on the UCF101 and HMDB51 datasets with C3D and 3D ResNet-18 as backbone networks.
arXiv Detail & Related papers (2021-02-20T16:11:26Z)
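The pretext task above is straightforward to sketch: apply one of a fixed set of clip transformations and train the network to classify which was applied. The paper uses seven transformations; the four below are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

# Hypothetical transformation set; the paper classifies seven, the four
# below are just for illustration (the rotation assumes square frames).
TRANSFORMS = [
    lambda c: c,                                   # identity
    lambda c: torch.flip(c, dims=[-1]),            # horizontal flip
    lambda c: torch.flip(c, dims=[2]),             # temporal reversal
    lambda c: torch.rot90(c, 1, dims=[-2, -1]),    # 90-degree rotation
]

def pretext_loss(backbone, head, clip):
    """Apply a random transformation to a (B, C, T, H, W) clip and train the
    network to classify which one was applied."""
    idx = random.randrange(len(TRANSFORMS))
    logits = head(backbone(TRANSFORMS[idx](clip)))
    target = torch.full((clip.size(0),), idx, dtype=torch.long, device=clip.device)
    return F.cross_entropy(logits, target)
```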
- Complex Human Action Recognition in Live Videos Using Hybrid FR-DL Method [1.027974860479791]
We address challenges in the preprocessing phase through automated selection of representative frames from the input sequences.
We propose a hybrid technique using background subtraction and HOG, followed by a deep neural network and a skeletal modelling method.
We name our model the Feature Reduction & Deep Learning based action recognition method, or FR-DL for short.
arXiv Detail & Related papers (2020-07-06T15:12:50Z)
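A preprocessing front end in the spirit of FR-DL might look like the following sketch: background subtraction flags frames with enough motion, and a HOG descriptor is computed for each kept frame before a deep network consumes the features. Thresholds, sizes, and the selection rule are assumptions.

```python
import cv2
import numpy as np

def select_representative_frames(video_path, motion_thresh=0.02, max_frames=16):
    """Keep frames with enough foreground motion and compute a HOG descriptor
    for each; a deep network would then consume these features."""
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2()
    hog = cv2.HOGDescriptor()                     # default 64x128 window
    features = []
    while len(features) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        if (mask > 0).mean() > motion_thresh:     # enough moving pixels?
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            features.append(hog.compute(cv2.resize(gray, (64, 128))).ravel())
    cap.release()
    return np.stack(features) if features else np.empty((0,))
```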
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.