Video Action Recognition Collaborative Learning with Dynamics via
PSO-ConvNet Transformer
- URL: http://arxiv.org/abs/2302.09187v3
- Date: Thu, 21 Sep 2023 08:05:15 GMT
- Title: Video Action Recognition Collaborative Learning with Dynamics via
PSO-ConvNet Transformer
- Authors: Nguyen Huu Phong, Bernardete Ribeiro
- Abstract summary: We propose a novel PSO-ConvNet model for learning actions in videos.
Our experimental results on the UCF-101 dataset demonstrate substantial improvements of up to 9% in accuracy.
Overall, our dynamic PSO-ConvNet model provides a promising direction for improving Human Action Recognition.
- Score: 1.876462046907555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recognizing human actions in video sequences, known as Human Action
Recognition (HAR), is a challenging task in pattern recognition. While
Convolutional Neural Networks (ConvNets) have shown remarkable success in image
recognition, they are not always directly applicable to HAR, as temporal
features are critical for accurate classification. In this paper, we propose a
novel dynamic PSO-ConvNet model for learning actions in videos, building on our
recent work in image recognition. Our approach leverages a framework where the
weight vector of each neural network represents the position of a particle in
phase space, and particles share their current weight vectors and gradient
estimates of the loss function. To extend our approach to video, we integrate
ConvNets with state-of-the-art temporal methods such as Transformer and
Recurrent Neural Networks. Our experimental results on the UCF-101 dataset
demonstrate substantial improvements of up to 9% in accuracy, which confirms
the effectiveness of our proposed method. In addition, we conducted experiments
on larger and more varied datasets, including Kinetics-400 and HMDB-51, and
found Collaborative Learning preferable to Non-Collaborative Learning
(Individual Learning). Overall, our dynamic
PSO-ConvNet model provides a promising direction for improving HAR by better
capturing the spatio-temporal dynamics of human actions in videos. The code is
available at
https://github.com/leonlha/Video-Action-Recognition-Collaborative-Learning-with-Dynamics-via-PSO-ConvNet-Transformer.
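The particle analogy above translates into a simple update rule: each network is a particle whose position is its weight vector, and every step blends the particle's own gradient with the positions and gradient estimates shared by the rest of the swarm. The following is a minimal PyTorch sketch of that reading, not the authors' released implementation; the coefficients and the all-to-all sharing topology are assumptions.

```python
import torch

def collaborative_step(models, losses, lr=1e-3, c_grad=0.1, c_pos=0.01):
    """One illustrative 'dynamic PSO' step. Each network (particle) first
    computes its own gradient; particles then share positions (weights) and
    gradient estimates, and every particle is nudged toward the swarm.
    Coefficients are hypothetical, not taken from the paper."""
    grads, positions = [], []
    for model, loss in zip(models, losses):
        model.zero_grad()
        loss.backward()
        grads.append([p.grad.detach().clone() for p in model.parameters()])
        positions.append([p.detach().clone() for p in model.parameters()])

    n = len(models)
    with torch.no_grad():
        for i, model in enumerate(models):
            for k, p in enumerate(model.parameters()):
                shared_grad = sum(grads[j][k] for j in range(n) if j != i) / (n - 1)
                shared_pos = sum(positions[j][k] for j in range(n) if j != i) / (n - 1)
                p -= lr * (grads[i][k] + c_grad * shared_grad)  # own + shared gradient
                p += c_pos * (shared_pos - p)                   # attraction toward the swarm
```

In the paper the ConvNet backbones are paired with Transformer or RNN heads for temporal modelling; a collaborative update of this kind is agnostic to that choice.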
Related papers
- An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video [11.293897932762809]
Action recognition, an essential component of computer vision, plays a pivotal role in multiple applications.
CNNs suffer performance declines when trained with discontinuous video frames, which is a frequent scenario in real-world settings.
To overcome this issue, we introduce the 4A pipeline, which employs a series of sophisticated techniques.
arXiv Detail & Related papers (2024-04-10T04:59:51Z)
- ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos [4.736059095502584]
This work proposes a novel approach using Cross-Architecture Pseudo-Labeling with contrastive learning for semi-supervised action recognition.
We introduce a cross-architecture approach where 3D Convolutional Neural Networks (3D CNNs) and video transformers (ViTs) are utilised to capture different aspects of action representations.
arXiv Detail & Related papers (2024-04-09T12:09:56Z)
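As a rough sketch of the cross-architecture pseudo-labelling above, each backbone can supervise the other on unlabelled clips, keeping only confident predictions; the threshold and the symmetric form below are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def cross_pseudo_label_loss(cnn3d, vit, clips, threshold=0.9):
    """Illustrative cross-architecture pseudo-labelling: each backbone labels
    the unlabelled clips for the *other* backbone, keeping only predictions
    above a confidence threshold."""
    with torch.no_grad():
        conf_c, labels_c = F.softmax(cnn3d(clips), dim=1).max(dim=1)
        conf_v, labels_v = F.softmax(vit(clips), dim=1).max(dim=1)

    loss = clips.new_zeros(())          # scalar accumulator on the right device
    keep_v = conf_c > threshold         # CNN-confident clips supervise the ViT
    if keep_v.any():
        loss = loss + F.cross_entropy(vit(clips[keep_v]), labels_c[keep_v])
    keep_c = conf_v > threshold         # ViT-confident clips supervise the CNN
    if keep_c.any():
        loss = loss + F.cross_entropy(cnn3d(clips[keep_c]), labels_v[keep_c])
    return loss
```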
- Deep Learning Approaches for Human Action Recognition in Video Data [0.8080830346931087]
This study conducts an in-depth analysis of various deep learning models to address this challenge.
We focus on Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Two-Stream ConvNets.
The results of this study underscore the potential of composite models in achieving robust human action recognition.
arXiv Detail & Related papers (2024-03-11T15:31:25Z)
- Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training open a new route to visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
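A minimal sketch of the frozen-backbone idea above, assuming a pretrained image encoder (e.g. CLIP's visual tower) that maps a frame to a feature vector: the encoder stays frozen and only a light temporal head is trained. The head design is illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FrozenBackboneVideoHead(nn.Module):
    """Frozen image encoder + small trainable temporal head (illustrative).
    feat_dim must be divisible by nhead."""
    def __init__(self, image_encoder, feat_dim, num_classes):
        super().__init__()
        self.encoder = image_encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)       # frozen, per the paper's premise
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames):            # frames: (B, T, C, H, W)
        b, t = frames.shape[:2]
        with torch.no_grad():
            feats = self.encoder(frames.flatten(0, 1))  # (B*T, D) assumed
        feats = feats.view(b, t, -1)
        return self.classifier(self.temporal(feats).mean(dim=1))
```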
- PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is an unsupervised SSL framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
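Since PreViTS trains a MoCo encoder, a minimal sketch of the momentum update and InfoNCE loss is given below; the tracking-based choice of positive regions that PreViTS adds is omitted, and queue bookkeeping is simplified.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """EMA update of the key encoder, as in MoCo."""
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.mul_(m).add_(pq, alpha=1 - m)

def info_nce_loss(q, k, queue, temperature=0.07):
    """InfoNCE over a queue of negatives; q, k are (N, D) embeddings of two
    views of the same clip, queue is (K, D) and assumed L2-normalised."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)               # (N, 1) positives
    l_neg = q @ queue.t()                                   # (N, K) negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)                  # positives at index 0
```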
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate spatio-temporal kernels of dynamic scale to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters, with high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
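To make the segmented-attention idea above concrete, here is a simplified sketch in which softmax attention runs only inside non-overlapping temporal segments; STAR's actual segmented linear attention is more elaborate, so treat this purely as an illustration.

```python
import torch

def segmented_temporal_attention(x, segment_len=8):
    """Simplified segmented attention: attention is computed only inside
    non-overlapping temporal segments, so cost is linear in sequence length.
    x: (batch, time, dim); time must be a multiple of segment_len here."""
    b, t, d = x.shape
    segs = x.view(b, t // segment_len, segment_len, d)            # (B, S, L, D)
    scores = torch.einsum('bsld,bsmd->bslm', segs, segs) / d ** 0.5
    out = torch.einsum('bslm,bsmd->bsld', scores.softmax(dim=-1), segs)
    return out.reshape(b, t, d)
```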
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Self-Supervised Learning via multi-Transformation Classification for Action Recognition [10.676377556393527]
We introduce a self-supervised video representation learning method based on the multi-transformation classification to efficiently classify human actions.
The representation of the video is learned in a self-supervised manner by classifying seven different transformations.
We conducted experiments on the UCF101 and HMDB51 datasets with C3D and 3D ResNet-18 as backbone networks.
arXiv Detail & Related papers (2021-02-20T16:11:26Z)
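The pretext task above is straightforward to sketch: apply one of a fixed set of clip transformations and train the network to classify which was applied. The paper uses seven transformations; the four below are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

# Hypothetical transformation set; the paper classifies seven, the four
# below are just for illustration (the rotation assumes square frames).
TRANSFORMS = [
    lambda c: c,                                   # identity
    lambda c: torch.flip(c, dims=[-1]),            # horizontal flip
    lambda c: torch.flip(c, dims=[2]),             # temporal reversal
    lambda c: torch.rot90(c, 1, dims=[-2, -1]),    # 90-degree rotation
]

def pretext_loss(backbone, head, clip):
    """Apply a random transformation to a (B, C, T, H, W) clip and train the
    network to classify which one was applied."""
    idx = random.randrange(len(TRANSFORMS))
    logits = head(backbone(TRANSFORMS[idx](clip)))
    target = torch.full((clip.size(0),), idx, dtype=torch.long, device=clip.device)
    return F.cross_entropy(logits, target)
```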
- Complex Human Action Recognition in Live Videos Using Hybrid FR-DL Method [1.027974860479791]
We address challenges in the preprocessing phase through automated selection of representative frames from the input sequences.
We propose a hybrid technique using background subtraction and HOG, followed by a deep neural network and a skeletal modelling method.
We name our model the Feature Reduction & Deep Learning based action recognition method, or FR-DL for short.
arXiv Detail & Related papers (2020-07-06T15:12:50Z)
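A preprocessing front end in the spirit of FR-DL might look like the following sketch: background subtraction flags frames with enough motion, and a HOG descriptor is computed for each kept frame before a deep network consumes the features. Thresholds, sizes, and the selection rule are assumptions.

```python
import cv2
import numpy as np

def select_representative_frames(video_path, motion_thresh=0.02, max_frames=16):
    """Keep frames with enough foreground motion and compute a HOG descriptor
    for each; a deep network would then consume these features."""
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2()
    hog = cv2.HOGDescriptor()                     # default 64x128 window
    features = []
    while len(features) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        if (mask > 0).mean() > motion_thresh:     # enough moving pixels?
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            features.append(hog.compute(cv2.resize(gray, (64, 128))).ravel())
    cap.release()
    return np.stack(features) if features else np.empty((0,))
```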
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.