ConViViT -- A Deep Neural Network Combining Convolutions and Factorized
Self-Attention for Human Activity Recognition
- URL: http://arxiv.org/abs/2310.14416v1
- Date: Sun, 22 Oct 2023 21:13:43 GMT
- Title: ConViViT -- A Deep Neural Network Combining Convolutions and Factorized
Self-Attention for Human Activity Recognition
- Authors: Rachid Reda Dokkar, Faten Chaieb, Hassen Drira and Arezki Aberkane
- Abstract summary: We propose a novel approach that leverages the strengths of both CNNs and Transformers in a hybrid architecture for performing activity recognition using RGB videos.
Our architecture has achieved new SOTA results with 90.05%, 99.6%, and 95.09% on HMDB51, UCF101, and ETRI-Activity3D, respectively.
- Score: 3.6321891270689055
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The Transformer architecture has gained significant popularity in computer
vision tasks due to its capacity to generalize and capture long-range
dependencies. This characteristic makes it well-suited for generating
spatiotemporal tokens from videos. On the other hand, convolutions serve as the
fundamental backbone for processing images and videos, as they efficiently
aggregate information within small local neighborhoods to create spatial tokens
that describe the spatial dimension of a video. While both CNN-based
architectures and pure transformer architectures are extensively studied and
utilized by researchers, the effective combination of these two backbones has
not received comparable attention in the field of activity recognition. In this
research, we propose a novel approach that leverages the strengths of both CNNs
and Transformers in a hybrid architecture for performing activity recognition
using RGB videos. Specifically, we suggest employing a CNN network to enhance
the video representation by generating a 128-channel video that effectively
separates the human performing the activity from the background. Subsequently,
the output of the CNN module is fed into a transformer to extract
spatiotemporal tokens, which are then used for classification purposes. Our
architecture has achieved new SOTA results with 90.05%, 99.6%, and 95.09%
on HMDB51, UCF101, and ETRI-Activity3D respectively.
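The abstract describes the pipeline only at a high level: a per-frame CNN front end that produces a 128-channel representation, followed by a transformer that applies factorized self-attention to spatiotemporal tokens before classification. The PyTorch sketch below illustrates one plausible reading of that design; the layer counts, kernel sizes, token layout, head counts, and pooling choices are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a hybrid CNN + factorized
# spatio-temporal Transformer for RGB video classification. Channel width
# (128) follows the abstract; everything else is an assumption.
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """2D CNN applied per frame; maps RGB frames to a 128-channel feature map."""
    def __init__(self, out_channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, out_channels, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(),
        )

    def forward(self, x):              # x: (B, T, 3, H, W)
        b, t, c, h, w = x.shape
        f = self.net(x.flatten(0, 1))  # (B*T, 128, H/4, W/4)
        return f.view(b, t, *f.shape[1:])

class FactorizedBlock(nn.Module):
    """Factorized self-attention: spatial attention within each frame,
    then temporal attention across frames at each spatial location."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, tok):            # tok: (B, T, N, D) spatio-temporal tokens
        b, t, n, d = tok.shape
        s = tok.reshape(b * t, n, d)   # attend over the N spatial tokens of each frame
        s = s + self.spatial(self.norm1(s), self.norm1(s), self.norm1(s))[0]
        s = s.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        s = s + self.temporal(self.norm2(s), self.norm2(s), self.norm2(s))[0]
        return s.reshape(b, n, t, d).permute(0, 2, 1, 3)

class HybridActivityNet(nn.Module):
    def __init__(self, num_classes, dim=128, depth=4):
        super().__init__()
        self.stem = ConvStem(dim)
        self.blocks = nn.ModuleList([FactorizedBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):               # video: (B, T, 3, H, W)
        f = self.stem(video)                # (B, T, 128, H', W')
        tok = f.flatten(3).transpose(2, 3)  # (B, T, N, 128) tokens per frame
        for blk in self.blocks:
            tok = blk(tok)
        return self.head(tok.mean(dim=(1, 2)))  # average-pool tokens, classify

# Example usage on a dummy clip: 8 frames of 112x112 RGB.
model = HybridActivityNet(num_classes=51)       # e.g. HMDB51 has 51 classes
logits = model(torch.randn(2, 8, 3, 112, 112))  # -> (2, 51)
```

Factorizing attention into a spatial pass within each frame and a temporal pass across frames keeps the cost well below that of joint attention over all T×N tokens, which is the usual motivation for this decomposition.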
Related papers
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - Video Action Recognition Collaborative Learning with Dynamics via
PSO-ConvNet Transformer [1.876462046907555]
We propose a novel PSO-ConvNet model for learning actions in videos.
Our experimental results on the UCF-101 dataset demonstrate substantial improvements of up to 9% in accuracy.
Overall, our dynamic PSO-ConvNet model provides a promising direction for improving Human Action Recognition.
arXiv Detail & Related papers (2023-02-17T23:39:34Z) - Bridging the Gap Between Vision Transformers and Convolutional Neural
Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z) - Differentiable Frequency-based Disentanglement for Aerial Video Action
Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z) - ViT-ReT: Vision and Recurrent Transformer Neural Networks for Human
Activity Recognition in Videos [6.117917355232902]
This paper proposes and designs two transformer neural networks for human activity recognition.
A recurrent transformer (ReT) is a specialized neural network used to make predictions on sequences of data, and a vision transformer (ViT) is a transformer optimized for extracting salient features from images.
We have provided an extensive comparison of the proposed transformer neural networks with the contemporary CNN and RNN-based human activity recognition models in terms of speed and accuracy.
arXiv Detail & Related papers (2022-08-16T20:03:53Z) - Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z) - Relational Self-Attention: What's Missing in Attention for Video
Understanding [52.38780998425556]
We introduce a relational feature transform, dubbed relational self-attention (RSA).
Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts.
arXiv Detail & Related papers (2021-11-02T15:36:11Z) - Cloud based Scalable Object Recognition from Video Streams using
Orientation Fusion and Convolutional Neural Networks [11.44782606621054]
Convolutional neural networks (CNNs) have been widely used to perform intelligent visual object recognition.
CNNs still suffer from severe accuracy degradation, particularly on illumination-variant datasets.
We propose a new CNN method based on orientation fusion for visual object recognition.
arXiv Detail & Related papers (2021-06-19T07:15:15Z) - Dense Interaction Learning for Video-based Person Re-identification [75.03200492219003]
We propose a hybrid framework, Dense Interaction Learning (DenseIL), to tackle video-based person re-ID difficulties.
DenseIL contains a CNN encoder and a Dense Interaction (DI) decoder.
In our experiments, DenseIL consistently and significantly outperforms all state-of-the-art methods on multiple standard video-based re-ID datasets.
arXiv Detail & Related papers (2021-03-16T12:22:08Z)