SkelVIT: Consensus of Vision Transformers for a Lightweight
Skeleton-Based Action Recognition System
- URL: http://arxiv.org/abs/2311.08094v2
- Date: Thu, 7 Mar 2024 07:20:50 GMT
- Title: SkelVIT: Consensus of Vision Transformers for a Lightweight
Skeleton-Based Action Recognition System
- Authors: Ozge Oztimur Karadag
- Abstract summary: Skeleton-based action recognition receives the attention of many researchers as it is robust to viewpoint and illumination changes.
With the emergence of deep learning models, it has become very popular to represent the skeleton data in pseudo-image form and apply a CNN for action recognition.
Recently, attention networks, more specifically transformers, have provided promising results in various vision problems.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Skeleton-based action recognition receives the attention of many researchers
as it is robust to viewpoint and illumination changes, and its processing is
much more efficient than the processing of video frames. With the emergence of
deep learning models, it has become very popular to represent the skeleton data
in pseudo-image form and apply a CNN for action recognition. Thereafter, studies
concentrated on finding effective methods for forming pseudo-images. Recently,
attention networks, more specifically transformers, have provided promising
results in various vision problems. In this study, the effectiveness of ViT for
skeleton-based action recognition is examined, and its robustness to the
pseudo-image representation scheme is investigated. To this end, a three-level
architecture, SkelVit, is proposed, which forms a set of pseudo-images, applies
a classifier to each representation, and combines their results to find
the final action class. The performance of SkelVit is examined thoroughly via a
set of experiments. First, the sensitivity of the system to representation is
investigated by comparing it with two of the state-of-the-art pseudo-image
representation methods. Then, the classifiers of SkelVit are realized in two
experimental setups by CNNs and ViTs, and their performances are compared. In
the final experimental setup, the contribution of combining classifiers is
examined by applying the model with different numbers of classifiers.
Experimental studies reveal that the proposed system with its lightweight
representation scheme achieves better results than the state-of-the-art
methods. It is also observed that the vision transformer is less sensitive to
the initial pseudo-image representation compared to the CNN. Nevertheless, even
with the vision transformer, the recognition performance can be further
improved by the consensus of classifiers.
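
As a rough illustration of the three-level pipeline described in the abstract, the sketch below forms several pseudo-images from one skeleton sequence, scores each with its own classifier, and fuses the scores by consensus. It is a minimal NumPy sketch under assumed conventions (a frames-by-joints RGB mapping, random linear scorers standing in for the per-representation ViT classifiers, and mean-of-softmax fusion); all names and sizes are hypothetical, and the paper's actual pseudo-image scheme, classifiers, and fusion rule may differ.

```python
# Illustrative SkelVit-style pipeline (not the authors' implementation):
# skeleton sequence -> several pseudo-images -> one classifier per image ->
# consensus over the per-image class distributions.
import numpy as np

rng = np.random.default_rng(0)
T, J, C = 60, 25, 3          # frames, joints, coordinate channels (hypothetical sizes)
NUM_CLASSES = 10
NUM_VIEWS = 3                # number of pseudo-image representations in the set

def to_pseudo_image(seq, joint_order):
    """Map a skeleton sequence to a T x J RGB image.
    Rows = frames, columns = joints (in a chosen order), channels = x/y/z,
    each rescaled to [0, 1]. One common scheme, not necessarily the paper's."""
    img = seq[:, joint_order, :].astype(np.float64)
    lo = img.min(axis=(0, 1), keepdims=True)
    hi = img.max(axis=(0, 1), keepdims=True)
    return (img - lo) / (hi - lo + 1e-8)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Stand-in classifiers: one random linear scorer per pseudo-image view.
# In SkelVit these would be small ViT (or CNN) classifiers, one per view.
weights = [rng.normal(size=(T * J * C, NUM_CLASSES)) for _ in range(NUM_VIEWS)]

def classify(img, w):
    return softmax(img.reshape(1, -1) @ w)[0]

# One synthetic skeleton sequence and NUM_VIEWS different joint orderings,
# i.e. a set of pseudo-image representations of the same action.
sequence = rng.normal(size=(T, J, C))
joint_orders = [rng.permutation(J) for _ in range(NUM_VIEWS)]

per_view_probs = [
    classify(to_pseudo_image(sequence, order), w)
    for order, w in zip(joint_orders, weights)
]

# Consensus: average the per-view class distributions and take the arg-max.
consensus = np.mean(per_view_probs, axis=0)
print("predicted class:", int(consensus.argmax()))
```

Averaging the per-representation class distributions is only one simple consensus rule; majority voting over the per-view predictions would be an equally simple alternative.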
Related papers
- Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling [58.50618448027103]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
This paper explores the differences across various CLIP-trained vision backbones.
The method achieves a remarkable increase in accuracy of up to 39.1% over the best single backbone.
arXiv Detail & Related papers (2024-05-27T12:59:35Z)
- Neural Clustering based Visual Representation Learning
Clustering is one of the most classic approaches in machine learning and data analysis.
We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data.
FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives (a schematic sketch of this alternation is given after this list).
arXiv Detail & Related papers (2024-03-26T06:04:50Z)
- Visual Imitation Learning with Calibrated Contrastive Representation [44.63125396964309]
Adversarial Imitation Learning (AIL) allows the agent to reproduce expert behavior with low-dimensional states and actions.
This paper proposes a simple and effective solution by incorporating contrastive representation learning into the visual AIL framework.
arXiv Detail & Related papers (2024-01-21T04:18:30Z)
- Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances [49.631908848868505]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
We investigate the differences in CLIP performance among various neural architectures.
We propose a simple yet effective approach to combining predictions from multiple backbones, leading to a notable performance boost of up to 6.34%.
arXiv Detail & Related papers (2023-12-22T03:01:41Z)
- Foveation in the Era of Deep Learning [6.602118206533142]
We introduce an end-to-end differentiable foveated active vision architecture that leverages a graph convolutional network to process foveated images.
Our model learns to iteratively attend to regions of the image relevant for classification.
We find that our model outperforms a state-of-the-art CNN and foveated vision architectures of comparable parameters and a given pixel or computation budget.
arXiv Detail & Related papers (2023-12-03T16:48:09Z)
- SkeleTR: Towards Skeleton-based Action Recognition in the Wild [86.03082891242698]
SkeleTR is a new framework for skeleton-based action recognition.
It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions.
It then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios.
arXiv Detail & Related papers (2023-09-20T16:22:33Z)
- Human Action Recognition in Still Images Using ConViT [0.11510009152620665]
This paper proposes a new module that functions like a convolutional layer but uses a Vision Transformer (ViT).
It is shown that the proposed model, compared to a simple CNN, can extract meaningful parts of an image and suppress the misleading parts.
arXiv Detail & Related papers (2023-07-18T06:15:23Z)
- Leveraging Systematic Knowledge of 2D Transformations [6.668181653599057]
Humans have a remarkable ability to interpret images, even if the scenes in the images are rare.
This work focuses on 1) the acquisition of systematic knowledge of 2D transformations, and 2) architectural components that can leverage the learned knowledge in image classification tasks.
arXiv Detail & Related papers (2022-06-02T06:46:12Z)
- Prune and distill: similar reformatting of image information along rat visual cortex and deep neural networks [61.60177890353585]
Deep convolutional neural networks (CNNs) have been shown to provide excellent models for their functional analogue in the brain, the ventral stream in visual cortex.
Here we consider some prominent statistical patterns that are known to exist in the internal representations of either CNNs or the visual cortex.
We show that CNNs and visual cortex share a similarly tight relationship between dimensionality expansion/reduction of object representations and reformatting of image information.
arXiv Detail & Related papers (2022-05-27T08:06:40Z)
- Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity.
We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies.
Our results show that, by combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation.
arXiv Detail & Related papers (2021-06-17T19:57:33Z)
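
On a related note, the alternation described in the neural-clustering (FEC) entry above can be pictured with a k-means-like loop. The sketch below is purely schematic: the real method learns the grouping and feature-update steps inside a deep network, whereas here the sizes, the mean-based representative update, and the 0.5 mixing factor are arbitrary illustrative choices.

```python
# Schematic sketch of the alternation described in the FEC entry: group pixel
# features into clusters to obtain representatives, then update the pixel
# features using the current representatives. Not the paper's architecture.
import numpy as np

rng = np.random.default_rng(0)
H, W, D, K = 16, 16, 8, 4          # toy feature map: H x W pixels, D dims, K clusters
feats = rng.normal(size=(H * W, D))
reps = feats[rng.choice(H * W, size=K, replace=False)].copy()   # initial representatives

for _ in range(10):
    # 1) grouping: assign every pixel feature to its nearest representative
    dists = np.linalg.norm(feats[:, None, :] - reps[None, :, :], axis=-1)
    assign = dists.argmin(axis=1)
    # 2) abstraction: recompute each representative as the mean of its cluster
    for k in range(K):
        members = feats[assign == k]
        if len(members):
            reps[k] = members.mean(axis=0)
    # 3) update: pull each pixel feature toward its current representative
    feats = 0.5 * feats + 0.5 * reps[assign]

print("cluster sizes:", np.bincount(assign, minlength=K))
```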