Human Action Recognition in Still Images Using ConViT
- URL: http://arxiv.org/abs/2307.08994v3
- Date: Thu, 11 Jan 2024 11:17:55 GMT
- Title: Human Action Recognition in Still Images Using ConViT
- Authors: Seyed Rohollah Hosseyni, Sanaz Seyedin, Hasan Taheri
- Abstract summary: This paper proposes a new module that functions like a convolutional layer and uses a Vision Transformer (ViT).
It is shown that the proposed model, compared to a simple CNN, can extract meaningful parts of an image and suppress the misleading parts.
- Score: 0.11510009152620665
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding the relationship between different parts of an image is crucial
in a variety of applications, including object recognition, scene
understanding, and image classification. Despite the fact that Convolutional
Neural Networks (CNNs) have demonstrated impressive results in classifying and
detecting objects, they lack the capability to extract the relationship between
different parts of an image, which is a crucial factor in Human Action
Recognition (HAR). To address this problem, this paper proposes a new module
that functions like a convolutional layer but uses a Vision Transformer (ViT).
In the proposed model, the Vision Transformer can complement a convolutional
neural network in a variety of tasks by helping it to effectively extract the
relationship among various parts of an image. It is shown that the proposed
model, compared to a simple CNN, can extract meaningful parts of an image and
suppress the misleading parts. The proposed model has been evaluated on the
Stanford40 and PASCAL VOC 2012 action datasets and has achieved 95.5% mean
Average Precision (mAP) and 91.5% mAP results, respectively, which are
promising compared to other state-of-the-art methods.
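The abstract's core idea, a module that slides over the image like a convolutional layer but lets a Vision Transformer relate its parts, can be illustrated with a minimal sketch. This is not the paper's implementation: the patch size, embedding width, and random projection weights (`Wq`, `Wk`, `Wv`) are placeholder assumptions standing in for learned parameters, and only a single attention head is shown.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax along the given axis
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def vit_conv_block(x, patch=4, d_model=8, seed=0):
    """Hypothetical sketch of the paper's idea: treat non-overlapping
    image patches as tokens, run single-head self-attention over them,
    and return one feature vector per patch -- a feature map with
    stride == patch, analogous to a convolutional layer's output.
    Weights here are random placeholders, not trained parameters."""
    rng = np.random.default_rng(seed)
    H, W, C = x.shape
    ph, pw = H // patch, W // patch
    # patchify: (ph * pw) tokens, each flattening a patch x patch x C block
    tokens = (x[:ph * patch, :pw * patch]
              .reshape(ph, patch, pw, patch, C)
              .transpose(0, 2, 1, 3, 4)
              .reshape(ph * pw, patch * patch * C))
    d_in = tokens.shape[1]
    Wq = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
    Wk = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
    Wv = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    # each patch attends to every other patch -- the inter-part
    # relationships a plain convolution cannot express
    attn = softmax(Q @ K.T / np.sqrt(d_model))
    out = attn @ V
    return out.reshape(ph, pw, d_model), attn
```

For a 32x32x3 input with `patch=4`, the output is an 8x8 grid of 8-dimensional features, and the 64x64 attention matrix records how strongly each patch draws on every other patch, which is how such a model can emphasize meaningful parts and suppress misleading ones.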
Related papers
- Are Deep Learning Models Robust to Partial Object Occlusion in Visual Recognition Tasks? [4.9260675787714]
Image classification models, including convolutional neural networks (CNNs), perform well on a variety of classification tasks but struggle under partial occlusion.
We contribute the Image Recognition Under Occlusion (IRUO) dataset, based on the recently developed Occluded Video Instance Segmentation (OVIS) dataset (arXiv:2102.01558).
We find that modern CNN-based models show improved recognition accuracy on occluded images compared to earlier CNN-based models, and ViT-based models are more accurate than CNN-based models on occluded images.
arXiv Detail & Related papers (2024-09-16T23:21:22Z) - Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z) - Foveation in the Era of Deep Learning [6.602118206533142]
We introduce an end-to-end differentiable foveated active vision architecture that leverages a graph convolutional network to process foveated images.
Our model learns to iteratively attend to regions of the image relevant for classification.
We find that our model outperforms a state-of-the-art CNN and foveated vision architectures with comparable parameter counts under a given pixel or computation budget.
arXiv Detail & Related papers (2023-12-03T16:48:09Z) - Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z) - Two Approaches to Supervised Image Segmentation [55.616364225463066]
The present work develops comparison experiments between deep learning and multiset-neuron approaches.
The deep learning approach confirmed its potential for performing image segmentation.
The alternative multiset methodology allowed for enhanced accuracy while requiring little computational resources.
arXiv Detail & Related papers (2023-07-19T16:42:52Z) - Convolutional neural network based on sparse graph attention mechanism
for MRI super-resolution [0.34410212782758043]
Medical image super-resolution (SR) reconstruction using deep learning techniques can enhance lesion analysis and assist doctors in improving diagnostic efficiency and accuracy.
Existing deep learning-based SR methods rely on convolutional neural networks (CNNs), which inherently limit the expressive capabilities of these models.
We propose an A-network that utilizes multiple convolution operator feature extraction modules (MCO) for extracting image features.
arXiv Detail & Related papers (2023-05-29T06:14:22Z) - AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context
Processing for Representation Learning of Giga-pixel Images [53.29794593104923]
We present a novel concept of shared-context processing for whole slide histopathology images.
AMIGO uses the cellular graph within the tissue to provide a single representation for a patient.
We show that our model is strongly robust to missing information, to the extent that it achieves the same performance with as little as 20% of the data.
arXiv Detail & Related papers (2023-03-01T23:37:45Z) - Saccade Mechanisms for Image Classification, Object Detection and
Tracking [12.751552698602744]
We examine how the saccade mechanism from biological vision can be used to make deep neural networks more efficient for classification and object detection problems.
Our proposed approach is based on the ideas of attention-driven visual processing and saccades, miniature eye movements influenced by attention.
arXiv Detail & Related papers (2022-06-10T13:50:34Z) - Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method-Vision Transformer with Convolutions Architecture Search (VTCAS)
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in the low illumination indoor scene.
arXiv Detail & Related papers (2022-03-20T02:59:51Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Projected Distribution Loss for Image Enhancement [15.297569497776374]
We show that aggregating 1D-Wasserstein distances between CNN activations is more reliable than the existing approaches.
In imaging applications such as denoising, super-resolution, demosaicing, deblurring and JPEG artifact removal, the proposed learning loss outperforms the current state-of-the-art on reference-based perceptual losses.
arXiv Detail & Related papers (2020-12-16T22:13:03Z)
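The "Projected Distribution Loss" entry above rests on a simple fact: for two equal-size 1D samples, the Wasserstein-1 distance is just the mean absolute difference of their sorted values. A rough sketch, treating each channel of a flattened CNN activation map as one such 1D distribution and averaging the per-channel distances (the paper's projection and aggregation details are omitted here):

```python
import numpy as np

def projected_w1_loss(feat_a, feat_b):
    """Sketch of a per-channel 1D Wasserstein-1 loss between two
    activation maps. feat_a, feat_b: (N, C) arrays -- N spatial
    positions, C channels. For equal-size empirical distributions,
    W1 along each channel equals the mean |difference| of sorted
    values; the channel distances are averaged into one scalar."""
    a = np.sort(feat_a, axis=0)
    b = np.sort(feat_b, axis=0)
    return float(np.abs(a - b).mean())
```

Because the loss compares sorted values, it measures how the distributions of activations differ while ignoring spatial permutations within a channel, which is one intuition for why it can be more reliable than pointwise feature losses.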
This list is automatically generated from the titles and abstracts of the papers in this site.