Spatial-Temporal Attention Network for Open-Set Fine-Grained Image Recognition
- URL: http://arxiv.org/abs/2211.13940v1
- Date: Fri, 25 Nov 2022 07:46:42 GMT
- Title: Spatial-Temporal Attention Network for Open-Set Fine-Grained Image Recognition
- Authors: Jiayin Sun, Hong Wang and Qiulei Dong
- Abstract summary: We empirically find that a typical vision transformer with the spatial self-attention mechanism cannot learn accurate attention maps for distinguishing different categories of fine-grained images.
We propose a spatial-temporal attention network for learning fine-grained feature representations, called STAN.
The proposed STAN-OSFGR outperforms 9 state-of-the-art open-set recognition methods significantly in most cases.
- Score: 14.450381668547259
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Triggered by the success of transformers in various visual tasks, the spatial
self-attention mechanism has recently attracted more and more attention in the
computer vision community. However, we empirically found that a typical vision
transformer with the spatial self-attention mechanism could not learn accurate
attention maps for distinguishing different categories of fine-grained images.
To address this problem, motivated by the temporal attention mechanism in
brains, we propose a spatial-temporal attention network for learning
fine-grained feature representations, called STAN, where the features learnt by
implementing a sequence of spatial self-attention operations corresponding to
multiple moments are aggregated progressively. The proposed STAN consists of
four modules: a self-attention backbone module for learning a sequence of
features with self-attention operations, a spatial feature self-organizing
module for facilitating the model training, a spatial-temporal feature learning
module for aggregating the re-organized features via a Long Short-Term Memory
network, and a context-aware module that is implemented as the forget block of
the spatial-temporal feature learning module for preserving/forgetting the
long-term memory by utilizing contextual information. Then, we propose a
STAN-based method for open-set fine-grained recognition by integrating the
proposed STAN network with a linear classifier, called STAN-OSFGR. Extensive
experimental results on 3 fine-grained datasets and 2 coarse-grained datasets
demonstrate that the proposed STAN-OSFGR outperforms 9 state-of-the-art
open-set recognition methods significantly in most cases.
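The four-module description above is concrete enough to sketch in code. Below is a minimal PyTorch reading of it, with standard stand-ins throughout: plain transformer encoder layers play the role of the vision-transformer backbone, the spatial feature self-organizing module is reduced to mean-pooling plus a learned projection, and the context-aware module becomes an LSTM cell whose forget gate also conditions on a global context vector. Every class name, dimension, and pooling choice here is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the STAN idea: features from a sequence of spatial
# self-attention stages ("moments") are aggregated progressively by an LSTM
# whose forget gate is driven by a global context vector (the context-aware
# module), and a linear head on the final state gives the STAN-OSFGR classifier.
import torch
import torch.nn as nn


class ContextAwareLSTMCell(nn.Module):
    """LSTM cell whose forget gate additionally sees a context vector,
    approximating the paper's context-aware module (our assumption)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gates = nn.Linear(2 * dim, 3 * dim)   # input, output, candidate gates
        self.forget = nn.Linear(3 * dim, dim)      # forget gate also sees context

    def forward(self, x, h, c, context):
        i, o, g = self.gates(torch.cat([x, h], dim=-1)).chunk(3, dim=-1)
        f = torch.sigmoid(self.forget(torch.cat([x, h, context], dim=-1)))
        c = f * c + torch.sigmoid(i) * torch.tanh(g)   # preserve/forget long-term memory
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class STANSketch(nn.Module):
    """Aggregates per-moment self-attention features with the context-aware cell."""

    def __init__(self, dim: int = 256, moments: int = 4, num_classes: int = 100):
        super().__init__()
        # Stand-in for the self-attention backbone: one encoder layer per "moment";
        # the paper uses a vision transformer here.
        self.stages = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(moments)
        )
        # Stand-in for the spatial feature self-organizing module.
        self.organize = nn.Linear(dim, dim)
        self.cell = ContextAwareLSTMCell(dim)
        self.classifier = nn.Linear(dim, num_classes)  # linear head of STAN-OSFGR

    def forward(self, tokens):                         # tokens: (B, N, dim)
        B, _, dim = tokens.shape
        h = tokens.new_zeros(B, dim)
        c = tokens.new_zeros(B, dim)
        context = tokens.mean(dim=1)                   # crude global context vector
        for stage in self.stages:                      # one spatial self-attention op per moment
            tokens = stage(tokens)
            x = self.organize(tokens.mean(dim=1))      # re-organized spatial feature
            h, c = self.cell(x, h, c, context)         # progressive temporal aggregation
        return self.classifier(h)                      # logits over the known classes
```

At test time, open-set methods typically reject an input as "unknown" when its top softmax score or logit falls below a threshold tuned on validation data; the abstract does not state STAN-OSFGR's exact rejection rule, so that detail is deliberately left out of the sketch.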
Related papers
- Multi-Scale Spatial-Temporal Self-Attention Graph Convolutional Networks for Skeleton-based Action Recognition [0.0]
In this paper, we propose a self-attention GCN hybrid model, Multi-Scale Spatial-Temporal self-attention (MSST)-GCN.
We utilize a spatial self-attention module with adaptive topology to understand intra-frame interactions among different body parts, and a temporal self-attention module to examine correlations between frames of a node.
arXiv Detail & Related papers (2024-04-03T10:25:45Z)
- Multi-Scale Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition [13.15374205970988]
We present a multi-scale spatial graph convolution (MS-GC) module and a multi-scale temporal graph convolution (MT-GC) module.
The MS-GC and MT-GC modules decompose the corresponding local graph convolution into a set of sub-graph convolutions, forming a hierarchical residual architecture.
We propose a multi-scale spatial temporal graph convolutional network (MST-GCN), which stacks multiple blocks to learn effective motion representations for action recognition.
arXiv Detail & Related papers (2022-06-27T03:17:33Z)
- A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z)
- Self-Attention Neural Bag-of-Features [103.70855797025689]
We build on the recently introduced 2D-Attention and reformulate the attention learning methodology.
We propose a joint feature-temporal attention mechanism that learns a joint 2D attention mask highlighting relevant information.
arXiv Detail & Related papers (2022-01-26T17:54:14Z)
- TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z)
- HAN: An Efficient Hierarchical Self-Attention Network for Skeleton-Based Gesture Recognition [73.64451471862613]
We propose an efficient hierarchical self-attention network (HAN) for skeleton-based gesture recognition.
The joint self-attention module is used to capture spatial features of fingers, while the finger self-attention module is designed to aggregate features of the whole hand.
Experiments show that our method achieves competitive results on three gesture recognition datasets with much lower computational complexity.
arXiv Detail & Related papers (2021-06-25T02:15:53Z)
- Spatio-Temporal Analysis of Facial Actions using Lifecycle-Aware Capsule Networks [12.552355581481994]
AULA-Caps learns across contiguous frames by focusing on relevant spatio-temporal segments in the sequence.
The learnt feature capsules are routed together such that the model learns to selectively focus on spatial or temporal information depending upon the AU lifecycle.
The proposed model is evaluated on the commonly used BP4D and GFT benchmark datasets.
arXiv Detail & Related papers (2020-11-17T18:36:38Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies of such features and the spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
- Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition [141.24314054768922]
We propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem.
To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks.
arXiv Detail & Related papers (2020-02-08T15:38:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.