IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action
Recognition
- URL: http://arxiv.org/abs/2110.13385v1
- Date: Tue, 26 Oct 2021 03:24:22 GMT
- Title: IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action
Recognition
- Authors: Qingtian Wang, Jianlin Peng, Shuze Shi, Tingxi Liu, Jiabin He,
Renliang Weng
- Abstract summary: We propose a novel Transformer-based network (IIP-Transformer) for skeleton-based action recognition tasks.
Instead of exploiting interactions among individual joints, our IIP-Transformer incorporates body joints and parts interactions simultaneously.
The proposed IIP-Transformer achieves state-of-the-art performance with more than 8x lower computational complexity than DSTA-Net.
- Score: 0.5953569982292298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Transformer-based networks have shown great promise on
skeleton-based action recognition tasks. The ability to capture both global and
local dependencies is key to their success, but it also incurs quadratic
computation and memory costs. Another problem is that previous studies mainly
focus on relationships among individual joints, which makes them sensitive to
noisy skeleton joints caused by sensor noise or inaccurate pose estimation. To
address these issues, we propose a novel
Transformer-based network (IIP-Transformer). Instead of exploiting interactions
among individual joints, our IIP-Transformer incorporates body joints and parts
interactions simultaneously and thus can capture both joint-level (intra-part)
and part-level (inter-part) dependencies efficiently and effectively. From the
data aspect, we introduce a part-level skeleton data encoding that
significantly reduces the computational complexity and is more robust to
joint-level skeleton noise. Besides, a new part-level data augmentation is
proposed to improve the performance of the model. On two large-scale datasets,
NTU RGB+D 60 and NTU RGB+D 120, the proposed IIP-Transformer achieves
state-of-the-art performance with more than 8x lower computational complexity
than DSTA-Net, the previous state-of-the-art Transformer-based method.
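To make the abstract concrete: grouping joints into parts before attention means self-attention runs over a few joints within each part plus a few part-level tokens, rather than over all joints at once, which is where the claimed complexity reduction comes from. Below is a minimal PyTorch sketch of that idea; the five-part grouping of the 25 NTU RGB+D joints, the module names, and all dimensions are illustrative assumptions rather than the authors' implementation, and temporal modeling is omitted for brevity.

```python
# A minimal sketch of part-level encoding plus intra-/inter-part attention.
# The 5-part grouping, module names, and dimensions are illustrative
# assumptions; this is not the authors' released implementation, and the
# temporal dimension is omitted for brevity.
import torch
import torch.nn as nn

# Hypothetical grouping of the 25 NTU RGB+D joints into 5 body parts.
PARTS = [
    [0, 1, 2, 3, 20],      # torso
    [4, 5, 6, 7, 21],      # left arm
    [8, 9, 10, 11, 23],    # right arm
    [12, 13, 14, 15, 22],  # left leg
    [16, 17, 18, 19, 24],  # right leg
]

class IntraInterPartBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, parts, joints_per_part, dim)
        b, p, j, d = x.shape
        # Intra-part: attention among the joints of each part (j tokens).
        h = x.reshape(b * p, j, d)
        n = self.norm1(h)
        h = (h + self.intra(n, n, n)[0]).reshape(b, p, j, d)
        # Inter-part: pool each part to one token, attend across parts (p tokens).
        tokens = self.norm2(h.mean(dim=2))               # (b, p, d)
        ctx = self.inter(tokens, tokens, tokens)[0]      # (b, p, d)
        # Broadcast the part-level context back to every joint in the part.
        return h + ctx.unsqueeze(2)

joints = torch.randn(8, 25, 64)                          # (batch, joints, dim)
parts = torch.stack([joints[:, idx] for idx in PARTS], dim=1)  # (8, 5, 5, 64)
out = IntraInterPartBlock()(parts)                       # (8, 5, 5, 64)
```

In this layout attention runs over 5 tokens twice instead of 25 tokens once, and a real spatio-temporal sequence would multiply that saving across frames; the part-level data augmentation the abstract mentions would plausibly operate on the same (parts, joints_per_part) layout, e.g. by masking or perturbing whole parts.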
Related papers
- Efficient Semantic Segmentation via Lightweight Multiple-Information Interaction Network [37.84039482457571]
We propose a lightweight multiple-information interaction network for real-time semantic segmentation, called LMIINet.
It effectively combines CNNs and Transformers while reducing redundant computations and memory footprint.
With only 0.72M parameters and 11.74G FLOPs, LMIINet achieves 72.0% mIoU at 100 FPS on the Cityscapes test set and 69.94% mIoU at 160 FPS on the CamVid dataset.
arXiv Detail & Related papers (2024-10-03T05:45:24Z)
- Unifying Dimensions: A Linear Adaptive Approach to Lightweight Image Super-Resolution [6.857919231112562]
Window-based transformers have demonstrated outstanding performance in super-resolution tasks.
However, they exhibit higher computational complexity and inference latency than convolutional neural networks.
We construct a convolution-based Transformer framework named the linear adaptive mixer network (LAMNet).
arXiv Detail & Related papers (2024-09-26T07:24:09Z)
- ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions.
Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks.
We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z)
- Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: the Context-Sharing Transformer (CST) and the Semantic Gathering-Scattering Transformer (SGST).
CST learns globally shared contextual information within image frames with lightweight computation; SGST models the semantic correlation separately for the foreground and the background.
Compared with a baseline that uses vanilla Transformers for multi-stage fusion, ours increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z)
- Efficient and Accurate Skeleton-Based Two-Person Interaction Recognition Using Inter- and Intra-body Graphs [7.563146292108742]
We propose a lightweight model for accurately recognizing two-person interactions.
In addition to the architecture, which incorporates middle fusion, we introduce a factorized convolution technique to reduce the number of weight parameters (a sketch of such a factorization follows this entry).
We also introduce a network stream that accounts for relative distance changes between inter-body joints to improve accuracy.
arXiv Detail & Related papers (2022-07-26T04:28:40Z)
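A factorized convolution like the one the entry above mentions is a standard way to cut weight parameters; whether that paper uses exactly this depthwise-separable form is an assumption. A minimal sketch:

```python
# Factorizing a dense 3x3 convolution into depthwise + pointwise stages.
# Whether the paper above uses exactly this form is an assumption.
import torch
import torch.nn as nn

dense = nn.Conv2d(64, 64, kernel_size=3, padding=1)           # 64*64*3*3 weights

factorized = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),   # depthwise: 64*3*3
    nn.Conv2d(64, 64, kernel_size=1),                         # pointwise: 64*64
)

x = torch.randn(1, 64, 25, 300)        # e.g. (batch, channels, joints, frames)
assert dense(x).shape == factorized(x).shape

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(factorized))  # 36928 vs 4800 parameters
```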
- Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection [77.50110439560152]
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF).
We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results.
In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
arXiv Detail & Related papers (2022-07-14T01:45:03Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution.
nnFormer achieves substantial improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- Solving Mixed Integer Programs Using Neural Networks [57.683491412480635]
This paper applies learning to two key sub-tasks of a MIP solver: generating a high-quality joint variable assignment, and bounding the gap in objective value between that assignment and an optimal one.
Our approach constructs two corresponding neural network-based components, Neural Diving and Neural Branching, to use in a base MIP solver such as SCIP.
We evaluate our approach on six diverse real-world datasets, including two Google production datasets and MIPLIB, by training separate neural networks on each.
arXiv Detail & Related papers (2020-12-23T09:33:11Z)
- Spatial Temporal Transformer Network for Skeleton-based Action Recognition [12.117737635879037]
We propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints.
In our ST-TR model, a Spatial Self-Attention module (SSA) is used to understand intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) to model inter-frame correlations.
The two are combined in a two-stream network which outperforms state-of-the-art models using the same input data (a sketch of the spatial/temporal split follows this entry).
arXiv Detail & Related papers (2020-12-11T14:58:21Z)
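The SSA/TSA split in the entry above amounts to running the same attention operation along different axes of the skeleton sequence: over joints within each frame, or over frames for each joint. A rough sketch with illustrative shapes (ST-TR keeps SSA and TSA in separate streams; sharing one module here is a simplification):

```python
# Spatial vs. temporal self-attention over a skeleton sequence.
# Shapes are illustrative; ST-TR keeps SSA and TSA in separate streams.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(8, 300, 25, 64)          # (batch, frames, joints, dim)
b, t, j, d = x.shape

# SSA: fold frames into the batch, attend across the 25 joints of a frame.
s = x.reshape(b * t, j, d)
ssa = attn(s, s, s)[0].reshape(b, t, j, d)

# TSA: fold joints into the batch, attend across the 300 frames of a joint.
v = x.transpose(1, 2).reshape(b * j, t, d)
tsa = attn(v, v, v)[0].reshape(b, j, t, d).transpose(1, 2)
```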
- Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [75.69896269357005]
Mixup is a data augmentation technique that linearly interpolates pairs of input examples and their corresponding labels.
In this paper, we explore how to apply mixup to natural language processing tasks.
We incorporate mixup into a transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks (a sketch of the interpolation follows this entry).
arXiv Detail & Related papers (2020-10-05T23:37:30Z)
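Mixup's interpolation is simple enough to state in a few lines. The sketch below mixes raw feature vectors; a "mixup-transformer" would presumably mix hidden representations inside the model, a detail the summary does not pin down.

```python
# Minimal mixup: blend each example (and its soft label) with a shuffled
# partner. Mixing raw features here is a simplification; mixup-transformer
# presumably mixes hidden representations instead.
import torch

def mixup(x, y, alpha=0.2):
    """x' = lam * x + (1 - lam) * x[perm], same blend applied to labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

x = torch.randn(16, 128)                 # e.g. pooled sentence features
y = torch.nn.functional.one_hot(torch.randint(0, 4, (16,)), 4).float()
x_mix, y_mix = mixup(x, y)
```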
- Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks [12.06555892772049]
We propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator.
The proposed ST-TR achieves state-of-the-art performance on all datasets when using joints' coordinates as input, and performs on par with the state of the art when bone information is added.
arXiv Detail & Related papers (2020-08-17T15:25:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.