A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human
Interaction Recognition
- URL: http://arxiv.org/abs/2401.00409v1
- Date: Sun, 31 Dec 2023 06:46:46 GMT
- Title: A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human
Interaction Recognition
- Authors: Ruoqi Yin, Jianqin Yin
- Abstract summary: We propose a Two-stream Hybrid CNN-Transformer Network (THCT-Net).
It exploits the local specificity of CNN and models global dependencies through the Transformer.
We show that the proposed method can better comprehend and infer the meaning and context of various actions, outperforming state-of-the-art methods.
- Score: 6.490564374810672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human Interaction Recognition is the process of identifying interactive
actions between multiple participants in a specific situation. The aim is to
recognise the action interactions between multiple entities and their meaning.
A single Convolutional Neural Network often has issues, such as the inability
to capture global instance-interaction features or difficulty in training,
leading to ambiguity in action semantics. In addition, the computational
complexity of the Transformer cannot be ignored, and its ability to capture
local information and motion features in the image is limited. In this work,
we propose a Two-stream Hybrid CNN-Transformer Network (THCT-Net), which
exploits the local specificity of the CNN and models global dependencies
through the Transformer; the two streams model the entity, temporal, and
spatial relationships between interacting entities in parallel. Specifically,
the Transformer-based stream integrates 3D convolutions with multi-head
self-attention to learn inter-token correlations, while the CNN-based stream
uses a new multi-branch framework that automatically learns joint
spatio-temporal features from skeleton sequences: its convolutional layers
independently learn the local features of each joint's neighborhood and then
aggregate the features of all joints. The raw skeleton coordinates and their
temporal differences are combined in a dual-branch paradigm to fuse the motion
features of the skeleton, and a residual structure is added to speed up
training convergence. Finally, the recognition results of the two streams are
fused by parallel splicing (concatenation). Experimental results on diverse
and challenging datasets demonstrate that the proposed method can better
comprehend and infer the meaning and context of various actions, outperforming
state-of-the-art methods.
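For concreteness, here is a minimal PyTorch sketch of the pipeline the abstract describes: a dual-branch CNN stream that fuses raw joint coordinates with their temporal differences and adds a residual shortcut, a Transformer stream that tokenizes the skeleton volume with a 3D convolution before multi-head self-attention, and a final fusion of the two streams by parallel splicing (concatenation). The paper's code is not reproduced here, so every module name, kernel size, and dimension below (THCTNetSketch, 64 channels, 25 joints, 2 persons) is an illustrative assumption rather than the authors' implementation.

import torch
import torch.nn as nn

class CNNStream(nn.Module):
    """Dual-branch CNN stream: raw coordinates + temporal differences (hypothetical sizes)."""
    def __init__(self, in_channels=3, hidden=64):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(in_channels, hidden, kernel_size=(3, 1), padding=(1, 0)),  # per-joint temporal conv
                nn.BatchNorm2d(hidden), nn.ReLU(),
                nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),  # local joint-neighborhood conv
                nn.BatchNorm2d(hidden), nn.ReLU(),
            )
        self.pos_branch = branch()     # raw skeleton coordinates
        self.motion_branch = branch()  # frame-to-frame coordinate differences
        self.res_proj = nn.Conv2d(in_channels, hidden, kernel_size=1)  # residual shortcut

    def forward(self, x):  # x: (B, C, T, V) = batch, coords, frames, joints
        motion = torch.zeros_like(x)
        motion[:, :, 1:] = x[:, :, 1:] - x[:, :, :-1]  # temporal difference of the skeleton
        feat = self.pos_branch(x) + self.motion_branch(motion) + self.res_proj(x)
        return feat.mean(dim=(2, 3))  # aggregate the features of all joints and frames

class TransformerStream(nn.Module):
    """3D-conv tokenizer followed by multi-head self-attention over the tokens."""
    def __init__(self, in_channels=3, dim=64, heads=4, layers=2):
        super().__init__()
        self.tokenizer = nn.Conv3d(in_channels, dim, kernel_size=(2, 3, 3),
                                   stride=(1, 2, 2), padding=(0, 1, 1))
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)

    def forward(self, x):  # x: (B, C, M, T, V) = batch, coords, persons, frames, joints
        tok = self.tokenizer(x).flatten(2).transpose(1, 2)  # (B, tokens, dim)
        return self.encoder(tok).mean(dim=1)  # inter-token correlations -> global summary

class THCTNetSketch(nn.Module):
    def __init__(self, num_classes=26, dim=64):
        super().__init__()
        self.cnn = CNNStream(hidden=dim)
        self.trans = TransformerStream(dim=dim)
        self.head = nn.Linear(2 * dim, num_classes)  # parallel splicing = concatenation

    def forward(self, x):  # x: (B, C, T, V, M), M interacting persons
        B, C, T, V, M = x.shape
        per_person = x.permute(0, 4, 1, 2, 3).reshape(B * M, C, T, V)
        cnn_feat = self.cnn(per_person).reshape(B, M, -1).mean(dim=1)   # pool over persons
        trans_feat = self.trans(x.permute(0, 1, 4, 2, 3))               # persons as a 3D axis
        return self.head(torch.cat([cnn_feat, trans_feat], dim=-1))

x = torch.randn(4, 3, 32, 25, 2)  # batch of 4: xyz coords, 32 frames, 25 joints, 2 persons
print(THCTNetSketch(num_classes=26)(x).shape)  # torch.Size([4, 26])

Parallel splicing is read here as feature concatenation ahead of a shared linear classifier; the authors' actual fusion, network depths, and hyperparameters may differ.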
Related papers
- Interaction-Guided Two-Branch Image Dehazing Network [1.26404863283601]
Image dehazing aims to restore clean images from hazy ones.
CNNs and Transformers have demonstrated exceptional performance in local and global feature extraction.
We propose a novel dual-branch image dehazing framework that guides CNN and Transformer components interactively.
arXiv Detail & Related papers (2024-10-14T03:21:56Z)
- Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition [45.0131792009999]
We propose a point cloud-based network named Two-stream Multi-level Dynamic Point Transformer for two-person interaction recognition.
Our model addresses the challenge of recognizing two-person interactions by incorporating local-region spatial information, appearance information, and motion information.
Our network outperforms state-of-the-art approaches in most standard evaluation settings.
arXiv Detail & Related papers (2023-07-22T03:51:32Z)
- Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition [8.513434732050749]
We propose an Interactive Spatiotemporal Token Attention Network (ISTA-Net), which simultaneously models spatial, temporal, and interactive relations.
Our network contains a tokenizer to partition Interactive Spatiotemporal Tokens (ISTs), a unified way to represent the motions of multiple diverse entities.
To jointly learn along three dimensions in ISTs, multi-head self-attention blocks integrated with 3D convolutions are designed to capture inter-token correlations.
arXiv Detail & Related papers (2023-07-14T16:51:25Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Skeleton-based Action Recognition through Contrasting Two-Stream Spatial-Temporal Networks [11.66009967197084]
We propose a novel Contrastive GCN-Transformer Network (ConGT) which fuses the spatial and temporal modules in a parallel way.
We conduct experiments on three benchmark datasets, which demonstrate that our model achieves state-of-the-art performance in action recognition.
arXiv Detail & Related papers (2023-01-27T02:12:08Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by a CNN with the global context provided by a Transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Dense Interaction Learning for Video-based Person Re-identification [75.03200492219003]
We propose a hybrid framework, Dense Interaction Learning (DenseIL), to tackle video-based person re-ID difficulties.
DenseIL contains a CNN encoder and a Dense Interaction (DI) decoder.
In our experiments, DenseIL consistently and significantly outperforms all state-of-the-art methods on multiple standard video-based re-ID datasets.
arXiv Detail & Related papers (2021-03-16T12:22:08Z)
- Learning Asynchronous and Sparse Human-Object Interaction in Videos [56.73059840294019]
Asynchronous-Sparse Interaction Graph Networks (ASSIGN) is able to automatically detect the structure of interaction events associated with entities in a video scene.
ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling human sub-activities and object affordances from raw videos.
arXiv Detail & Related papers (2021-03-03T23:43:55Z)
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.