Skeleton-based Action Recognition through Contrasting Two-Stream
Spatial-Temporal Networks
- URL: http://arxiv.org/abs/2301.11495v1
- Date: Fri, 27 Jan 2023 02:12:08 GMT
- Title: Skeleton-based Action Recognition through Contrasting Two-Stream
Spatial-Temporal Networks
- Authors: Chen Pang, Xuequan Lu, Lei Lyu
- Abstract summary: We propose a novel Contrastive GCN-Transformer Network (ConGT) which fuses the spatial and temporal modules in a parallel way.
We conduct experiments on three benchmark datasets, which demonstrate that our model achieves state-of-the-art performance in action recognition.
- Score: 11.66009967197084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For pursuing accurate skeleton-based action recognition, most prior methods
use the strategy of combining Graph Convolution Networks (GCNs) with
attention-based methods in a serial way. However, they regard the human
skeleton as a complete graph, resulting in fewer variations between different
actions (e.g., the connection between the elbow and the head in the action
"clapping hands"). To address this, we propose a novel Contrastive GCN-Transformer Network
(ConGT) which fuses the spatial and temporal modules in a parallel way. The
ConGT involves two parallel streams: Spatial-Temporal Graph Convolution stream
(STG) and Spatial-Temporal Transformer stream (STT). The STG is designed to
obtain action representations maintaining the natural topology structure of the
human skeleton. The STT is devised to acquire action representations containing
the global relationships among joints. Since the action representations
produced by these two streams have different characteristics, and each carries
little information about the other, we introduce a contrastive learning
paradigm that guides their representations of the same sample to be as close
as possible in a self-supervised manner. Through contrastive learning, the two
streams learn from each other, enriching the action features by maximizing the
mutual information between the two types of action
representations. To further improve action recognition accuracy, we introduce
the Cyclical Focal Loss (CFL), which focuses on confident training samples in
the early training epochs and shifts its focus to hard samples during the
middle epochs. We conduct experiments on three benchmark datasets, which
demonstrate that our model achieves state-of-the-art performance in action
recognition.
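The mutual-information objective described in the abstract can be illustrated with a minimal InfoNCE-style sketch. This is an assumption about the loss family, not the paper's exact formulation; the names z_stg and z_stt and the temperature value are illustrative:

```python
import numpy as np

def info_nce_loss(z_stg, z_stt, temperature=0.1):
    """Symmetric InfoNCE-style contrastive loss between the two streams'
    embeddings of the same batch; row i of each matrix is one sample."""
    # L2-normalize so dot products are cosine similarities.
    z1 = z_stg / np.linalg.norm(z_stg, axis=1, keepdims=True)
    z2 = z_stt / np.linalg.norm(z_stt, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (N, N) similarity matrix
    n = logits.shape[0]
    # Matching rows (the same sample seen by both streams) are positives.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_stg_to_stt = -np.mean(log_prob[np.arange(n), np.arange(n)])
    # Symmetrize: also contrast the STT embeddings against the STG ones.
    log_prob_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_stt_to_stg = -np.mean(log_prob_t[np.arange(n), np.arange(n)])
    return 0.5 * (loss_stg_to_stt + loss_stt_to_stg)
```

Pulling the two streams' representations of the same sample together (the diagonal of the similarity matrix) while pushing apart mismatched pairs is one standard way to maximize a lower bound on the mutual information between the two views.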
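The cyclical focusing schedule can likewise be sketched. The following is a simplified approximation, not the authors' exact CFL formulation: a blending weight xi starts near 1 (emphasis on confident samples), reaches 0 mid-training (emphasis on hard samples), then rises again; the exponents and the cycle factor fc are illustrative defaults:

```python
import numpy as np

def cyclical_focal_loss(p_true, epoch, num_epochs,
                        gamma_hc=2.0, gamma_fl=2.0, fc=4.0):
    """Simplified cyclical focal loss on the predicted probability of the
    true class. Early epochs weight confident samples; middle epochs weight
    hard samples."""
    t = fc * epoch / num_epochs
    # Linear schedule: xi falls from 1 to 0, then climbs back toward 1.
    xi = 1.0 - t if t <= 1.0 else (t - 1.0) / (fc - 1.0)
    xi = float(np.clip(xi, 0.0, 1.0))
    # Confident-sample term: up-weights easy examples (large p_true).
    l_hc = -((1.0 + p_true) ** gamma_hc) * np.log(p_true)
    # Focal term: up-weights hard examples (small p_true).
    l_fl = -((1.0 - p_true) ** gamma_fl) * np.log(p_true)
    return xi * l_hc + (1.0 - xi) * l_fl
```

At epoch 0 the loss reduces to the confident-sample term; one quarter of the way through training (with fc=4) it reduces to the standard focal term, matching the early-confident, mid-hard behavior the abstract describes.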
Related papers
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z) - Two-person Graph Convolutional Network for Skeleton-based Human
Interaction Recognition [11.650290790796323]
Graph Convolutional Network (GCN) outperforms previous methods in the skeleton-based human action recognition area.
We introduce a novel unified two-person graph representing spatial interaction correlations between joints.
Experiments show accuracy improvements in both interactions and individual actions when utilizing the proposed two-person graph topology.
arXiv Detail & Related papers (2022-08-12T08:50:15Z) - COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for
Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance with single-stream methods while being 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z) - Combining the Silhouette and Skeleton Data for Gait Recognition [13.345465199699]
Two dominant gait recognition works are appearance-based and model-based, which extract features from silhouettes and skeletons, respectively.
This paper proposes a CNN-based branch taking silhouettes as input and a GCN-based branch taking skeletons as input.
For better gait representation in the GCN-based branch, we present a fully connected graph convolution operator to integrate multi-scale graph convolutions.
arXiv Detail & Related papers (2022-02-22T03:21:51Z) - Joint-bone Fusion Graph Convolutional Network for Semi-supervised
Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z) - Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based
Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two inter-leaved branches, which enables us to extract features at two spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z) - Sequential convolutional network for behavioral pattern extraction in
gait recognition [0.7874708385247353]
We propose a sequential convolutional network (SCN) to learn the walking pattern of individuals.
In SCN, behavioral information extractors (BIE) are constructed to comprehend intermediate feature maps in time series.
A multi-frame aggregator in SCN performs feature integration on a sequence whose length is uncertain, via a mobile 3D convolutional layer.
arXiv Detail & Related papers (2021-04-23T08:44:10Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
The framework, CTL, utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.