A Decoupled Spatio-Temporal Framework for Skeleton-based Action
Segmentation
- URL: http://arxiv.org/abs/2312.05830v1
- Date: Sun, 10 Dec 2023 09:11:39 GMT
- Title: A Decoupled Spatio-Temporal Framework for Skeleton-based Action
Segmentation
- Authors: Yunheng Li, Zhongyu Li, Shanghua Gao, Qilong Wang, Qibin Hou,
Ming-Ming Cheng
- Abstract summary: Existing methods are limited by weak spatio-temporal modeling capability.
We propose a Decoupled Spatio-Temporal Framework (DeST) to address these issues.
DeST significantly outperforms current state-of-the-art methods with less computational complexity.
- Score: 89.86345494602642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effectively modeling discriminative spatio-temporal information is essential
for segmenting activities in long action sequences. However, we observe that
existing methods suffer from weak spatio-temporal modeling capability due to
two forms of coupled modeling: (i) cascaded interaction couples spatial and
temporal modeling, which over-smooths motion modeling over the long sequence,
and (ii) joint-shared temporal modeling adopts shared weights to model each
joint, ignoring the distinct motion patterns of different joints. We propose a
Decoupled Spatio-Temporal Framework (DeST) to address the above issues.
Firstly, we decouple the cascaded spatio-temporal interaction to avoid stacking
multiple spatio-temporal blocks, while achieving sufficient spatio-temporal
interaction. Specifically, DeST performs unified spatial modeling once and
divides the spatial features into different groups of subfeatures, which then
adaptively interact with temporal features from different layers. Since the
different sub-features contain distinct spatial semantics, the model could
learn the optimal interaction pattern at each layer. Meanwhile, inspired by the
fact that different joints move at different speeds, we propose joint-decoupled
temporal modeling, which employs independent trainable weights to capture
distinctive temporal features of each joint. On four large-scale benchmarks of
different scenes, DeST significantly outperforms current state-of-the-art
methods with less computational complexity.
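The first idea above (a single unified spatial pass whose sub-feature groups interact with temporal features from different layers) can be sketched as follows. This is a minimal illustration under assumed shapes, with a simple additive fusion standing in for the paper's learned adaptive interaction; the function and variable names are hypothetical, not DeST's actual API:

```python
import numpy as np

def decoupled_interaction(spatial_feat, temporal_feats):
    """Split one round of spatial features into L sub-feature groups and
    fuse group l with the temporal features from layer l.

    spatial_feat:   (T, C) features from a single unified spatial pass.
    temporal_feats: list of L arrays, each (T, C // L), one per temporal layer.
    The additive fusion below is a placeholder for the paper's learned,
    adaptive interaction pattern.
    """
    L = len(temporal_feats)
    # Each group carries distinct spatial semantics, so each layer can
    # learn its own interaction pattern.
    groups = np.split(spatial_feat, L, axis=-1)
    return [g + t for g, t in zip(groups, temporal_feats)]
```

Because spatial modeling happens once and only sub-feature groups are routed to temporal layers, this avoids stacking repeated spatio-temporal blocks over the long sequence.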
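The second idea, joint-decoupled temporal modeling, gives each joint its own trainable temporal kernel instead of sharing weights across joints. A minimal NumPy sketch under assumed shapes (the real model would implement this with learned parameters, e.g. a grouped 1D convolution; names here are hypothetical):

```python
import numpy as np

def joint_decoupled_temporal_conv(x, weights):
    """Temporal convolution with an independent kernel per joint.

    x:       (T, J, C) sequence of T frames, J joints, C channels per joint.
    weights: (J, K, C, C) one temporal kernel of size K per joint, so joints
             that move at different speeds get distinct temporal filters.
    Returns: (T, J, C), with 'same' zero padding in time.
    """
    T, J, C = x.shape
    K = weights.shape[1]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0)))
    out = np.zeros_like(x)
    for j in range(J):            # independent weights per joint (decoupled)
        for t in range(T):
            window = xp[t:t + K, j, :]                     # (K, C)
            out[t, j] = np.einsum('kc,kcd->d', window, weights[j])
    return out
```

In contrast, joint-shared temporal modeling would apply the same `weights[0]` to every joint, which is exactly the limitation the abstract criticizes.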
Related papers
- Multi-Scale Spatial-Temporal Self-Attention Graph Convolutional Networks for Skeleton-based Action Recognition [0.0]
In this paper, we propose a self-attention GCN hybrid model, Multi-Scale Spatial-Temporal self-attention (MSST)-GCN.
We utilize spatial self-attention module with adaptive topology to understand intra-frame interactions within a frame among different body parts, and temporal self-attention module to examine correlations between frames of a node.
arXiv Detail & Related papers (2024-04-03T10:25:45Z) - Video-Based Human Pose Regression via Decoupled Space-Time Aggregation [0.5524804393257919]
We develop an efficient and effective video-based human pose regression method, which bypasses intermediate representations such as heatmaps and instead directly maps the input to the joint coordinates.
Our method is capable of efficiently and flexibly utilizing the spatial dependency of adjacent joints and the temporal dependency of each joint itself.
Our approach either surpasses or is on par with the state-of-the-art heatmap-based multi-frame human pose estimation methods.
arXiv Detail & Related papers (2024-03-29T02:26:22Z) - Generative Hierarchical Temporal Transformer for Hand Pose and Action Modeling [67.94143911629143]
We propose a generative Transformer VAE architecture to model hand pose and action.
To faithfully model the semantic dependency and different temporal granularity of hand pose and action, we decompose the framework into two cascaded VAE blocks.
Results show that our joint modeling of recognition and prediction improves over isolated solutions.
arXiv Detail & Related papers (2023-11-29T05:28:39Z) - Spatio-Temporal Branching for Motion Prediction using Motion Increments [55.68088298632865]
Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications.
Traditional methods rely on hand-crafted features and machine learning techniques.
We propose a novel spatio-temporal branching network using incremental information for HMP.
arXiv Detail & Related papers (2023-08-02T12:04:28Z) - Spatio-temporal Diffusion Point Processes [23.74522530140201]
Spatio-temporal point process (STPP) is a collection of events accompanied with time and space.
The failure to model the joint distribution leads to limited capacities in characterizing the spatio-temporal interactions given past events.
We propose a novel parameterization framework, which learns complex spatial-temporal joint distributions.
Our framework outperforms the state-of-the-art baselines remarkably, with an average improvement over 50%.
arXiv Detail & Related papers (2023-05-21T08:53:00Z) - Spatial Temporal Graph Attention Network for Skeleton-Based Action
Recognition [10.60209288486904]
It's common for current methods in skeleton-based action recognition to mainly consider capturing long-term temporal dependencies.
We propose a general framework, coined as STGAT, to model cross-spacetime information flow.
STGAT achieves state-of-the-art performance on three large-scale datasets.
arXiv Detail & Related papers (2022-08-18T02:34:46Z) - Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based
Motion Recognition [62.46544616232238]
Previous motion recognition methods have achieved promising performance through the tightly coupled multimodal spatiotemporal representation.
We propose to decouple and recouple the spatiotemporal representation for RGB-D-based motion recognition.
arXiv Detail & Related papers (2021-12-16T18:59:47Z) - Spatio-Temporal Joint Graph Convolutional Networks for Traffic
Forecasting [75.10017445699532]
Recent works have shifted their focus towards formulating traffic forecasting as a spatio-temporal graph modeling problem.
We propose a novel approach for accurate traffic forecasting on road networks over multiple future time steps.
arXiv Detail & Related papers (2021-11-25T08:45:14Z) - TSI: Temporal Saliency Integration for Video Action Recognition [32.18535820790586]
We propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module.
SME aims to highlight the motion-sensitive area through local-global motion modeling.
CTI is designed to perform multi-scale temporal modeling through a group of separate 1D convolutions respectively.
arXiv Detail & Related papers (2021-06-02T11:43:49Z) - Disentangling and Unifying Graph Convolutions for Skeleton-Based Action
Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions and a unified spatial-temporal graph convolutional operator named G3D.
By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
arXiv Detail & Related papers (2020-03-31T11:28:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.