Wavelet-Decoupling Contrastive Enhancement Network for Fine-Grained
Skeleton-Based Action Recognition
- URL: http://arxiv.org/abs/2402.02210v1
- Date: Sat, 3 Feb 2024 16:51:04 GMT
- Title: Wavelet-Decoupling Contrastive Enhancement Network for Fine-Grained
Skeleton-Based Action Recognition
- Authors: Haochen Chang, Jing Chen, Yilin Li, Jixiang Chen, Xiaofeng Zhang
- Abstract summary: We propose a Wavelet-Attention Decoupling (WAD) module to disentangle salient and subtle motion features in the time-frequency domain.
We also propose a Fine-grained Contrastive Enhancement (FCE) module to enhance attention towards trajectory features by contrastive learning.
- Our method performs competitively with state-of-the-art methods and discriminates confusing fine-grained actions well.
- Score: 8.743480762121937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Skeleton-based action recognition has attracted much attention, benefiting
from its succinctness and robustness. However, the minimal inter-class
variation in similar action sequences often leads to confusion. The inherent
spatiotemporal coupling characteristics make it challenging to mine the subtle
differences in joint motion trajectories, which is critical for distinguishing
confusing fine-grained actions. To alleviate this problem, we propose a
Wavelet-Attention Decoupling (WAD) module that utilizes discrete wavelet
transform to effectively disentangle salient and subtle motion features in the
time-frequency domain. Then, the decoupling attention adaptively recalibrates
their temporal responses. To further amplify the discrepancies in these subtle
motion features, we propose a Fine-grained Contrastive Enhancement (FCE) module
to enhance attention towards trajectory features by contrastive learning.
Extensive experiments are conducted on the coarse-grained dataset NTU RGB+D and
the fine-grained dataset FineGYM. Our method performs competitively with
state-of-the-art methods and discriminates confusing fine-grained actions
well.
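
The abstract does not specify the WAD module's implementation. Below is a minimal PyTorch sketch of the underlying idea, assuming a single-level Haar DWT along the temporal axis to split joint trajectories into a low-frequency (salient) band and a high-frequency (subtle) band, with a simple per-band gate standing in for the decoupling attention. All names here (WaveletDecoupling, gate_low, gate_high) are hypothetical, not the authors' code.

```python
import torch
import torch.nn as nn

class WaveletDecoupling(nn.Module):
    """Sketch: decouple salient and subtle motion via a Haar DWT,
    then recalibrate each band's temporal response with its own gate."""

    def __init__(self, channels: int):
        super().__init__()
        # One squeeze-excite-style gate per frequency band (assumed design).
        self.gate_low = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Conv1d(channels, channels, 1), nn.Sigmoid()
        )
        self.gate_high = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Conv1d(channels, channels, 1), nn.Sigmoid()
        )

    @staticmethod
    def haar_dwt(x: torch.Tensor):
        # x: (batch, channels, time); time assumed even.
        even, odd = x[..., 0::2], x[..., 1::2]
        low = (even + odd) / 2 ** 0.5   # approximation band: salient motion
        high = (even - odd) / 2 ** 0.5  # detail band: subtle motion
        return low, high

    def forward(self, x: torch.Tensor):
        low, high = self.haar_dwt(x)
        # Gates broadcast over time, rescaling each band per channel.
        return low * self.gate_low(low), high * self.gate_high(high)

# Example: 8 sequences, 64 channels (joints x coordinates), 32 frames.
x = torch.randn(8, 64, 32)
low, high = WaveletDecoupling(64)(x)
print(low.shape, high.shape)  # torch.Size([8, 64, 16]) for both bands
```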
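Likewise, the FCE module's contrastive objective is only described at a high level; a generic InfoNCE-style loss over two views of the subtle-motion embedding illustrates the mechanism it names. The function name and temperature value are assumptions, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def fine_grained_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                                  temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style sketch: pull two views of the same sequence's
    subtle-motion embedding together, push other sequences apart.
    z1, z2: (batch, dim) embeddings of two augmented views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Example: embeddings of the high-frequency (subtle) band from two views.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(fine_grained_contrastive_loss(z1, z2).item())
```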
Related papers
- FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition [57.17966905865054]
Real-life applications of action recognition often require a fine-grained understanding of subtle movements.
Existing semi-supervised action recognition has mainly focused on coarse-grained action recognition.
We propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs.
arXiv Detail & Related papers (2024-09-02T20:08:06Z)
- Deformable Feature Alignment and Refinement for Moving Infrared Dim-small Target Detection [17.765101100010224]
We propose a Deformable Feature Alignment and Refinement (DFAR) method based on deformable convolution to explicitly use motion context in both the training and inference stages.
The proposed DFAR method achieves the state-of-the-art performance on two benchmark datasets including DAUB and IRDST.
arXiv Detail & Related papers (2024-07-10T00:42:25Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions [52.63323657077447]
We propose DNMOT, an end-to-end trainable DeNoising Transformer for multiple object tracking.
Specifically, we augment trajectories with noise during training so that the model learns the denoising process in an encoder-decoder architecture.
We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2023-09-09T04:40:01Z)
- Multi-Dimensional Refinement Graph Convolutional Network with Robust Decouple Loss for Fine-Grained Skeleton-Based Action Recognition [19.031036881780107]
We propose a flexible attention block called Channel-Variable Spatial-Temporal Attention (CVSTA) to enhance the discriminative power of spatial-temporal joints.
Based on CVSTA, we construct a Multi-Dimensional Refinement Graph Convolutional Network (MDR-GCN), which can improve the discrimination among channel-, joint- and frame-level features.
Furthermore, we propose a Robust Decouple Loss (RDL), which significantly boosts the effect of the CVSTA and reduces the impact of noise.
arXiv Detail & Related papers (2023-06-27T09:23:36Z)
- Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection [23.48709176879878]
Temporal action detection aims to predict the time intervals and the classes of action instances in the video.
Existing two-stream models exhibit slow inference speed due to their reliance on computationally expensive optical flow.
We introduce a cross-modal distillation framework to build a strong RGB-based detector by transferring knowledge of the motion modality.
arXiv Detail & Related papers (2023-03-30T10:47:26Z)
- Spatiotemporal Multi-scale Bilateral Motion Network for Gait Recognition [3.1240043488226967]
Motivated by optical flow, this paper proposes bilateral motion-oriented features.
We develop a set of multi-scale temporal representations that force the motion context to be richly described at various levels of temporal resolution.
arXiv Detail & Related papers (2022-09-26T01:36:22Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- ProgressiveMotionSeg: Mutually Reinforced Framework for Event-Based Motion Segmentation [101.19290845597918]
This paper presents a Motion Estimation (ME) module and an Event Denoising (ED) module jointly optimized in a mutually reinforced manner.
Taking temporal correlation as guidance, the ED module estimates the confidence that each event belongs to real activity and passes it to the ME module, which updates the motion-segmentation energy function for noise suppression.
arXiv Detail & Related papers (2022-03-22T13:40:26Z)
- Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
arXiv Detail & Related papers (2021-09-14T02:08:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.