Masked Motion Predictors are Strong 3D Action Representation Learners
- URL: http://arxiv.org/abs/2308.07092v1
- Date: Mon, 14 Aug 2023 11:56:39 GMT
- Title: Masked Motion Predictors are Strong 3D Action Representation Learners
- Authors: Yunyao Mao, Jiajun Deng, Wengang Zhou, Yao Fang, Wanli Ouyang,
Houqiang Li
- Abstract summary: In 3D human action recognition, limited supervised data makes it challenging to fully tap into the modeling potential of powerful networks such as transformers.
We show that, instead of following the prevalent pretext task of masked self-component reconstruction of human joints, explicit contextual motion modeling is key to learning effective feature representations for 3D action recognition.
- Score: 143.9677635274393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In 3D human action recognition, limited supervised data makes it challenging
to fully tap into the modeling potential of powerful networks such as
transformers. As a result, researchers have been actively investigating
effective self-supervised pre-training strategies. In this work, we show that
instead of following the prevalent pretext task to perform masked
self-component reconstruction in human joints, explicit contextual motion
modeling is key to the success of learning effective feature representation for
3D action recognition. Formally, we propose the Masked Motion Prediction (MAMP)
framework. To be specific, the proposed MAMP takes as input the masked
spatio-temporal skeleton sequence and predicts the corresponding temporal
motion of the masked human joints. Considering the high temporal redundancy of
the skeleton sequence, in our MAMP, the motion information also acts as an
empirical semantic richness prior that guides the masking process, promoting
better attention to semantically rich temporal regions. Extensive experiments
on NTU-60, NTU-120, and PKU-MMD datasets show that the proposed MAMP
pre-training substantially improves the performance of the adopted vanilla
transformer, achieving state-of-the-art results without bells and whistles. The
source code of our MAMP is available at https://github.com/maoyunyao/MAMP.
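The two ideas in the abstract, motion targets computed as temporal differences of joint positions and a masking distribution biased toward high-motion frames, can be sketched roughly as follows. This is an illustrative sketch only: the function names, the temperature `tau`, the mask ratio, and the frame-level (rather than patch-level) masking are assumptions, not MAMP's actual implementation, which operates on embedded spatio-temporal patches with a transformer encoder.

```python
import numpy as np

def motion_targets(skeleton):
    # skeleton: (T, J, C) array of J joints with C coordinates over T frames.
    # The prediction target is the temporal motion: frame-to-frame differences.
    return skeleton[1:] - skeleton[:-1]  # shape (T-1, J, C)

def motion_guided_mask(skeleton, mask_ratio=0.9, tau=0.75, rng=None):
    # Sample frame indices to mask, biased toward frames with large motion,
    # i.e. motion magnitude acts as the "semantic richness prior".
    rng = rng or np.random.default_rng(0)
    motion = motion_targets(skeleton)
    intensity = np.abs(motion).sum(axis=(1, 2))          # per-transition magnitude, (T-1,)
    intensity = np.concatenate([intensity[:1], intensity])  # pad to length T
    probs = np.exp(intensity / tau)                      # softmax over frames
    probs /= probs.sum()
    n_mask = int(mask_ratio * len(probs))
    masked = rng.choice(len(probs), size=n_mask, replace=False, p=probs)
    mask = np.zeros(len(probs), dtype=bool)
    mask[masked] = True
    return mask  # True where the frame is masked and its motion must be predicted
```

Training would then minimize a regression loss (e.g. MSE) between the network's predicted motion and `motion_targets` at the masked positions only.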
Related papers
- Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction [88.65168366064061]
We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained MDMs as a problem of probabilistic inference.
Our framework leads to a family of three novel objectives that are all simulation-free, and thus scalable.
We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.
arXiv Detail & Related papers (2024-10-10T17:18:30Z) - MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation [44.74056930805525]
We introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G.
This model employs a mask modeling scheme specifically designed to strengthen temporal relation learning among sequence gestures.
Experimental results demonstrate that MDT-A2G excels in gesture generation, with a learning speed over 6× faster than traditional diffusion transformers.
arXiv Detail & Related papers (2024-08-06T17:29:01Z) - Past Movements-Guided Motion Representation Learning for Human Motion Prediction [0.0]
We propose a self-supervised learning framework designed to enhance motion representation.
In the first of the framework's two stages, the network is pretrained through self-reconstruction of past sequences and guided reconstruction of future sequences based on past movements.
Our method reduces the average prediction errors by 8.8% across the Human3.6M, 3DPW, and AMASS datasets.
arXiv Detail & Related papers (2024-08-04T17:00:37Z) - Three-Stream Temporal-Shift Attention Network Based on Self-Knowledge Distillation for Micro-Expression Recognition [21.675660978188617]
Micro-expression recognition is crucial in many fields, including criminal analysis and psychotherapy.
A three-stream temporal-shift attention network based on self-knowledge distillation called SKD-TSTSAN is proposed in this paper.
arXiv Detail & Related papers (2024-06-25T13:22:22Z) - SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation [62.58480650443393]
SAM-E leverages Segment Anything (SAM), a vision foundation model, for generalizable scene understanding and sequence imitation.
We develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass.
arXiv Detail & Related papers (2024-05-30T00:32:51Z) - Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z) - Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning [33.68311764817763]
We propose Prompted Contrast with Masked Motion Modeling (PCM³) for versatile 3D action representation learning.
Our method integrates the contrastive learning and masked prediction tasks in a mutually beneficial manner.
Experiments on five downstream tasks across three large-scale datasets demonstrate the superior generalization capacity of PCM³ compared to state-of-the-art works.
arXiv Detail & Related papers (2023-08-08T01:27:55Z) - Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z) - Transformer Inertial Poser: Attention-based Real-time Human Motion Reconstruction from Sparse IMUs [79.72586714047199]
We propose an attention-based deep learning method to reconstruct full-body motion from six IMU sensors in real-time.
Our method achieves new state-of-the-art results both quantitatively and qualitatively, while being simple to implement and smaller in size.
arXiv Detail & Related papers (2022-03-29T16:24:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.