MotionBERT: A Unified Perspective on Learning Human Motion
Representations
- URL: http://arxiv.org/abs/2210.06551v5
- Date: Mon, 14 Aug 2023 12:11:35 GMT
- Title: MotionBERT: A Unified Perspective on Learning Human Motion
Representations
- Authors: Wentao Zhu, Xiaoxuan Ma, Zhaoyang Liu, Libin Liu, Wayne Wu, Yizhou
Wang
- Abstract summary: We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources.
We propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations.
We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network.
- Score: 46.67364057245364
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present a unified perspective on tackling various human-centric video
tasks by learning human motion representations from large-scale and
heterogeneous data resources. Specifically, we propose a pretraining stage in
which a motion encoder is trained to recover the underlying 3D motion from
noisy partial 2D observations. The motion representations acquired in this way
incorporate geometric, kinematic, and physical knowledge about human motion,
which can be easily transferred to multiple downstream tasks. We implement the
motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer)
neural network. It is able to capture long-range spatio-temporal relationships among
the skeletal joints comprehensively and adaptively, exemplified by the lowest
3D pose estimation error so far when trained from scratch. Furthermore, our
proposed framework achieves state-of-the-art performance on all three
downstream tasks by simply finetuning the pretrained motion encoder with a
simple regression head (1-2 layers), which demonstrates the versatility of the
learned motion representations. Code and models are available at
https://motionbert.github.io/
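The pretraining-then-finetuning recipe described in the abstract can be illustrated with a short sketch. The PyTorch code below is not the released MotionBERT/DSTformer implementation: the dual-stream merge (a linear fusion of a joint-wise and a frame-wise attention stream), the random joint masking used to mimic "noisy partial 2D observations", and all layer sizes are illustrative assumptions; only the overall shape of the recipe (2D-to-3D pretraining followed by a 1-2 layer regression head) follows the text.

```python
import torch
import torch.nn as nn


class DualStreamBlock(nn.Module):
    """One transformer block with a spatial stream (attention over joints)
    and a temporal stream (attention over frames), merged by a linear layer.
    The merge scheme here is an assumption, not the paper's fusion module."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                                   # x: (B, T, J, C)
        B, T, J, C = x.shape
        s = x.reshape(B * T, J, C)                          # attend over joints
        s, _ = self.spatial_attn(s, s, s)
        s = s.reshape(B, T, J, C)
        t = x.permute(0, 2, 1, 3).reshape(B * J, T, C)      # attend over frames
        t, _ = self.temporal_attn(t, t, t)
        t = t.reshape(B, J, T, C).permute(0, 2, 1, 3)
        x = self.norm1(x + self.fuse(torch.cat([s, t], dim=-1)))
        return self.norm2(x + self.mlp(x))


class MotionEncoder(nn.Module):
    """Stack of dual-stream blocks operating on 2D keypoint sequences."""

    def __init__(self, dim=256, depth=4):
        super().__init__()
        self.embed = nn.Linear(2, dim)                      # lift (x, y) keypoints
        self.blocks = nn.ModuleList([DualStreamBlock(dim) for _ in range(depth)])
        self.to_3d = nn.Linear(dim, 3)                      # pretraining target

    def forward(self, pose_2d):                             # (B, T, J, 2)
        h = self.embed(pose_2d)
        for blk in self.blocks:
            h = blk(h)
        return h, self.to_3d(h)                             # features, recovered 3D


def pretrain_step(encoder, pose_2d, pose_3d, mask_ratio=0.3):
    """Corrupt the 2D input (random joint masking stands in for 'noisy,
    partial 2D observations') and regress the underlying 3D motion."""
    mask = torch.rand(pose_2d.shape[:-1], device=pose_2d.device) < mask_ratio
    corrupted = pose_2d.masked_fill(mask.unsqueeze(-1), 0.0)
    _, pred_3d = encoder(corrupted)
    return nn.functional.mse_loss(pred_3d, pose_3d)


class DownstreamHead(nn.Module):
    """Finetuning-style head: a single linear layer on pooled features,
    e.g. for per-clip action recognition (class count is arbitrary here)."""

    def __init__(self, dim=256, num_classes=60):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, feats):                               # feats: (B, T, J, C)
        return self.fc(feats.mean(dim=(1, 2)))
```

A typical usage pass under these assumptions: call pretrain_step on paired (2D, 3D) sequences to train the encoder, then reuse its features and attach DownstreamHead (or another shallow regression head) for the downstream task of interest.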
Related papers
- Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment [45.74813582690906]
Learning 3D human motion from 2D inputs is a fundamental task in the realms of computer vision and computer graphics.
We present the Video-to-Motion Generator (VTM), which leverages motion priors through cross-modal latent feature space alignment.
The VTM showcases state-of-the-art performance in reconstructing 3D human motion from monocular videos.
arXiv Detail & Related papers (2024-04-15T06:38:09Z)
- SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering [45.51684124904457]
We propose a new 4D motion paradigm, SurMo, that models the temporal dynamics and human appearances in a unified framework.
It comprises three components: surface-based motion encoding, which models 4D human motions with an efficient, compact surface-based triplane; physical motion decoding, which is designed to encourage physical motion learning; and 4D appearance modeling, which renders the motion triplanes into images via efficient surface-conditioned decoding.
arXiv Detail & Related papers (2024-04-01T16:34:27Z)
- Universal Humanoid Motion Representations for Physics-Based Control [71.46142106079292]
We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control.
We first learn a motion imitator that can imitate all of human motion from a large, unstructured motion dataset.
We then create our motion representation by distilling skills directly from the imitator.
arXiv Detail & Related papers (2023-10-06T20:48:43Z)
- MoCaNet: Motion Retargeting in-the-wild via Canonicalization Networks [77.56526918859345]
We present a novel framework that brings the 3D motion retargeting task from controlled environments to in-the-wild scenarios.
It is capable of retargeting body motion from a character in a 2D monocular video to a 3D character without using any motion capture system or 3D reconstruction procedure.
arXiv Detail & Related papers (2021-12-19T07:52:05Z)
- Action2video: Generating Videos of Human 3D Actions [31.665831044217363]
We aim to tackle the interesting yet challenging problem of generating videos of diverse and natural human motions from prescribed action categories.
The key issue lies in the ability to synthesize multiple distinct motion sequences that are realistic in their visual appearances.
Action2motion stochastically generates plausible 3D pose sequences of a prescribed action category, which are then processed and rendered by motion2video to form 2D videos.
arXiv Detail & Related papers (2021-11-12T20:20:37Z)
- High-Fidelity Neural Human Motion Transfer from Monocular Video [71.75576402562247]
Video-based human motion transfer creates video animations of humans following a source motion.
We present a new framework which performs high-fidelity and temporally-consistent human motion transfer with natural pose-dependent non-rigid deformations.
In the experimental results, we significantly outperform the state-of-the-art in terms of video realism.
arXiv Detail & Related papers (2020-12-20T16:54:38Z)
- Contact and Human Dynamics from Monocular Video [73.47466545178396]
Existing deep models predict 2D and 3D kinematic poses from video that are approximately accurate, but contain visible errors.
We present a physics-based method for inferring 3D human motion from video sequences that takes initial 2D and 3D pose estimates as input.
arXiv Detail & Related papers (2020-07-22T21:09:11Z)
- Motion Guided 3D Pose Estimation from Videos [81.14443206968444]
We propose a new loss function, called motion loss, for the problem of monocular 3D human pose estimation from 2D pose.
In computing motion loss, a simple yet effective representation for keypoint motion, called pairwise motion encoding, is introduced.
We design a new graph convolutional network architecture, U-shaped GCN (UGCN), which captures both short-term and long-term motion information.
arXiv Detail & Related papers (2020-04-29T06:59:30Z)
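The motion-loss idea in the last entry above also reduces to a small sketch. The exact pairwise motion encoding is not spelled out in the summary, so in the code below the frame-to-frame change of every pairwise joint offset stands in for it as an assumption; the loss simply compares these encodings between predicted and ground-truth 3D sequences.

```python
import torch


def pairwise_motion_encoding(poses):
    """poses: (B, T, J, 3) joint positions. Encodes motion as the frame-to-frame
    change of every pairwise joint offset (an illustrative stand-in for the
    paper's pairwise motion encoding)."""
    offsets = poses.unsqueeze(3) - poses.unsqueeze(2)       # (B, T, J, J, 3)
    return offsets[:, 1:] - offsets[:, :-1]                 # (B, T-1, J, J, 3)


def motion_loss(pred_3d, gt_3d):
    """Penalizes mismatch between the motion encodings of predicted and
    ground-truth pose sequences, complementing a per-frame position loss."""
    return torch.mean(torch.abs(pairwise_motion_encoding(pred_3d)
                                - pairwise_motion_encoding(gt_3d)))
```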
This list is automatically generated from the titles and abstracts of the papers on this site.