Human MotionFormer: Transferring Human Motions with Vision Transformers
- URL: http://arxiv.org/abs/2302.11306v1
- Date: Wed, 22 Feb 2023 11:42:44 GMT
- Title: Human MotionFormer: Transferring Human Motions with Vision Transformers
- Authors: Hongyu Liu and Xintong Han and ChengBin Jin and Huawei Wei and Zhe Lin and Faqiang Wang and Haoye Dong and Yibing Song and Jia Xu and Qifeng Chen
- Abstract summary: Human motion transfer aims to transfer motions from a target dynamic person to a source static one for motion synthesis.
We propose Human MotionFormer, a hierarchical ViT framework that leverages global and local perceptions to capture large and subtle motion matching.
Experiments show that our Human MotionFormer sets the new state-of-the-art performance both qualitatively and quantitatively.
- Score: 73.48118882676276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human motion transfer aims to transfer motions from a target dynamic person
to a source static one for motion synthesis. An accurate matching between the
source person and the target motion in both large and subtle motion changes is
vital for improving the transferred motion quality. In this paper, we propose
Human MotionFormer, a hierarchical ViT framework that leverages global and
local perceptions to capture large and subtle motion matching, respectively. It
consists of two ViT encoders to extract input features (i.e., a target motion
image and a source human image) and a ViT decoder with several cascaded blocks
for feature matching and motion transfer. In each block, we set the target
motion feature as Query and the source person as Key and Value, calculating the
cross-attention maps to conduct a global feature matching. Further, we
introduce a convolutional layer to improve the local perception after the
global cross-attention computations. This matching process is implemented in
both warping and generation branches to guide the motion transfer. During
training, we propose a mutual learning loss to enable the co-supervision
between warping and generation branches for better motion representations.
Experiments show that our Human MotionFormer sets the new state-of-the-art
performance both qualitatively and quantitatively. Project page:
https://github.com/KumapowerLIU/Human-MotionFormer
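The decoder's matching step can be pictured with a minimal PyTorch sketch: each block computes cross-attention with the target motion feature as Query and the source person feature as Key and Value, then applies a convolution for local perception. All module names, dimensions, and the residual/MLP placement below are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of one decoder block: global cross-attention (motion tokens
# query person tokens) followed by a depth-wise convolution for local detail.
# Dimensions, normalization, and residual layout are illustrative assumptions.
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # Global feature matching via multi-head cross-attention.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Local perception via a depth-wise 3x3 convolution on the token grid.
        self.local_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, motion_feat, person_feat, hw):
        """motion_feat, person_feat: (B, N, C) token sequences; hw = (H, W), N == H * W."""
        # Target motion tokens act as Query; source person tokens as Key/Value.
        attn_out, _ = self.cross_attn(self.norm_q(motion_feat),
                                      self.norm_kv(person_feat),
                                      self.norm_kv(person_feat))
        x = motion_feat + attn_out
        # Reshape tokens to a spatial map so the conv can refine local structure.
        B, N, C = x.shape
        H, W = hw
        x_map = x.transpose(1, 2).reshape(B, C, H, W)
        x = x + self.local_conv(x_map).reshape(B, C, N).transpose(1, 2)
        return x + self.mlp(x)


# Usage: match 32x32 token grids produced by the two ViT encoders.
block = CrossAttentionBlock(dim=256, num_heads=8)
motion_tokens = torch.randn(1, 32 * 32, 256)   # target motion feature
person_tokens = torch.randn(1, 32 * 32, 256)   # source person feature
out = block(motion_tokens, person_tokens, hw=(32, 32))
print(out.shape)  # torch.Size([1, 1024, 256])
```

Per the abstract, this matching runs in both the warping and generation branches, and a mutual learning loss lets the two branches co-supervise each other during training.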
Related papers
- Sitcom-Crafter: A Plot-Driven Human Motion Generation System in 3D Scenes [83.55301458112672]
Sitcom-Crafter is a system for human motion generation in 3D space.
Central to the function generation modules is our novel 3D scene-aware human-human interaction module.
Augmentation modules encompass plot comprehension for command generation and motion synchronization for seamless integration of different motion types.
arXiv Detail & Related papers (2024-10-14T17:56:19Z)
- Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches [12.221087476416056]
We introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning.
These motion patches, created by dividing and sorting skeleton joints based on motion sequences, are robust to varying skeleton structures.
We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis.
arXiv Detail & Related papers (2024-05-08T02:42:27Z)
- Universal Humanoid Motion Representations for Physics-Based Control [71.46142106079292]
We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control.
We first learn a motion imitator that can imitate all human motions from a large, unstructured motion dataset.
We then create our motion representation by distilling skills directly from the imitator.
arXiv Detail & Related papers (2023-10-06T20:48:43Z)
- Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation.
M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse.
We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
arXiv Detail & Related papers (2023-08-28T10:40:16Z)
- Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations [61.659439423703155]
TOHO: Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations.
Our method generates continuous motions that are parameterized only by the temporal coordinate.
This work takes a step further toward general human-scene interaction simulation.
arXiv Detail & Related papers (2023-03-23T09:31:56Z)
- MotionBERT: A Unified Perspective on Learning Human Motion Representations [46.67364057245364]
We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources.
We propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations.
We implement motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network.
arXiv Detail & Related papers (2022-10-12T19:46:25Z)
- REMOT: A Region-to-Whole Framework for Realistic Human Motion Transfer [96.64111294772141]
Human Video Motion Transfer (HVMT) aims to, given an image of a source person, generate his/her video that imitates the motion of the driving person.
Existing methods for HVMT mainly exploit Generative Adversarial Networks (GANs) to perform the warping operation.
This paper presents a novel REgion-to-whole human MOtion Transfer (REMOT) framework based on GANs.
arXiv Detail & Related papers (2022-09-01T14:03:51Z)
- Task-Generic Hierarchical Human Motion Prior using VAEs [44.356707509079044]
A deep generative model that describes human motions can benefit a wide range of fundamental computer vision and graphics tasks.
We present a method for learning complex human motions independent of specific tasks using a combined global and local latent space.
We demonstrate the effectiveness of our hierarchical motion variational autoencoder in a variety of tasks including video-based human pose estimation.
arXiv Detail & Related papers (2021-06-07T23:11:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.