Playing for 3D Human Recovery
- URL: http://arxiv.org/abs/2110.07588v1
- Date: Thu, 14 Oct 2021 17:49:42 GMT
- Title: Playing for 3D Human Recovery
- Authors: Zhongang Cai, Mingyuan Zhang, Jiawei Ren, Chen Wei, Daxuan Ren,
Jiatong Li, Zhengyu Lin, Haiyu Zhao, Shuai Yi, Lei Yang, Chen Change Loy,
Ziwei Liu
- Abstract summary: In this work, we obtain massive human sequences as well as their 3D ground truths by playing video games.
Specifically, we contribute GTA-Human, a mega-scale and highly diverse 3D human dataset generated with the GTA-V game engine.
With a rich set of subjects, actions, and scenarios, GTA-Human serves as an effective training source.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image- and video-based 3D human recovery (i.e. pose and shape estimation)
has achieved substantial progress. However, due to the prohibitive cost of
motion capture, existing datasets are often limited in scale and diversity,
which hinders the further development of more powerful models. In this work, we
obtain massive human sequences as well as their 3D ground truths by playing
video games. Specifically, we contribute GTA-Human, a mega-scale and
highly diverse 3D human dataset generated with the GTA-V game engine. With a
rich set of subjects, actions, and scenarios, GTA-Human serves as an
effective training source. Notably, the "unreasonable effectiveness of data"
phenomenon is validated in 3D human recovery using our game-playing data. A
simple frame-based baseline trained on GTA-Human already outperforms more
sophisticated methods by a large margin; for video-based methods, GTA-Human
demonstrates superiority over even the in-domain training set. We extend our
study to larger models to observe the same consistent improvements, and the
study on supervision signals suggests the rich collection of SMPL annotations
is key. Furthermore, equipped with the diverse annotations in GTA-Human, we
systematically investigate the performance of various methods under a wide
spectrum of real-world variations, e.g. camera angles, poses, and occlusions.
We hope our work could pave the way for scaling up 3D human recovery to the real
world.
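The abstract reports that the rich collection of SMPL annotations is key to GTA-Human's training value. As a minimal sketch of what SMPL-parameter supervision looks like in practice, the snippet below builds a hypothetical frame-level annotation record and computes an L2 loss over pose and shape parameters. The field names and the loss form are illustrative assumptions, not the dataset's actual schema or the paper's training objective.

```python
import numpy as np

def make_sample():
    """Build one hypothetical frame-level SMPL annotation record.

    Field names are assumptions for illustration; the real GTA-Human
    schema may differ.
    """
    return {
        "pose": np.zeros(72),               # SMPL pose: 24 joints x 3 axis-angle params
        "betas": np.zeros(10),              # SMPL shape coefficients
        "transl": np.zeros(3),              # root translation in camera space
        "keypoints_2d": np.zeros((24, 2)),  # projected joints (illustrative)
    }

def smpl_param_loss(pred, target):
    """Mean-squared error over SMPL pose and shape parameters,
    a common form of direct parameter supervision."""
    pose_err = np.mean((pred["pose"] - target["pose"]) ** 2)
    shape_err = np.mean((pred["betas"] - target["betas"]) ** 2)
    return pose_err + shape_err

# Usage: a prediction whose pose is off by 0.1 everywhere
target = make_sample()
pred = make_sample()
pred["pose"] = pred["pose"] + 0.1
print(round(smpl_param_loss(pred, target), 4))  # 0.01
```

Direct losses on SMPL parameters like this are one of several supervision signals (alongside 2D/3D keypoint losses) that synthetic data with full ground-truth parameters makes available at scale.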
Related papers
- HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation [64.37874983401221]
We present HumanVid, the first large-scale high-quality dataset tailored for human image animation.
For the real-world data, we compile a vast collection of copyright-free real-world videos from the internet.
For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets.
arXiv Detail & Related papers (2024-07-24T17:15:58Z)
- MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures [44.172804112944625]
We present MVHumanNet, a dataset that comprises multi-view human action sequences of 4,500 human identities.
Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions.
arXiv Detail & Related papers (2023-12-05T18:50:12Z)
- Get3DHuman: Lifting StyleGAN-Human into a 3D Generative Model using Pixel-aligned Reconstruction Priors [56.192682114114724]
Get3DHuman is a novel 3D human framework that can significantly boost the realism and diversity of the generated outcomes.
Our key observation is that the 3D generator can profit from human-related priors learned through 2D human generators and 3D reconstructors.
arXiv Detail & Related papers (2023-02-02T15:37:46Z)
- FLEX: Full-Body Grasping Without Full-Body Grasps [24.10724524386518]
We address the task of generating a virtual human -- hands and full body -- grasping everyday objects.
Existing methods approach this problem by collecting a 3D dataset of humans interacting with objects and training on this data.
We leverage the existence of both full-body pose and hand grasping priors, composing them using 3D geometrical constraints to obtain full-body grasps.
arXiv Detail & Related papers (2022-11-21T23:12:54Z)
- Decanus to Legatus: Synthetic training for 2D-3D human pose lifting [26.108023246654646]
We propose an algorithm to generate infinite 3D synthetic human poses (Legatus) from a 3D pose distribution based on 10 initial handcrafted 3D poses (Decanus).
Our results show that we can achieve 3D pose estimation performance comparable to methods using real data from specialized datasets but in a zero-shot setup, showing the potential of our framework.
arXiv Detail & Related papers (2022-10-05T13:10:19Z)
- Human Performance Capture from Monocular Video in the Wild [50.34917313325813]
We propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses.
Our method outperforms state-of-the-art methods on an in-the-wild human video dataset 3DPW.
arXiv Detail & Related papers (2021-11-29T16:32:41Z)
- S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling [103.65625425020129]
We represent the pedestrian's shape, pose and skinning weights as neural implicit functions that are directly learned from data.
We demonstrate the effectiveness of our approach on various datasets and show that our reconstructions outperform existing state-of-the-art methods.
arXiv Detail & Related papers (2021-01-17T02:16:56Z)
- Benchmarking End-to-End Behavioural Cloning on Video Games [5.863352129133669]
We study the general applicability of behavioural cloning on twelve video games, including six modern video games (published after 2010).
Our results show that these agents cannot match humans in raw performance but do learn basic dynamics and rules.
We also demonstrate how the quality of the data matters, and how recording data from humans is subject to a state-action mismatch, due to human reflexes.
arXiv Detail & Related papers (2020-04-02T13:31:51Z)
- Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations [73.11883464562895]
We propose a new architecture that facilitates unsupervised, or lightly supervised, learning.
We demonstrate the method by learning 3D human pose and shape from un-paired and un-annotated images.
While we present results for modeling humans, our formulation is general and can be applied to other vision problems.
arXiv Detail & Related papers (2020-01-06T14:54:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.