Playing for 3D Human Recovery
- URL: http://arxiv.org/abs/2110.07588v1
- Date: Thu, 14 Oct 2021 17:49:42 GMT
- Title: Playing for 3D Human Recovery
- Authors: Zhongang Cai, Mingyuan Zhang, Jiawei Ren, Chen Wei, Daxuan Ren,
Jiatong Li, Zhengyu Lin, Haiyu Zhao, Shuai Yi, Lei Yang, Chen Change Loy,
Ziwei Liu
- Abstract summary: In this work, we obtain massive human sequences as well as their 3D ground truths by playing video games.
Specifically, we contribute GTA-Human, a mega-scale and highly diverse 3D human dataset generated with the GTA-V game engine.
With a rich set of subjects, actions, and scenarios, GTA-Human serves as an effective training source.
- Score: 74.01259933358331
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image- and video-based 3D human recovery (i.e. pose and shape estimation)
has achieved substantial progress. However, due to the prohibitive cost of
motion capture, existing datasets are often limited in scale and diversity,
which hinders the further development of more powerful models. In this work, we
obtain massive human sequences as well as their 3D ground truths by playing
video games. Specifically, we contribute GTA-Human, a mega-scale and highly
diverse 3D human dataset generated with the GTA-V game engine. With a rich set
of subjects, actions, and scenarios, GTA-Human serves as an effective training
source. Notably, the "unreasonable effectiveness of data"
phenomenon is validated in 3D human recovery using our game-playing data. A
simple frame-based baseline trained on GTA-Human already outperforms more
sophisticated methods by a large margin; for video-based methods, GTA-Human
demonstrates superiority over even the in-domain training set. We extend our
study to larger models and observe the same consistent improvements; the
study on supervision signals suggests the rich collection of SMPL annotations
is key. Furthermore, equipped with the diverse annotations in GTA-Human, we
systematically investigate the performance of various methods under a wide
spectrum of real-world variations, e.g. camera angles, poses, and occlusions.
We hope our work could pave the way for scaling up 3D human recovery to the real
world.
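The abstract attributes much of the gain to supervising models with the dataset's rich SMPL annotations. As an illustration only, below is a minimal sketch of how a frame-based baseline might combine SMPL pose/shape parameter regression with 3D keypoint supervision; the dictionary keys, tensor shapes, loss weights, and function names are assumptions for exposition, not the paper's actual training code.

```python
import torch
import torch.nn.functional as F

def smpl_supervision_loss(pred, target, w_pose=1.0, w_shape=0.001, w_kp3d=5.0):
    """Illustrative loss mixing SMPL parameter and 3D keypoint supervision.

    pred/target are dicts with (all keys/shapes are assumptions):
      'pose'  : (B, 72)   axis-angle SMPL body pose
      'shape' : (B, 10)   SMPL shape coefficients
      'kp3d'  : (B, J, 3) 3D joint locations
    """
    loss_pose = F.mse_loss(pred["pose"], target["pose"])    # parameter regression
    loss_shape = F.mse_loss(pred["shape"], target["shape"]) # shape regression
    loss_kp3d = F.l1_loss(pred["kp3d"], target["kp3d"])     # joint supervision
    return w_pose * loss_pose + w_shape * loss_shape + w_kp3d * loss_kp3d

if __name__ == "__main__":
    # Dummy batch standing in for annotated frames (purely synthetic values).
    B, J = 4, 24
    pred = {"pose": torch.randn(B, 72), "shape": torch.randn(B, 10),
            "kp3d": torch.randn(B, J, 3)}
    target = {"pose": torch.randn(B, 72), "shape": torch.randn(B, 10),
              "kp3d": torch.randn(B, J, 3)}
    print(smpl_supervision_loss(pred, target).item())
```

The relative loss weights above are common heuristics in SMPL-based pipelines and would need tuning per dataset; they are not taken from the paper.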
Related papers
- MVHumanNet++: A Large-scale Dataset of Multi-view Daily Dressing Human Captures with Richer Annotations for 3D Human Digitization [36.46025784260418]
We present MVHumanNet++, a dataset that comprises multi-view human action sequences of 4,500 human identities.
Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million frames with extensive annotations.
arXiv Detail & Related papers (2025-05-03T15:02:34Z) - FAMOUS: High-Fidelity Monocular 3D Human Digitization Using View Synthesis [51.193297565630886]
The challenge of accurately inferring texture remains, particularly in obscured areas such as the back of a person in frontal-view images.
This limitation in texture prediction largely stems from the scarcity of large-scale and diverse 3D datasets.
We propose leveraging extensive 2D fashion datasets to enhance both texture and shape prediction in 3D human digitization.
arXiv Detail & Related papers (2024-10-13T01:25:05Z) - MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human
Captures [44.172804112944625]
We present MVHumanNet, a dataset that comprises multi-view human action sequences of 4,500 human identities.
Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions.
arXiv Detail & Related papers (2023-12-05T18:50:12Z) - Learning Human Action Recognition Representations Without Real Humans [66.61527869763819]
We present a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model.
We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks.
Our approach outperforms previous baselines by up to 5%.
arXiv Detail & Related papers (2023-11-10T18:38:14Z) - BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike
Animated Motion [52.11972919802401]
We show that neural networks trained only on synthetic data achieve state-of-the-art accuracy on the problem of 3D human pose and shape estimation from real images.
Previous synthetic datasets have been small, unrealistic, or lacked realistic clothing.
arXiv Detail & Related papers (2023-06-29T13:35:16Z) - 3D Segmentation of Humans in Point Clouds with Synthetic Data [21.518379214837278]
We propose the task of joint 3D human semantic segmentation, instance segmentation and multi-human body-part segmentation.
We propose a framework for generating training data of synthetic humans interacting with real 3D scenes.
We also propose a novel transformer-based model, Human3D, which is the first end-to-end model for segmenting multiple human instances and their body-parts.
arXiv Detail & Related papers (2022-12-01T18:59:21Z) - Hands-Up: Leveraging Synthetic Data for Hands-On-Wheel Detection [0.38233569758620045]
This work demonstrates the use of synthetic photo-realistic in-cabin data to train a Driver Monitoring System.
We show how performing error analysis and generating the missing edge-cases in our platform boosts performance.
This showcases the ability of human-centric synthetic data to generalize well to the real world.
arXiv Detail & Related papers (2022-05-31T23:34:12Z) - S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling [103.65625425020129]
We represent the pedestrian's shape, pose and skinning weights as neural implicit functions that are directly learned from data.
We demonstrate the effectiveness of our approach on various datasets and show that our reconstructions outperform existing state-of-the-art methods.
arXiv Detail & Related papers (2021-01-17T02:16:56Z) - Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations
in 3D [71.11034329713058]
Existing datasets lack large-scale, high-quality 3D ground truth information.
Rel3D is the first large-scale, human-annotated dataset for grounding spatial relations in 3D.
We propose minimally contrastive data collection -- a novel crowdsourcing method for reducing dataset bias.
arXiv Detail & Related papers (2020-12-03T01:51:56Z)