Playing for 3D Human Recovery
- URL: http://arxiv.org/abs/2110.07588v3
- Date: Sun, 8 Sep 2024 16:20:11 GMT
- Title: Playing for 3D Human Recovery
- Authors: Zhongang Cai, Mingyuan Zhang, Jiawei Ren, Chen Wei, Daxuan Ren, Zhengyu Lin, Haiyu Zhao, Lei Yang, Chen Change Loy, Ziwei Liu
- Abstract summary: In this work, we obtain massive human sequences by playing the video game with automatically annotated 3D ground truths.
Specifically, we contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine.
A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin.
- Score: 88.91567909861442
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image- and video-based 3D human recovery (i.e., pose and shape estimation) have achieved substantial progress. However, due to the prohibitive cost of motion capture, existing datasets are often limited in scale and diversity. In this work, we obtain massive human sequences by playing the video game with automatically annotated 3D ground truths. Specifically, we contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine, featuring a highly diverse set of subjects, actions, and scenarios. More importantly, we study the use of game-playing data and obtain five major insights. First, game-playing data is surprisingly effective. A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin. For video-based methods, GTA-Human is even on par with the in-domain training set. Second, we discover that synthetic data provides critical complements to the real data that is typically collected indoors. Our investigation into domain gap provides explanations for our data mixture strategies that are simple yet useful. Third, the scale of the dataset matters. The performance boost is closely related to the additional data available. A systematic study reveals the model sensitivity to data density from multiple key aspects. Fourth, the effectiveness of GTA-Human is also attributed to the rich collection of strong supervision labels (SMPL parameters), which are otherwise expensive to acquire in real datasets. Fifth, the benefits of synthetic data extend to larger models such as deeper convolutional neural networks (CNNs) and Transformers, for which a significant impact is also observed. We hope our work could pave the way for scaling up 3D human recovery to the real world. Homepage: https://caizhongang.github.io/projects/GTA-Human/
Related papers
- FAMOUS: High-Fidelity Monocular 3D Human Digitization Using View Synthesis [51.193297565630886]
The challenge of accurately inferring texture remains, particularly in obscured areas such as the back of a person in frontal-view images.
This limitation in texture prediction largely stems from the scarcity of large-scale and diverse 3D datasets.
We propose leveraging extensive 2D fashion datasets to enhance both texture and shape prediction in 3D human digitization.
arXiv Detail & Related papers (2024-10-13T01:25:05Z)
- MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures [44.172804112944625]
We present MVHumanNet, a dataset that comprises multi-view human action sequences of 4,500 human identities.
Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions.
arXiv Detail & Related papers (2023-12-05T18:50:12Z)
- Learning Human Action Recognition Representations Without Real Humans [66.61527869763819]
We present a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model.
We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks.
Our approach outperforms previous baselines by up to 5%.
arXiv Detail & Related papers (2023-11-10T18:38:14Z)
- BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion [52.11972919802401]
We show that neural networks trained only on synthetic data achieve state-of-the-art accuracy on the problem of 3D human pose and shape estimation from real images.
Previous synthetic datasets have been small, unrealistic, or lacked realistic clothing.
arXiv Detail & Related papers (2023-06-29T13:35:16Z)
- 3D Segmentation of Humans in Point Clouds with Synthetic Data [21.518379214837278]
We propose the task of joint 3D human semantic segmentation, instance segmentation and multi-human body-part segmentation.
We propose a framework for generating training data of synthetic humans interacting with real 3D scenes.
We also propose a novel transformer-based model, Human3D, which is the first end-to-end model for segmenting multiple human instances and their body-parts.
arXiv Detail & Related papers (2022-12-01T18:59:21Z)
- Hands-Up: Leveraging Synthetic Data for Hands-On-Wheel Detection [0.38233569758620045]
This work demonstrates the use of synthetic photo-realistic in-cabin data to train a Driver Monitoring System.
We show how performing error analysis and generating the missing edge-cases in our platform boosts performance.
This showcases the ability of human-centric synthetic data to generalize well to the real world.
arXiv Detail & Related papers (2022-05-31T23:34:12Z)
- S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling [103.65625425020129]
We represent the pedestrian's shape, pose and skinning weights as neural implicit functions that are directly learned from data.
We demonstrate the effectiveness of our approach on various datasets and show that our reconstructions outperform existing state-of-the-art methods.
arXiv Detail & Related papers (2021-01-17T02:16:56Z)
- Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D [71.11034329713058]
Existing datasets lack large-scale, high-quality 3D ground truth information.
Rel3D is the first large-scale, human-annotated dataset for grounding spatial relations in 3D.
We propose minimally contrastive data collection -- a novel crowdsourcing method for reducing dataset bias.
arXiv Detail & Related papers (2020-12-03T01:51:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.