MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human
Captures
- URL: http://arxiv.org/abs/2312.02963v1
- Date: Tue, 5 Dec 2023 18:50:12 GMT
- Title: MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human
Captures
- Authors: Zhangyang Xiong, Chenghong Li, Kenkun Liu, Hongjie Liao, Jianqiao Hu,
Junyi Zhu, Shuliang Ning, Lingteng Qiu, Chongjie Wang, Shijie Wang, Shuguang
Cui and Xiaoguang Han
- Abstract summary: We present MVHumanNet, a dataset that comprises multi-view human action sequences of 4,500 human identities.
Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions.
- Score: 44.172804112944625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this era, the success of large language models and text-to-image models
can be attributed to the driving force of large-scale datasets. However, in the
realm of 3D vision, while remarkable progress has been made with models trained
on large-scale synthetic and real-captured object data like Objaverse and
MVImgNet, a similar level of progress has not been observed in the domain of
human-centric tasks, partially due to the lack of a large-scale human dataset.
Existing datasets of high-fidelity 3D human capture continue to be mid-sized
due to the significant challenges in acquiring large-scale high-quality 3D
human data. To bridge this gap, we present MVHumanNet, a dataset that comprises
multi-view human action sequences of 4,500 human identities. The primary focus
of our work is on collecting human data that features a large number of diverse
identities and everyday clothing using a multi-view human capture system, which
facilitates easily scalable data collection. Our dataset contains 9,000 daily
outfits, 60,000 motion sequences and 645 million frames with extensive
annotations, including human masks, camera parameters, 2D and 3D keypoints,
SMPL/SMPLX parameters, and corresponding textual descriptions. To explore the
potential of MVHumanNet in various 2D and 3D visual tasks, we conducted pilot
studies on view-consistent action recognition, human NeRF reconstruction,
text-driven view-unconstrained human image generation, as well as 2D
view-unconstrained human image and 3D avatar generation. Extensive experiments
demonstrate the performance improvements and effective applications enabled by
the scale provided by MVHumanNet. As the current largest-scale 3D human
dataset, we hope that the release of MVHumanNet data with annotations will
foster further innovations in the domain of 3D human-centric tasks at scale.
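Since the abstract does not specify the release format, the following is a minimal sketch of how the per-frame annotations might be consumed, assuming a standard pinhole camera model; all file paths and JSON keys (cameras/view_00.json, "K", "R", "t", keypoints3d/...) are hypothetical placeholders, not the dataset's actual layout.

    # Minimal sketch: project annotated 3D keypoints into one camera view.
    # All paths and key names below are hypothetical; the released layout
    # may differ. Assumes a pinhole model: x = K [R | t] X.
    import json
    import numpy as np

    def load_camera(path):
        # Hypothetical per-view camera file with intrinsics K (3x3),
        # world-to-camera rotation R (3x3) and translation t (3,).
        with open(path) as f:
            cam = json.load(f)
        K = np.asarray(cam["K"], dtype=np.float64)
        R = np.asarray(cam["R"], dtype=np.float64)
        t = np.asarray(cam["t"], dtype=np.float64)
        return K, R, t

    def project(points_3d, K, R, t):
        # Project Nx3 world-space keypoints to Nx2 pixel coordinates.
        cam_pts = points_3d @ R.T + t.reshape(1, 3)  # world -> camera frame
        uv = cam_pts @ K.T                           # apply intrinsics
        return uv[:, :2] / uv[:, 2:3]                # perspective divide

    if __name__ == "__main__":
        K, R, t = load_camera("cameras/view_00.json")      # hypothetical path
        kpts_3d = np.load("keypoints3d/frame_000000.npy")  # hypothetical Nx3 array
        kpts_2d = project(kpts_3d, K, R, t)
        print(kpts_2d.shape)  # (N, 2) pixel coordinates in view_00

The SMPL/SMPLX parameter annotations could be consumed analogously, for instance with the smplx Python package, to recover a posed body mesh for each frame.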
Related papers
- HumanVLM: Foundation for Human-Scene Vision-Language Model [3.583459930633303]
Human-scene vision-language tasks are increasingly prevalent in diverse social applications.
This study introduces a domain-specific Large Vision-Language Model, the Human-Scene Vision-Language Model (HumanVLM).
In experiments, we evaluate HumanVLM across various downstream tasks, where it demonstrates superior overall performance.
arXiv Detail & Related papers (2024-11-05T12:14:57Z) - HINT: Learning Complete Human Neural Representations from Limited Viewpoints [69.76947323932107]
We propose a NeRF-based algorithm able to learn a detailed and complete human model from limited viewing angles.
As a result, our method can reconstruct complete humans even from a few viewing angles, improving PSNR by more than 15%.
arXiv Detail & Related papers (2024-05-30T05:43:09Z) - 3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models [52.96248836582542]
We propose an effective approach based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations.
By exclusively employing generative models, we generate large-scale in-the-wild human images and high-quality annotations, eliminating the need for real-world data collection.
arXiv Detail & Related papers (2024-03-17T06:31:16Z) - Cross-view and Cross-pose Completion for 3D Human Understanding [22.787947086152315]
We propose a pre-training approach based on self-supervised learning that works on human-centric data using only images.
We pre-train a model for body-centric tasks and one for hand-centric tasks.
With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks.
arXiv Detail & Related papers (2023-11-15T16:51:18Z) - DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity
Human-centric Rendering [126.00165445599764]
We present DNA-Rendering, a large-scale, high-fidelity repository of human performance data for neural actor rendering.
Our dataset contains over 1,500 human subjects, 5,000 motion sequences, and a data volume of 67.5M frames.
We construct a professional multi-view capture system of 60 synchronized cameras with up to 4096 x 3000 resolution at 15 fps, together with strict camera calibration steps.
arXiv Detail & Related papers (2023-07-19T17:58:03Z) - 3D Segmentation of Humans in Point Clouds with Synthetic Data [21.518379214837278]
We propose the task of joint 3D human semantic segmentation, instance segmentation and multi-human body-part segmentation.
We propose a framework for generating training data of synthetic humans interacting with real 3D scenes.
We also propose a novel transformer-based model, Human3D, which is the first end-to-end model for segmenting multiple human instances and their body-parts.
arXiv Detail & Related papers (2022-12-01T18:59:21Z) - Playing for 3D Human Recovery [88.91567909861442]
In this work, we obtain massive human sequences from a video game, paired with automatically annotated 3D ground truths.
Specifically, we contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine.
A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin.
arXiv Detail & Related papers (2021-10-14T17:49:42Z) - HUMBI: A Large Multiview Dataset of Human Body Expressions and Benchmark
Challenge [33.26419876973344]
This paper presents a new large multiview dataset called HUMBI for human body expressions with natural clothing.
107 synchronized HD cameras are used to capture 772 distinctive subjects across gender, ethnicity, age, and style.
We reconstruct high-fidelity body expressions using 3D mesh models, which allow representing view-specific appearance.
arXiv Detail & Related papers (2021-09-30T23:19:25Z) - S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling [103.65625425020129]
We represent the pedestrian's shape, pose and skinning weights as neural implicit functions that are directly learned from data.
We demonstrate the effectiveness of our approach on various datasets and show that our reconstructions outperform existing state-of-the-art methods.
arXiv Detail & Related papers (2021-01-17T02:16:56Z)