Sapiens: Foundation for Human Vision Models
- URL: http://arxiv.org/abs/2408.12569v3
- Date: Tue, 27 Aug 2024 02:31:42 GMT
- Title: Sapiens: Foundation for Human Vision Models
- Authors: Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, Shunsuke Saito
- Abstract summary: We present Sapiens, a family of models for four fundamental human-centric vision tasks.
Our models support 1K high-resolution inference and are easy to adapt for individual tasks.
We observe that self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks.
- Score: 14.72839332332364
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present Sapiens, a family of models for four fundamental human-centric vision tasks -- 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability -- model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error. Project page: https://about.meta.com/realitylabs/codecavatars/sapiens.
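The recipe described above (one pretrained encoder, small task-specific heads fine-tuned per task) can be sketched in a few lines of PyTorch. Everything below is a hypothetical stand-in: ToyViTEncoder, DenseHead, and all sizes and channel counts are illustrative, not the actual Sapiens architecture or API.
```python
import torch
import torch.nn as nn

class ToyViTEncoder(nn.Module):
    """Stand-in for a pretrained ViT-style encoder: patchify + transformer."""
    def __init__(self, img_size=256, patch=16, dim=384, depth=4, heads=6):
        super().__init__()
        self.grid = img_size // patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.blocks(tokens)

class DenseHead(nn.Module):
    """Lightweight task head: 1x1 conv + pixel shuffle back to input size."""
    def __init__(self, dim, out_ch, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(dim, out_ch * patch * patch, 1)
        self.up = nn.PixelShuffle(patch)

    def forward(self, tokens, grid):
        b, n, d = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, d, grid, grid)
        return self.up(self.proj(x))

# One head per task; channel counts here are placeholders, not the paper's.
out_channels = {"pose": 17, "part_seg": 28, "depth": 1, "normal": 3}
encoder, head = ToyViTEncoder(), DenseHead(384, out_channels["depth"])

img = torch.randn(2, 3, 256, 256)
pred = head(encoder(img), encoder.grid)      # (2, 1, 256, 256) dense map
```
Fine-tuning would then train the head (and optionally the encoder) on task labels; only the head changes between tasks.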
Related papers
- PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions [57.871692507044344]
Pose estimation aims to accurately identify anatomical keypoints in humans and animals using monocular images.
Current models are typically trained and tested on clean data, potentially overlooking the corruptions encountered during real-world deployment.
We introduce PoseBench, a benchmark designed to evaluate the robustness of pose estimation models against real-world corruption.
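A hedged sketch of what corruption-robustness evaluation of a pose model can look like; the corruption functions and severity scale below are illustrative stand-ins, not PoseBench's actual suite or API:
```python
import torch
import torch.nn.functional as F

def gaussian_noise(img, severity):
    return (img + torch.randn_like(img) * 0.05 * severity).clamp(0, 1)

def defocus_blur(img, severity):
    k = 2 * severity + 1                       # kernel grows with severity
    return F.avg_pool2d(img, k, stride=1, padding=k // 2)

CORRUPTIONS = {"gaussian_noise": gaussian_noise, "defocus_blur": defocus_blur}

@torch.no_grad()
def evaluate(model, images, corruption=None, severity=3):
    if corruption is not None:
        images = CORRUPTIONS[corruption](images, severity)
    return model(images)   # keypoint heatmaps; metric computation omitted

model = lambda x: torch.zeros(x.size(0), 17, 64, 64)   # stand-in pose model
batch = torch.rand(4, 3, 256, 256)
clean_out = evaluate(model, batch)
noisy_out = evaluate(model, batch, "gaussian_noise", severity=5)
```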
arXiv Detail & Related papers (2024-06-20T14:40:17Z)
- Cross-view and Cross-pose Completion for 3D Human Understanding [22.787947086152315]
We propose a pre-training approach based on self-supervised learning that works on human-centric data using only images.
We pre-train a model for body-centric tasks and one for hand-centric tasks.
With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks.
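A rough sketch of cross-view completion pretraining under assumed shapes: patches masked in one view are reconstructed with help from an unmasked second view of the same person. All modules and sizes are illustrative (positional embeddings and the loss are omitted); this is not the paper's implementation:
```python
import torch
import torch.nn as nn

dim, n_patches, mask_ratio = 256, 196, 0.75    # illustrative sizes

enc_layer = nn.TransformerEncoderLayer(dim, 8, 4 * dim, batch_first=True)
dec_layer = nn.TransformerDecoderLayer(dim, 8, 4 * dim, batch_first=True)
to_pixels = nn.Linear(dim, 16 * 16 * 3)        # predict raw patch pixels

v1 = torch.randn(2, n_patches, dim)            # patch embeddings, view 1
v2 = torch.randn(2, n_patches, dim)            # patch embeddings, view 2

keep = int(n_patches * (1 - mask_ratio))
visible = enc_layer(v1[:, :keep])              # encode only visible view-1 patches
context = enc_layer(v2)                        # full second view as context

# Learned mask tokens would stand in for the hidden patches; zeros here.
mask_tok = torch.zeros(2, n_patches - keep, dim)
queries = torch.cat([visible, mask_tok], dim=1)
recon = to_pixels(dec_layer(queries, context))  # (2, 196, 768) pixel targets
```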
arXiv Detail & Related papers (2023-11-15T16:51:18Z)
- HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception [97.55089867970874]
We introduce masked image modeling (MIM) as a pre-training approach for human-centric perception.
We further incorporate an intuitive human structure prior - human parts - into pre-training.
This encourages the model to concentrate more on body structure information during pre-training, yielding substantial benefits across a range of human-centric perception tasks.
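One plausible reading of part-guided masking, as a toy sketch: patches overlapping body parts are masked with higher probability, so reconstruction must attend to human structure. The part grid and weighting below are invented for illustration, not HAP's actual scheme:
```python
import torch

def structure_aware_mask(part_grid, mask_ratio=0.6, body_weight=3.0):
    """part_grid: (H, W) bool over patch positions, True where a body part lies."""
    weights = part_grid.float() * (body_weight - 1.0) + 1.0  # parts weighted up
    n_mask = int(mask_ratio * weights.numel())
    idx = torch.multinomial(weights.flatten(), n_mask, replacement=False)
    mask = torch.zeros(weights.numel(), dtype=torch.bool)
    mask[idx] = True
    return mask.view(part_grid.shape)

part_grid = torch.zeros(14, 14, dtype=torch.bool)
part_grid[3:11, 5:9] = True                # fake torso region on the patch grid
mask = structure_aware_mask(part_grid)     # masking biased toward body patches
```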
arXiv Detail & Related papers (2023-10-31T17:56:11Z)
- SynBody: Synthetic Dataset with Layered Human Models for 3D Human Perception and Modeling [93.60731530276911]
We introduce a new synthetic dataset, SynBody, with three appealing features.
The dataset comprises 1.2M images with corresponding accurate 3D annotations, covering 10,000 human body models, 1,187 actions, and various viewpoints.
arXiv Detail & Related papers (2023-03-30T13:30:12Z)
- UniHCP: A Unified Model for Human-Centric Perceptions [75.38263862084641]
We propose a Unified Model for Human-Centric Perceptions (UniHCP).
UniHCP unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture.
With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines by direct evaluation.
arXiv Detail & Related papers (2023-03-06T07:10:07Z)
- StyleGAN-Human: A Data-Centric Odyssey of Human Generation [96.7080874757475]
This work takes a data-centric perspective and investigates multiple critical aspects of "data engineering".
We collect and annotate a large-scale human image dataset with over 230K samples capturing diverse poses and textures.
We rigorously investigate three essential factors in data engineering for StyleGAN-based human generation, namely data size, data distribution, and data alignment.
arXiv Detail & Related papers (2022-04-25T17:55:08Z)
- Partial success in closing the gap between human and machine vision [30.78663978510427]
A few years ago, the first CNN surpassed human performance on ImageNet.
Here we ask: Are we making progress in closing the gap between human and machine vision?
We tested human observers on a broad range of out-of-distribution (OOD) datasets.
arXiv Detail & Related papers (2021-06-14T13:23:35Z)
- LiftFormer: 3D Human Pose Estimation using attention models [0.0]
We propose leveraging attention mechanisms on ordered sequences of human poses in videos to obtain more accurate 3D predictions.
On Human3.6M, our method consistently outperforms the previous best published results, both with 2D keypoint predictors as input (by 0.3 mm: 44.8 mm MPJPE, a 0.7% improvement) and with ground-truth 2D inputs (by 2 mm: 31.9 mm MPJPE, an 8.4% improvement).
Our 3D lifting model's accuracy exceeds that of other end-to-end or SMPL approaches and is comparable to many multi-view methods.
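A hedged sketch of attention-based 2D-to-3D lifting in this spirit: a transformer encoder attends over a temporal window of 2D poses and regresses the 3D pose of the center frame. All names and sizes are illustrative, not the LiftFormer implementation:
```python
import torch
import torch.nn as nn

J, T, dim = 17, 27, 256                      # joints, temporal window, width

class Lifter(nn.Module):
    def __init__(self):
        super().__init__()
        self.inp = nn.Linear(J * 2, dim)
        self.pos = nn.Parameter(torch.zeros(1, T, dim))  # temporal positions
        layer = nn.TransformerEncoderLayer(dim, 8, 4 * dim, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, 4)
        self.out = nn.Linear(dim, J * 3)

    def forward(self, kp2d):                 # kp2d: (B, T, J, 2)
        x = self.enc(self.inp(kp2d.flatten(2)) + self.pos)
        return self.out(x[:, T // 2]).view(-1, J, 3)  # center-frame 3D pose

poses2d = torch.randn(8, T, J, 2)            # e.g. off-the-shelf 2D detections
pred3d = Lifter()(poses2d)                   # (8, 17, 3)
```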
arXiv Detail & Related papers (2020-09-01T11:05:45Z)
- Cascaded deep monocular 3D human pose estimation with evolutionary training data [76.3478675752847]
Deep representation learning has achieved remarkable accuracy for monocular 3D human pose estimation.
This paper proposes a novel data augmentation method that is scalable to massive amounts of training data.
Our method synthesizes unseen 3D human skeletons based on a hierarchical human representation and heuristics inspired by prior knowledge.
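A toy sketch of the evolutionary idea under invented representations: new skeletons are bred by crossing over body-part bone vectors between two parents and applying small length-preserving mutations. The part table and operators are illustrative, not the paper's method:
```python
import torch

PARTS = {"torso": [0, 1, 2], "left_arm": [3, 4], "right_arm": [5, 6],
         "left_leg": [7, 8], "right_leg": [9, 10]}  # bone indices per part

def crossover(bones_a, bones_b):
    """Swap whole body parts between two parents' bone-vector sets."""
    child = bones_a.clone()
    for part, idx in PARTS.items():
        if torch.rand(1) < 0.5:
            child[idx] = bones_b[idx]
    return child

def mutate(bones, scale=0.02):
    """Small random perturbation of bone directions, preserving bone lengths."""
    lengths = bones.norm(dim=-1, keepdim=True)
    noisy = bones + torch.randn_like(bones) * scale
    return noisy / noisy.norm(dim=-1, keepdim=True) * lengths

parent_a, parent_b = torch.randn(11, 3), torch.randn(11, 3)  # bone vectors
child = mutate(crossover(parent_a, parent_b))                # unseen skeleton
```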
arXiv Detail & Related papers (2020-06-14T03:09:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.