SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation
- URL: http://arxiv.org/abs/2309.17448v3
- Date: Sun, 28 Jul 2024 09:17:08 GMT
- Title: SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation
- Authors: Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, Ziwei Liu
- Abstract summary: Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications.
In this work, we investigate scaling up EHPS towards the first generalist foundation model (dubbed SMPLer-X) with up to ViT-Huge as the backbone.
With big data and the large model, SMPLer-X exhibits strong performance across diverse test benchmarks and excellent transferability to even unseen environments.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods still depend largely on a confined set of training datasets. In this work, we investigate scaling up EHPS towards the first generalist foundation model (dubbed SMPLer-X), with up to ViT-Huge as the backbone and training with up to 4.5M instances from diverse data sources. With big data and the large model, SMPLer-X exhibits strong performance across diverse test benchmarks and excellent transferability to even unseen environments. 1) For data scaling, we perform a systematic investigation on 32 EHPS datasets, covering a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. 2) For model scaling, we take advantage of vision transformers to study the scaling law of model sizes in EHPS. Moreover, our finetuning strategy turns SMPLer-X into specialist models, allowing them to achieve further performance boosts. Notably, our foundation model SMPLer-X consistently delivers state-of-the-art results on seven benchmarks such as AGORA (107.2 mm NMVE), UBody (57.4 mm PVE), EgoBody (63.6 mm PVE), and EHF (62.3 mm PVE without finetuning). Homepage: https://caizhongang.github.io/projects/SMPLer-X/
Related papers
- Neural Localizer Fields for Continuous 3D Human Pose and Shape Estimation [32.30055363306321]
We propose a paradigm for seamlessly unifying different human pose and shape-related tasks and datasets.
Our formulation is centered on the ability - both at training and test time - to query any arbitrary point of the human volume.
We can naturally exploit differently annotated data sources including mesh, 2D/3D skeleton and dense pose, without having to convert between them.
arXiv Detail & Related papers (2024-07-10T10:44:18Z) - Pretraining Billion-scale Geospatial Foundational Models on Frontier [0.16492989697868893]
Foundation Models (FMs) are trained with internet-scale unlabeled data via self-supervised learning.
We investigate billion-scale FMs and HPC training profiles for geospatial applications by pretraining on publicly available data.
Our larger 3B-parameter model achieves up to a 30% improvement in top-1 scene classification accuracy.
arXiv Detail & Related papers (2024-04-17T19:16:32Z) - SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation [53.675725490807615]
We introduce SDPose, a new self-distillation method for improving the performance of small transformer-based models.
SDPose-T obtains 69.7% mAP with 4.4M parameters and 1.8 GFLOPs, while SDPose-S-V2 obtains 73.5% mAP on the MSCOCO validation dataset.
arXiv Detail & Related papers (2024-04-04T15:23:14Z) - DPOT: Auto-Regressive Denoising Operator Transformer for Large-Scale PDE Pre-Training [87.90342423839876]
We present a new auto-regressive denoising pre-training strategy, which allows for more stable and efficient pre-training on PDE data.
We train our PDE foundation model with up to 0.5B parameters on 10+ PDE datasets with more than 100k trajectories.
arXiv Detail & Related papers (2024-03-06T08:38:34Z) - Turbulence in Focus: Benchmarking Scaling Behavior of 3D Volumetric Super-Resolution with BLASTNet 2.0 Data [4.293221567339693]
Analysis of compressible turbulent flows is essential for applications related to propulsion, energy generation, and the environment.
We present a 2.2 TB network-of-datasets containing 744 full-domain samples from 34 high-fidelity direct numerical simulations.
We benchmark a total of 49 variations of five deep learning approaches for 3D super-resolution.
arXiv Detail & Related papers (2023-09-23T18:57:02Z) - Towards Foundation Models for Scientific Machine Learning: Characterizing Scaling and Transfer Behavior [32.74388989649232]
We study how pre-training could be used for scientific machine learning (SciML) applications.
We find that fine-tuning these models yields more performance gains as model size increases.
arXiv Detail & Related papers (2023-06-01T00:32:59Z) - Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z) - Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, a larger pre-training dataset is usually needed to boost performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z) - Inference Stage Optimization for Cross-scenario 3D Human Pose Estimation [97.93687743378106]
Existing 3D pose estimation models suffer a performance drop when applied to new scenarios with unseen poses.
We propose a novel framework, Inference Stage Optimization (ISO), for improving the generalizability of 3D pose models.
Remarkably, it yields a new state-of-the-art of 83.6% 3D PCK on MPI-INF-3DHP, improving upon the previous best result by 9.7%.
arXiv Detail & Related papers (2020-07-04T09:45:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.