SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation
- URL: http://arxiv.org/abs/2501.09782v1
- Date: Thu, 16 Jan 2025 18:59:46 GMT
- Title: SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation
- Authors: Wanqi Yin, Zhongang Cai, Ruisi Wang, Ailing Zeng, Chen Wei, Qingping Sun, Haiyi Mei, Yanjun Wang, Hui En Pang, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Atsushi Yamashita, Lei Yang, Ziwei Liu
- Abstract summary: Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications.
Current state-of-the-art methods focus on training innovative architectural designs on confined datasets.
We investigate the impact of scaling up EHPS towards a family of generalist foundation models.
- Score: 81.36747103102459
- Abstract: Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods focus on training innovative architectural designs on confined datasets. In this work, we investigate the impact of scaling up EHPS towards a family of generalist foundation models. 1) For data scaling, we perform a systematic investigation on 40 EHPS datasets, encompassing a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. Ultimately, we achieve diminishing returns at 10M training instances from diverse data sources. 2) For model scaling, we take advantage of vision transformers (up to ViT-Huge as the backbone) to study the scaling law of model sizes in EHPS. To exclude the influence of algorithmic design, we base our experiments on two minimalist architectures: SMPLer-X, which consists of an intermediate step for hand and face localization, and SMPLest-X, an even simpler version that reduces the network to its bare essentials and highlights significant advances in the capture of articulated hands. With big data and the large model, the foundation models exhibit strong performance across diverse test benchmarks and excellent transferability to even unseen environments. Moreover, our finetuning strategy turns the generalist into specialist models, allowing them to achieve further performance boosts. Notably, our foundation models consistently deliver state-of-the-art results on seven benchmarks such as AGORA, UBody, EgoBody, and our proposed SynHand dataset for comprehensive hand evaluation. (Code is available at: https://github.com/wqyin/SMPLest-X).
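The abstract describes SMPLest-X as a network reduced to its bare essentials: a ViT backbone that regresses SMPL-X parameters directly, with no intermediate hand/face localization step. The sketch below illustrates that kind of one-stage design in PyTorch; the parameter group sizes, stand-in backbone, and all names are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

# Illustrative SMPL-X parameter group sizes (6D rotation representation assumed);
# the actual SMPLest-X head layout lives in the authors' repository.
PARAM_DIMS = {
    "global_orient": 6,         # root rotation
    "body_pose": 21 * 6,        # 21 body joints
    "left_hand_pose": 15 * 6,   # 15 joints per articulated hand
    "right_hand_pose": 15 * 6,
    "jaw_pose": 6,
    "betas": 10,                # shape coefficients
    "expression": 10,           # facial expression coefficients
    "cam_trans": 3,             # camera translation
}

class MinimalEHPSRegressor(nn.Module):
    """One-stage regressor: image features -> SMPL-X parameter groups."""

    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone  # e.g. a ViT; any (B,3,H,W) -> (B,feat_dim) encoder
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feat_dim, dim) for name, dim in PARAM_DIMS.items()}
        )

    def forward(self, images: torch.Tensor) -> dict:
        feats = self.backbone(images)
        return {name: head(feats) for name, head in self.heads.items()}

# Stand-in backbone so the sketch runs without extra dependencies.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=16, stride=16),  # crude patch embedding
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
model = MinimalEHPSRegressor(backbone, feat_dim=64)
params = model(torch.randn(2, 3, 256, 192))       # a typical human-crop resolution
print({k: tuple(v.shape) for k, v in params.items()})
```

Swapping the stand-in backbone for a pretrained ViT (up to ViT-Huge, per the abstract) is the model-scaling axis the paper studies; the head structure stays fixed.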
Related papers
- DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data [61.62554324594797]
We propose DreamMask, which explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data (a data-mixing sketch follows this entry).
In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods.
For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
arXiv Detail & Related papers (2025-01-03T19:00:00Z) - Scaling Laws for Task-Optimized Models of the Primate Visual Ventral Stream [3.4526439922541705]
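For the DreamMask entry above, training on both real and synthetic data typically reduces to a mixed-sampling recipe. Below is a minimal PyTorch sketch assuming a simple 50/50 real/synthetic sampling ratio; the toy datasets and the ratio are illustrative assumptions, not DreamMask's actual procedure.

```python
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Toy stand-ins for a real dataset (e.g. COCO crops) and a synthetic one.
real = TensorDataset(torch.randn(1000, 3, 64, 64))
synth = TensorDataset(torch.randn(4000, 3, 64, 64))
mixed = ConcatDataset([real, synth])

# Draw real and synthetic samples with equal probability per batch,
# regardless of raw counts; the 50/50 ratio is an assumption.
weights = torch.cat([
    torch.full((len(real),), 0.5 / len(real)),
    torch.full((len(synth),), 0.5 / len(synth)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=64, sampler=sampler)
(batch,) = next(iter(loader))  # each batch mixes both sources
```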
- Scaling Laws for Task-Optimized Models of the Primate Visual Ventral Stream [3.4526439922541705]
We evaluate scaling laws for modeling the primate visual ventral stream (VVS).
We observe that while behavioral alignment continues to scale with larger models, neural alignment saturates.
Increased scaling is especially beneficial for higher-level visual areas, where small models trained on few samples exhibit poor alignment.
arXiv Detail & Related papers (2024-11-08T17:13:53Z) - OReole-FM: successes and challenges toward billion-parameter foundation models for high-resolution satellite imagery [0.3926357402982764]
Scaling models to billions of parameters has been shown to yield unprecedented benefits including emergent abilities.
We pair high-performance computing resources, including the Frontier supercomputer (America's first exascale system), with high-resolution optical remote sensing data to pretrain billion-scale FMs.
arXiv Detail & Related papers (2024-10-25T20:55:12Z) - GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models [41.76935689355034]
Discriminative and generative pretraining have yielded geometry estimation models with strong generalization capabilities.
We build fair and strong baselines for evaluating and analyzing geometry estimation models.
We evaluate monocular geometry estimators on more challenging benchmarks with diverse scenes and high-quality annotations.
arXiv Detail & Related papers (2024-06-18T14:44:12Z) - Pretraining Billion-scale Geospatial Foundational Models on Frontier [0.16492989697868893]
Foundation Models (FMs) are trained with internet-scale unlabeled data via self-supervised learning.
We investigate billion-scale FMs and HPC training profiles for geospatial applications by pretraining on publicly available data.
Our larger 3B-parameter model achieves up to a 30% improvement in top-1 scene classification accuracy.
arXiv Detail & Related papers (2024-04-17T19:16:32Z) - SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation [83.18930314027254]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications.
In this work, we investigate scaling up EHPS towards the first generalist foundation model (dubbed SMPLer-X) with up to ViT-Huge as the backbone.
With big data and the large model, SMPLer-X exhibits strong performance across diverse test benchmarks and excellent transferability to even unseen environments.
arXiv Detail & Related papers (2023-09-29T17:58:06Z) - Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient
Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition reorganizes and factorizes a parameter matrix into two parts: a central tensor and auxiliary tensors.
Our architecture shares the central tensor across all layers to reduce the model size (a decomposition sketch follows this entry).
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
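For the MPO entry above, a matrix product operator decomposition factors a weight matrix into a chain of smaller cores, of which the central one can be shared across layers. The sketch below builds a three-core decomposition with two truncated SVDs; the core layout and rank handling are illustrative assumptions, not the paper's exact algorithm.

```python
import torch

def mpo_decompose(W, out_shape, in_shape, rank):
    """Factor W (out_dim x in_dim) into three MPO cores via two truncated SVDs.

    out_shape / in_shape are 3-tuples whose products equal W's dimensions.
    """
    i1, i2, i3 = out_shape
    j1, j2, j3 = in_shape
    # Reshape to (i1, i2, i3, j1, j2, j3), then interleave index pairs (i_k, j_k).
    T = W.reshape(i1, i2, i3, j1, j2, j3).permute(0, 3, 1, 4, 2, 5)
    T = T.reshape(i1 * j1, -1)
    # First SVD: peel off the left core.
    U, S, Vh = torch.linalg.svd(T, full_matrices=False)
    r1 = min(rank, S.numel())
    left = U[:, :r1]                                   # (i1*j1, r1)
    rest = (S[:r1, None] * Vh[:r1]).reshape(r1 * i2 * j2, i3 * j3)
    # Second SVD: split the central core (shareable across layers) from the right core.
    U2, S2, Vh2 = torch.linalg.svd(rest, full_matrices=False)
    r2 = min(rank, S2.numel())
    central = U2[:, :r2].reshape(r1, i2 * j2, r2)      # the central tensor
    right = S2[:r2, None] * Vh2[:r2]                   # (r2, i3*j3)
    return left, central, right

# Exact reconstruction at full rank (illustrative shapes: 8 = 2*2*2, 27 = 3*3*3).
W = torch.randn(8, 27)
left, central, right = mpo_decompose(W, (2, 2, 2), (3, 3, 3), rank=64)
W_hat = torch.einsum("ar,rbs,sc->abc", left, central, right)
W_hat = W_hat.reshape(2, 3, 2, 3, 2, 3).permute(0, 2, 4, 1, 3, 5).reshape(8, 27)
print(torch.allclose(W, W_hat, atol=1e-5))             # True
```

Sharing `central` across layers while keeping small per-layer `left`/`right` cores is what yields the parameter savings the summary describes.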
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
Action Recognition [88.34182299496074]
Action labels are available only for the source dataset, not the target dataset, during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised classification tasks (a pretext-task sketch follows this entry).
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.