HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception
- URL: http://arxiv.org/abs/2310.20695v1
- Date: Tue, 31 Oct 2023 17:56:11 GMT
- Title: HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception
- Authors: Junkun Yuan, Xinyu Zhang, Hao Zhou, Jian Wang, Zhongwei Qiu, Zhiyin
Shao, Shaofeng Zhang, Sifan Long, Kun Kuang, Kun Yao, Junyu Han, Errui Ding,
Lanfen Lin, Fei Wu, Jingdong Wang
- Abstract summary: We introduce masked image modeling (MIM) as a pre-training approach for human-centric perception.
Motivated by this insight, we incorporate an intuitive human structure prior - human parts - into pre-training.
This encourages the model to concentrate more on body structure information during pre-training, yielding substantial benefits across a range of human-centric perception tasks.
- Score: 97.55089867970874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model pre-training is essential in human-centric perception. In this paper,
we first introduce masked image modeling (MIM) as a pre-training approach for
this task. Upon revisiting the MIM training strategy, we reveal that human
structure priors offer significant potential. Motivated by this insight, we
further incorporate an intuitive human structure prior - human parts - into
pre-training. Specifically, we employ this prior to guide the mask sampling
process. Image patches, corresponding to human part regions, have high priority
to be masked out. This encourages the model to concentrate more on body
structure information during pre-training, yielding substantial benefits across
a range of human-centric perception tasks. To further capture human
characteristics, we propose a structure-invariant alignment loss that enforces
different masked views, guided by the human part prior, to be closely aligned
for the same image. We term the entire method as HAP. HAP simply uses a plain
ViT as the encoder yet establishes new state-of-the-art performance on 11
human-centric benchmarks, and on-par result on one dataset. For example, HAP
achieves 78.1% mAP on MSMT17 for person re-identification, 86.54% mA on PA-100K
for pedestrian attribute recognition, 78.2% AP on MS COCO for 2D pose
estimation, and 56.0 PA-MPJPE on 3DPW for 3D pose and shape estimation.
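The two ideas in the abstract — masking human-part patches with high priority, and aligning different masked views of the same image — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the `part_priority` knob, and the plain cosine form of the alignment loss are assumptions for exposition; HAP's actual sampling schedule and loss may differ.

```python
import numpy as np

def part_guided_mask(num_patches, part_idx, mask_ratio=0.75,
                     part_priority=0.8, rng=None):
    """Sample a binary patch mask, preferentially masking human-part patches.

    part_idx: indices of patches overlapping human part regions.
    part_priority: fraction of the mask budget drawn from part patches first
        (hypothetical knob; the paper's exact sampling rule may differ).
    """
    rng = np.random.default_rng(rng)
    n_mask = int(round(num_patches * mask_ratio))
    part_idx = np.asarray(part_idx)

    # Spend most of the mask budget on human-part patches first.
    n_from_parts = min(len(part_idx), int(round(n_mask * part_priority)))
    from_parts = rng.choice(part_idx, size=n_from_parts, replace=False)

    # Fill the remaining budget from all still-unmasked patches.
    pool = np.setdiff1d(np.arange(num_patches), from_parts)
    from_rest = rng.choice(pool, size=n_mask - n_from_parts, replace=False)

    mask = np.zeros(num_patches, dtype=bool)
    mask[np.concatenate([from_parts, from_rest])] = True
    return mask

def structure_invariant_alignment(z_a, z_b):
    """Cosine-style alignment loss between features of two masked views.

    z_a, z_b: (batch, dim) encoder features of two differently masked
    versions of the same image; loss is 0 when the views agree exactly.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=-1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(z_a * z_b, axis=-1)))
```

For a 14x14 ViT patch grid (196 patches) at a 75% mask ratio, the sampler masks 147 patches, drawing from the part region first, so body patches are consistently hidden and the model must reconstruct them from context.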
Related papers
- Sapiens: Foundation for Human Vision Models [14.72839332332364]
We present Sapiens, a family of models for four fundamental human-centric vision tasks.
Our models support 1K high-resolution inference and are easy to adapt for individual tasks.
We observe that self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks.
arXiv Detail & Related papers (2024-08-22T17:37:27Z)
- HINT: Learning Complete Human Neural Representations from Limited Viewpoints [69.76947323932107]
We propose a NeRF-based algorithm able to learn a detailed and complete human model from limited viewing angles.
As a result, our method can reconstruct complete humans even from a few viewing angles, increasing performance by more than 15% PSNR.
arXiv Detail & Related papers (2024-05-30T05:43:09Z)
- AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation [55.179287851188036]
We introduce a novel all-in-one-stage framework, AiOS, for expressive human pose and shape recovery without an additional human detection step.
We first employ a human token to probe a human location in the image and encode global features for each instance.
Then, we introduce a joint-related token to probe the human joints in the image and encode a fine-grained local feature.
arXiv Detail & Related papers (2024-03-26T17:59:23Z)
- Cross-view and Cross-pose Completion for 3D Human Understanding [22.787947086152315]
We propose a pre-training approach based on self-supervised learning that works on human-centric data using only images.
We pre-train a model for body-centric tasks and one for hand-centric tasks.
With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks.
arXiv Detail & Related papers (2023-11-15T16:51:18Z)
- Toward High Quality Facial Representation Learning [58.873356953627614]
We propose a self-supervised pre-training framework, called Mask Contrastive Face (MCF).
We use the feature map of a pre-trained visual backbone as supervision and a partially pre-trained decoder for masked image modeling.
Our model achieves 0.932 NME_diag for AFLW-19 face alignment and a 93.96 F1 score for LaPa face parsing.
arXiv Detail & Related papers (2023-09-07T09:11:49Z)
- UniHCP: A Unified Model for Human-Centric Perceptions [75.38263862084641]
We propose a Unified Model for Human-Centric Perceptions (UniHCP).
UniHCP unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture.
With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines by direct evaluation.
arXiv Detail & Related papers (2023-03-06T07:10:07Z)
- Bottom-Up 2D Pose Estimation via Dual Anatomical Centers for Small-Scale Persons [75.86463396561744]
In multi-person 2D pose estimation, the bottom-up methods simultaneously predict poses for all persons.
Our method achieves a 38.4% improvement in bounding box precision and a 39.1% improvement in bounding box recall over the state of the art (SOTA).
For the human pose AP evaluation, we achieve a new SOTA (71.0 AP) on the COCO test-dev set with the single-scale testing.
arXiv Detail & Related papers (2022-08-25T10:09:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.