Multi-Human Mesh Recovery with Transformers
- URL: http://arxiv.org/abs/2402.16806v1
- Date: Mon, 26 Feb 2024 18:28:05 GMT
- Title: Multi-Human Mesh Recovery with Transformers
- Authors: Zeyu Wang, Zhenzhen Weng, Serena Yeung-Levy
- Abstract summary: We introduce a new model with a streamlined transformer-based design, featuring three critical design choices: multi-scale feature incorporation, focused attention mechanisms, and relative joint supervision.
Our proposed model demonstrates a significant performance improvement, surpassing state-of-the-art region-based and whole-image-based methods on various benchmarks involving multiple individuals.
- Score: 5.420974192779563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional approaches to human mesh recovery predominantly employ a
region-based strategy. This involves initially cropping out a human-centered
region as a preprocessing step, with subsequent modeling focused on this
zoomed-in image. While effective for single figures, this pipeline poses
challenges when dealing with images featuring multiple individuals, as
different people are processed separately, often leading to inaccuracies in
relative positioning. Despite the advantages of adopting a whole-image-based
approach to address this limitation, early efforts in this direction have
fallen short in performance compared to recent region-based methods. In this
work, we advocate for this under-explored area of modeling all people at once,
emphasizing its potential for improved accuracy in multi-person scenarios
through considering all individuals simultaneously and leveraging the overall
context and interactions. We introduce a new model with a streamlined
transformer-based design, featuring three critical design choices: multi-scale
feature incorporation, focused attention mechanisms, and relative joint
supervision. Our proposed model demonstrates a significant performance
improvement, surpassing state-of-the-art region-based and whole-image-based
methods on various benchmarks involving multiple individuals.
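The relative joint supervision mentioned among the design choices can be illustrated with a minimal sketch. This is an assumed formulation for illustration only (the function name, the use of plain L1 on pairwise offsets, and the array shapes are not taken from the paper): the idea is that supervising inter-person joint offsets, rather than each person's joints in isolation, penalizes exactly the relative-positioning errors that per-person cropping pipelines cannot see.

```python
import numpy as np

def relative_joint_loss(pred_joints, gt_joints):
    """L1 loss on pairwise relative joint positions across people.

    pred_joints, gt_joints: (P, J, 3) arrays -- P people, J joints, 3D coords.
    Supervising inter-person offsets encourages consistent placement of
    people relative to one another, which region-based (crop-per-person)
    pipelines cannot capture.
    """
    # Pairwise offsets between every pair of people, per joint:
    # broadcasting (P, 1, J, 3) - (1, P, J, 3) -> (P, P, J, 3)
    pred_rel = pred_joints[:, None] - pred_joints[None, :]
    gt_rel = gt_joints[:, None] - gt_joints[None, :]
    return np.abs(pred_rel - gt_rel).mean()

# A uniform translation of all people changes absolute positions but not
# relative offsets, so the relative loss stays zero:
gt = np.zeros((2, 17, 3))
gt[1] += [1.0, 0.0, 0.0]   # second person one unit to the right
pred = gt + 0.5            # everyone shifted together
assert np.isclose(relative_joint_loss(pred, gt), 0.0)
```

The sanity check at the end shows why such a loss is complementary to per-joint supervision: a prediction that translates the whole scene has nonzero absolute joint error but zero relative error, so the two terms penalize different failure modes.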
Related papers
- Multi-modal Pose Diffuser: A Multimodal Generative Conditional Pose Prior [8.314155285516073]
MOPED is the first method to leverage a novel multi-modal conditional diffusion model as a prior for SMPL pose parameters.
Our method offers powerful unconditional pose generation with the ability to condition on multi-modal inputs such as images and text.
arXiv Detail & Related papers (2024-10-18T15:29:19Z)
- DPoser: Diffusion Model as Robust 3D Human Pose Prior [51.75784816929666]
We introduce DPoser, a robust and versatile human pose prior built upon diffusion models.
DPoser regards various pose-centric tasks as inverse problems and employs variational diffusion sampling for efficient solving.
Our approach demonstrates considerable enhancements over common uniform scheduling used in image domains, boasting improvements of 5.4%, 17.2%, and 3.8% across human mesh recovery, pose completion, and motion denoising, respectively.
arXiv Detail & Related papers (2023-12-09T11:18:45Z)
- DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation [16.32910684198013]
We present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem.
We show two unique characteristics from DiffPose on pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model.
arXiv Detail & Related papers (2023-07-31T14:00:23Z)
- IRGen: Generative Modeling for Image Retrieval [82.62022344988993]
In this paper, we present a novel methodology, reframing image retrieval as a variant of generative modeling.
We develop our model, dubbed IRGen, to address the technical challenge of converting an image into a concise sequence of semantic units.
Our model achieves state-of-the-art performance on three widely-used image retrieval benchmarks and two million-scale datasets.
arXiv Detail & Related papers (2023-03-17T17:07:36Z)
- Progressive Multi-view Human Mesh Recovery with Self-Supervision [68.60019434498703]
Existing solutions typically suffer from poor generalization performance to new settings.
We propose a novel simulation-based training pipeline for multi-view human mesh recovery.
arXiv Detail & Related papers (2022-12-10T06:28:29Z)
- Weakly-Supervised Multi-Face 3D Reconstruction [45.864415499303405]
We propose an effective end-to-end framework for multi-face 3D reconstruction.
We employ the same global camera model for the reconstructed faces in each image, which makes it possible to recover the relative head positions and orientations in the 3D scene.
arXiv Detail & Related papers (2021-01-06T13:15:21Z)
- Monocular Real-time Full Body Capture with Inter-part Correlations [66.22835689189237]
We present the first method for real-time full body capture that estimates shape and motion of body and hands together with a dynamic 3D face model from a single color image.
Our approach uses a new neural network architecture that exploits correlations between body and hands at high computational efficiency.
arXiv Detail & Related papers (2020-12-11T02:37:56Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
- Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-View Geometry [62.29762409558553]
Epipolar constraints are at the core of feature matching and depth estimation in multi-person 3D human pose estimation methods.
Despite the satisfactory performance of this formulation in sparser crowd scenes, its effectiveness is frequently challenged under denser crowd circumstances.
In this paper, we depart from the multi-person 3D pose estimation formulation, and instead reformulate it as crowd pose estimation.
arXiv Detail & Related papers (2020-07-21T17:59:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all linked content) and is not responsible for any consequences arising from its use.