JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human
Mesh Recovery
- URL: http://arxiv.org/abs/2307.16377v2
- Date: Thu, 17 Aug 2023 14:43:05 GMT
- Title: JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human
Mesh Recovery
- Authors: Jiahao Li, Zongxin Yang, Xiaohan Wang, Jianxin Ma, Chang Zhou, Yi Yang
- Abstract summary: This paper presents the 3D JOint contrastive learning with TRansformers (JOTR) framework for handling occluded 3D human mesh recovery.
Our method includes an encoder-decoder transformer architecture that fuses 2D and 3D representations to achieve 2D- and 3D-aligned results.
- Score: 84.67823511418334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we focus on the problem of 3D human mesh recovery from a
single image under occlusion. Most state-of-the-art methods aim to improve 2D
alignment techniques, such as spatial averaging and 2D joint sampling. However,
they tend to neglect the crucial aspect of 3D alignment, i.e., improving the 3D
representations themselves. Furthermore, recent methods struggle to separate the
target human from occluders or the background in crowded scenes because they
optimize the 3D space of the target human with 3D joint coordinates only as
local supervision. To address these issues, a desirable method would involve a
framework for fusing 2D and 3D features and a strategy for optimizing the 3D
space globally. Therefore, this paper presents the 3D JOint contrastive learning
with TRansformers (JOTR) framework for handling occluded 3D human mesh
recovery. Our method includes an encoder-decoder transformer architecture that
fuses 2D and 3D representations to achieve 2D- and 3D-aligned results in a
coarse-to-fine manner, and a novel 3D joint contrastive learning approach that
explicitly adds global supervision to the 3D feature space. The contrastive
learning approach includes two contrastive losses: joint-to-joint contrast for
enhancing the similarity of semantically similar voxels (i.e., human joints),
and joint-to-non-joint contrast for ensuring discrimination from others (e.g.,
occlusions and background). Qualitative and quantitative analyses demonstrate
that our method outperforms state-of-the-art competitors on both
occlusion-specific and standard benchmarks, significantly improving the
reconstruction of occluded humans.
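As a rough illustration of the two contrastive terms described above, the sketch below assumes per-voxel features sampled from a 3D feature volume together with a boolean mask marking which voxels correspond to human joints. The names (voxel_feats, joint_mask, tau) and the binary sigmoid/softplus contrastive formulation are illustrative assumptions, not the paper's actual loss.

```python
# Minimal sketch (PyTorch), not the authors' implementation: joint-to-joint and
# joint-to-non-joint contrastive terms over sampled voxel features.
import torch
import torch.nn.functional as F

def joint_contrastive_losses(voxel_feats: torch.Tensor,
                             joint_mask: torch.Tensor,
                             tau: float = 0.07):
    """voxel_feats: (N, C) features of N sampled voxels.
    joint_mask:  (N,) bool, True where a voxel belongs to a human joint.
    Returns (joint-to-joint loss, joint-to-non-joint loss)."""
    feats = F.normalize(voxel_feats, dim=-1)   # work in cosine-similarity space
    joints = feats[joint_mask]                 # semantically similar voxels (joints)
    others = feats[~joint_mask]                # occluders and background

    # Joint-to-joint contrast: treat every pair of joint voxels as a positive
    # pair and reward high (temperature-scaled) similarity, excluding self-pairs.
    sim_jj = joints @ joints.t() / tau
    off_diag = ~torch.eye(joints.shape[0], dtype=torch.bool, device=feats.device)
    l_j2j = -F.logsigmoid(sim_jj[off_diag]).mean()

    # Joint-to-non-joint contrast: treat joint/non-joint pairs as negatives and
    # penalise any remaining similarity, pushing joints away from occlusions.
    sim_jn = joints @ others.t() / tau
    l_j2n = F.softplus(sim_jn).mean()
    return l_j2j, l_j2n
```

An InfoNCE-style normalisation over joint and non-joint voxels together would be an equally plausible reading of the abstract; the point of the sketch is only that the first term pulls joint voxels toward each other while the second pushes them away from occlusion and background voxels, giving the 3D feature space the explicit global supervision the abstract describes.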
Related papers
- Enhancing 3D Human Pose Estimation Amidst Severe Occlusion with Dual Transformer Fusion [13.938406073551844]
This paper introduces the Dual Transformer Fusion (DTF) algorithm, a novel approach for obtaining holistic 3D pose estimates.
To enable precise 3D Human Pose Estimation, our approach leverages the innovative DTF architecture, which first generates a pair of intermediate views.
Our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements.
arXiv Detail & Related papers (2024-10-06T18:15:27Z)
- Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors [16.93758384693786]
Bidirectional Diffusion (BiDiff) is a unified framework that incorporates both a 3D and a 2D diffusion process.
Our model achieves high-quality, diverse, and scalable 3D generation.
arXiv Detail & Related papers (2023-12-07T10:00:04Z)
- 3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pose Estimation [28.24765523800196]
We propose 3D-aware Neural Body Fitting (3DNBF) for 3D human pose estimation.
In particular, we propose a generative model of deep features based on a volumetric human representation with Gaussian ellipsoidal kernels emitting 3D pose-dependent feature vectors.
The neural features are trained with contrastive learning to become 3D-aware and hence to overcome the 2D-3D ambiguity.
arXiv Detail & Related papers (2023-08-19T22:41:00Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- Deep-MDS Framework for Recovering the 3D Shape of 2D Landmarks from a Single Image [8.368476827165114]
This paper proposes a framework to recover the 3D shape of 2D landmarks on a human face from a single input image.
A deep neural network learns the pairwise dissimilarities among the 2D landmarks, which are then used by an NMDS approach.
arXiv Detail & Related papers (2022-10-27T06:20:10Z)
- RiCS: A 2D Self-Occlusion Map for Harmonizing Volumetric Objects [68.85305626324694]
Ray-marching in Camera Space (RiCS) is a new method that represents the self-occlusions of 3D foreground objects as a 2D self-occlusion map.
We show that our representation map not only allows us to enhance the image quality but also to model temporally coherent complex shadow effects.
arXiv Detail & Related papers (2022-05-14T05:35:35Z)
- Homography Loss for Monocular 3D Object Detection [54.04870007473932]
A differentiable loss function, termed the Homography Loss, which exploits both 2D and 3D information, is proposed to achieve this goal.
Our method outperforms the other state-of-the-art methods by a large margin on the KITTI 3D datasets.
arXiv Detail & Related papers (2022-04-02T03:48:03Z)
- Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation [107.07047303858664]
Large-scale human datasets with 3D ground-truth annotations are difficult to obtain in the wild.
We address this problem by augmenting existing 2D datasets with high-quality 3D pose fits.
The resulting annotations are sufficient to train from scratch 3D pose regressor networks that outperform the current state-of-the-art on in-the-wild benchmarks.
arXiv Detail & Related papers (2020-04-07T20:21:18Z)
- Learning 3D Human Shape and Pose from Dense Body Parts [117.46290013548533]
We propose a Decompose-and-aggregate Network (DaNet) to learn 3D human shape and pose from dense correspondences of body parts.
Messages from local streams are aggregated to enhance the robust prediction of the rotation-based poses.
Our method is validated on both indoor and real-world datasets including Human3.6M, UP3D, COCO, and 3DPW.
arXiv Detail & Related papers (2019-12-31T15:09:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.