Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D
Human Pose Estimation
- URL: http://arxiv.org/abs/2004.03686v3
- Date: Fri, 22 Oct 2021 02:55:04 GMT
- Title: Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D
Human Pose Estimation
- Authors: Hanbyul Joo, Natalia Neverova, Andrea Vedaldi
- Abstract summary: Large-scale human datasets with 3D ground-truth annotations are difficult to obtain in the wild.
We address this problem by augmenting existing 2D datasets with high-quality 3D pose fits.
The resulting annotations are sufficient to train from scratch 3D pose regressor networks that outperform the current state-of-the-art on in-the-wild benchmarks.
- Score: 107.07047303858664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unlike 2D image datasets such as COCO, large-scale human datasets
with 3D ground-truth annotations are very difficult to obtain in the wild. In
this paper, we address this problem by augmenting existing 2D datasets with
high-quality 3D pose fits. Remarkably, the resulting annotations are sufficient
to train from scratch 3D pose regressor networks that outperform the current
state-of-the-art on in-the-wild benchmarks such as 3DPW. Additionally, training
on our augmented data is straightforward, as it does not require mixing multiple
incompatible 2D and 3D datasets or using complicated network architectures
and training procedures. This simplified pipeline affords additional
improvements, including injecting extreme crop augmentations to better
reconstruct highly truncated people, and incorporating auxiliary inputs to
improve 3D pose estimation accuracy. It also reduces the dependency on 3D
datasets such as H36M that have restrictive licenses. We also use our method to
introduce new benchmarks for the study of real-world challenges such as
occlusions, truncations, and rare body poses. To obtain such high-quality 3D
pseudo-annotations, inspired by progress in internal learning, we
introduce Exemplar Fine-Tuning (EFT). EFT combines the re-projection accuracy
of fitting methods like SMPLify with a 3D pose prior implicitly captured by a
pre-trained 3D pose regressor network. We show that EFT produces 3D annotations
that result in better downstream performance and are qualitatively preferable
in an extensive human-based assessment.
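The EFT procedure described above reduces to a short optimization loop. Below is a minimal sketch in PyTorch, assuming a pre-trained SMPL regressor that maps an image to (pose, shape, camera) parameters and a differentiable `project_joints` reprojection function; these names, hyper-parameters, and the exact loss weighting are illustrative assumptions, not the authors' released API.

```python
# Minimal EFT sketch (illustrative, not the authors' implementation).
import copy
import torch

def eft_fit(regressor, image, kp2d, conf, project_joints, steps=50, lr=5e-6):
    """Fine-tune a copy of the regressor on a single exemplar (EFT)."""
    model = copy.deepcopy(regressor)        # keep the shared weights intact
    model.train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        pose, shape, cam = model(image)     # assumed output: SMPL params + camera
        joints2d = project_joints(pose, shape, cam)  # differentiable reprojection
        # Confidence-weighted 2D reprojection loss, as in SMPLify-style fitting;
        # the 3D pose prior is implicit in the pre-trained network weights.
        loss = (conf * (joints2d - kp2d).pow(2).sum(dim=-1)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        return model(image)                 # stored as the 3D pseudo-annotation
```

Unlike SMPLify, which optimizes SMPL parameters directly under an explicit pose prior, EFT optimizes the regressor's weights, so the prior is whatever the pre-trained network has absorbed; the converged output is then kept as a 3D pseudo-annotation for the 2D image.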
Related papers
- Improving 2D Feature Representations by 3D-Aware Fine-Tuning [17.01280751430423]
Current visual foundation models are trained purely on unstructured 2D data.
We show that fine-tuning on 3D-aware data improves the quality of emerging semantic features.
arXiv Detail & Related papers (2024-07-29T17:59:21Z) - DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z) - Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D
priors [16.93758384693786]
Bidirectional Diffusion (BiDiff) is a unified framework that incorporates both a 3D and a 2D diffusion process.
Our model achieves high-quality, diverse, and scalable 3D generation.
arXiv Detail & Related papers (2023-12-07T10:00:04Z) - Decanus to Legatus: Synthetic training for 2D-3D human pose lifting [26.108023246654646]
We propose an algorithm to generate infinite 3D synthetic human poses (Legatus) from a 3D pose distribution based on 10 initial handcrafted 3D poses (Decanus).
Our results show that we can achieve 3D pose estimation performance comparable to methods using real data from specialized datasets but in a zero-shot setup, showing the potential of our framework.
arXiv Detail & Related papers (2022-10-05T13:10:19Z) - PoseTriplet: Co-evolving 3D Human Pose Estimation, Imitation, and
Hallucination under Self-supervision [102.48681650013698]
Existing self-supervised 3D human pose estimation schemes have largely relied on weak supervision to guide the learning.
We propose a novel self-supervised approach that allows us to explicitly generate 2D-3D pose pairs for augmenting supervision.
This is made possible via introducing a reinforcement-learning-based imitator, which is learned jointly with a pose estimator alongside a pose hallucinator.
arXiv Detail & Related papers (2022-03-29T14:45:53Z) - Data Efficient 3D Learner via Knowledge Transferred from 2D Model [30.077342050473515]
We deal with the data scarcity challenge of 3D tasks by transferring knowledge from strong 2D models via RGB-D images.
We utilize a strong and well-trained semantic segmentation model for 2D images to augment RGB-D images with pseudo-labels.
Our method already outperforms existing state-of-the-art approaches tailored for 3D label efficiency.
arXiv Detail & Related papers (2022-03-16T09:14:44Z) - Advancing 3D Medical Image Analysis with Variable Dimension Transform
based Supervised 3D Pre-training [45.90045513731704]
This paper revisits an innovative yet simple fully-supervised 3D network pre-training framework.
With a redesigned 3D network architecture, reformulated natural images are used to address the problem of data scarcity.
Comprehensive experiments on four benchmark datasets demonstrate that the proposed pre-trained models can effectively accelerate convergence.
arXiv Detail & Related papers (2022-01-05T03:11:21Z) - Asymmetric 3D Context Fusion for Universal Lesion Detection [55.61873234187917]
3D networks are strong in 3D context yet lack supervised pretraining.
Existing 3D context fusion operators are designed to be spatially symmetric, performing identical operations on each 2D slice, as convolutions do.
We propose a novel asymmetric 3D context fusion operator (A3D), which uses different weights to fuse 3D context from different 2D slices; a toy sketch of the idea appears after this list.
arXiv Detail & Related papers (2021-09-17T16:25:10Z) - Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate the 3D meshes of multiple body parts with large differences in scale from a single RGB image.
The main challenge is the lack of training data with complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection that incorporates the per-joint depth difference into the projection function to derive per-joint scale variants (see the sketch after this list).
arXiv Detail & Related papers (2020-10-27T03:31:35Z) - Cascaded deep monocular 3D human pose estimation with evolutionary
training data [76.3478675752847]
Deep representation learning has achieved remarkable accuracy for monocular 3D human pose estimation.
This paper proposes a novel data augmentation method that scales to massive amounts of training data.
Our method synthesizes unseen 3D human skeletons based on a hierarchical human representation and heuristics inspired by prior knowledge.
arXiv Detail & Related papers (2020-06-14T03:09:52Z)
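As referenced in the A3D entry above, here is a toy sketch of asymmetric slice fusion, assuming stacked 2D slice features of shape (B, C, D, H, W); the class name and the per-slice 1x1 convolutions are illustrative assumptions, and the paper's actual A3D operator differs in detail.

```python
# Toy asymmetric fusion: each depth slice gets its own weights, so the
# operator is NOT symmetric along the slice axis (unlike a shared 3D conv).
import torch
import torch.nn as nn

class AsymmetricSliceFusion(nn.Module):
    def __init__(self, channels: int, depth: int):
        super().__init__()
        # One 1x1 projection per slice: weights are not shared across depth.
        self.per_slice = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W) feature stack from D 2D slices
        fused = sum(proj(x[:, :, d]) for d, proj in enumerate(self.per_slice))
        return fused  # (B, C, H, W): 3D context fused with slice-specific weights
```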
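For the D2S entry above, one hedged reading of a depth-to-scale projection is that each joint's image scale follows from its depth offset relative to the root, rather than from a single shared weak-perspective scale; the function below is an illustrative guess at that idea, not the paper's exact formulation.

```python
# Hedged D2S-style sketch: per-joint perspective scale f / (tz + z_j)
# instead of one shared weak-perspective scale for the whole body.
import torch

def d2s_project(joints3d: torch.Tensor, f: float, tz: float) -> torch.Tensor:
    """joints3d: (J, 3) root-relative joints; tz: root depth; f: focal length."""
    z = joints3d[:, 2]
    scale = f / (tz + z)  # per-joint scale varies with the depth difference
    return joints3d[:, :2] * scale.unsqueeze(-1)  # (J, 2) projected joints
```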
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.