Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild
- URL: http://arxiv.org/abs/2603.02619v1
- Date: Tue, 03 Mar 2026 05:47:18 GMT
- Title: Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild
- Authors: Seunguk Do, Minwoo Huh, Joonghyuk Shin, Jaesik Park
- Abstract summary: Single-view 3D human reconstruction has achieved remarkable progress, yet the recovered 3D humans often exhibit unnatural poses. We introduce DrPose, a Direct Reward fine-tuning algorithm on Poses, which enables post-training of a multi-view diffusion model on diverse poses. DrPose trains a model using only human poses paired with single-view images, employing direct reward fine-tuning to maximize PoseScore.
- Score: 29.18347483848261
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Single-view 3D human reconstruction has achieved remarkable progress through the adoption of multi-view diffusion models, yet the recovered 3D humans often exhibit unnatural poses. This phenomenon becomes pronounced when reconstructing 3D humans with dynamic or challenging poses, which we attribute to the limited scale of available 3D human datasets with diverse poses. To address this limitation, we introduce DrPose, a Direct Reward fine-tuning algorithm on Poses, which enables post-training of a multi-view diffusion model on diverse poses without requiring expensive 3D human assets. DrPose trains a model using only human poses paired with single-view images, employing direct reward fine-tuning to maximize PoseScore, which is our proposed differentiable reward that quantifies consistency between a generated multi-view latent image and a ground-truth human pose. This optimization is conducted on DrPose15K, a novel dataset that was constructed from an existing human motion dataset and a pose-conditioned video generative model. Constructed from abundant human pose sequence data, DrPose15K exhibits a broader pose distribution compared to existing 3D human datasets. We validate our approach through evaluation on conventional benchmark datasets, in-the-wild images, and a newly constructed benchmark, with a particular focus on assessing performance on challenging human poses. Our results demonstrate consistent qualitative and quantitative improvements across all benchmarks. Project page: https://seunguk-do.github.io/drpose.
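The abstract's core mechanism — optimizing a generative model by backpropagating through a differentiable pose-consistency reward — can be sketched in miniature. The toy below is not the authors' code: the multi-view diffusion model is replaced by a linear map from a latent to keypoints, and PoseScore is stood in for by a negative squared keypoint error; the function names, learning rate, and step count are all illustrative assumptions.

```python
# Toy sketch of direct reward fine-tuning: gradient ascent on a
# differentiable "PoseScore"-like reward (hypothetical stand-ins throughout).

def pose_score(pred, target):
    """Differentiable reward: higher when predicted keypoints match the target."""
    return -sum((p - t) ** 2 for p, t in zip(pred, target))

def predict(params, latent):
    """Stand-in 'model': elementwise scaling of the latent by learnable params."""
    return [w * z for w, z in zip(params, latent)]

def reward_grad(params, latent, target):
    """Analytic gradient of pose_score w.r.t. params for this linear model."""
    pred = predict(params, latent)
    return [-2.0 * (p - t) * z for p, t, z in zip(pred, target, latent)]

def reward_finetune(params, latent, target, lr=0.05, steps=200):
    """Gradient ascent on the reward: the core loop of reward fine-tuning."""
    for _ in range(steps):
        g = reward_grad(params, latent, target)
        params = [w + lr * gi for w, gi in zip(params, g)]
    return params

latent = [1.0, 2.0, -1.0]
target = [0.5, 3.0, 1.0]   # "ground-truth pose" keypoints
params = reward_finetune([0.0, 0.0, 0.0], latent, target)
```

After optimization the reward approaches its maximum of zero, i.e. the predicted keypoints match the ground-truth pose; in DrPose the same ascent direction would instead flow through the multi-view diffusion model's parameters.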
Related papers
- DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior [82.9526308672547]
We present DPoser-X, a diffusion-based prior model for 3D whole-body human poses. Our approach unifies various pose-centric tasks as inverse problems, solving them through variational diffusion sampling. Our model consistently outperforms state-of-the-art alternatives, establishing a new benchmark for whole-body human pose prior modeling.
arXiv Detail & Related papers (2025-08-01T12:56:39Z)
- MVD-HuGaS: Human Gaussians from a Single Image via 3D Human Multi-view Diffusion Prior [35.704591162502375]
We present MVD-HuGaS, enabling free-view 3D human rendering from a single image via a multi-view human diffusion model. Experiments on Thuman2.0 and 2K2K datasets show that the proposed MVD-HuGaS achieves state-of-the-art performance on single-view 3D human rendering.
arXiv Detail & Related papers (2025-03-11T09:37:15Z)
- Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers [28.38686299271394]
We propose a framework for 3D sequence-to-sequence (seq2seq) human pose detection.
Firstly, the spatial module represents the human pose feature by intra-image content, while the frame-image relation module extracts temporal relationships.
Our method is evaluated on Human3.6M, a popular 3D human pose detection dataset.
arXiv Detail & Related papers (2024-01-30T03:00:25Z)
- Pose-NDF: Modeling Human Pose Manifolds with Neural Distance Fields [47.62275563070933]
We present a continuous model for plausible human poses based on neural distance fields (NDFs).
Pose-NDF learns a manifold of plausible poses as the zero level set of a neural implicit function.
It can be used to generate more diverse poses by random sampling and projection than VAE-based methods.
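The projection mechanism summarized above can be illustrated with an analytic stand-in for the learned distance field. In this toy (not Pose-NDF's network), the "manifold of plausible poses" is the unit circle, and an implausible point is projected by stepping against the distance gradient.

```python
import math

# Toy manifold projection in the spirit of Pose-NDF: the distance field is
# analytic (signed distance to the unit circle) rather than a learned network.

def distance(x, y):
    """Signed distance to the unit circle, standing in for a learned NDF."""
    return math.hypot(x, y) - 1.0

def grad(x, y):
    """Gradient of the distance field: the unit radial direction."""
    n = math.hypot(x, y)
    return (x / n, y / n)

def project(x, y, steps=5):
    """Iteratively step along -distance * gradient, as in manifold projection."""
    for _ in range(steps):
        d = distance(x, y)
        gx, gy = grad(x, y)
        x, y = x - d * gx, y - d * gy
    return x, y

px, py = project(3.0, 4.0)   # lands on the unit circle
```

Random sampling followed by this projection is what lets such a prior generate diverse yet plausible poses.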
arXiv Detail & Related papers (2022-07-27T21:46:47Z)
- PoseGU: 3D Human Pose Estimation with Novel Human Pose Generator and Unbiased Learning [36.609189237732394]
3D pose estimation has recently gained substantial interest in the computer vision domain.
Existing 3D pose estimation methods rely strongly on large, well-annotated 3D pose datasets.
We propose PoseGU, a novel human pose generator that generates diverse poses with access to only a small set of seed samples.
arXiv Detail & Related papers (2022-07-07T23:43:53Z)
- LatentHuman: Shape-and-Pose Disentangled Latent Representation for Human Bodies [78.17425779503047]
We propose a novel neural implicit representation for the human body.
It is fully differentiable and optimizable with disentangled shape and pose latent spaces.
Our model can be trained and fine-tuned directly on non-watertight raw data with well-designed losses.
arXiv Detail & Related papers (2021-11-30T04:10:57Z)
- 3D Multi-bodies: Fitting Sets of Plausible 3D Human Models to Ambiguous Image Data [77.57798334776353]
We consider the problem of obtaining dense 3D reconstructions of humans from single and partially occluded views.
We suggest that ambiguities can be modelled more effectively by parametrizing the possible body shapes and poses.
We show that our method outperforms alternative approaches in ambiguous pose recovery on standard benchmarks for 3D humans.
arXiv Detail & Related papers (2020-11-02T13:55:31Z)
- HMOR: Hierarchical Multi-Person Ordinal Relations for Monocular Multi-Person 3D Pose Estimation [54.23770284299979]
This paper introduces a novel form of supervision - Hierarchical Multi-person Ordinal Relations (HMOR).
HMOR encodes interaction information as the ordinal relations of depths and angles hierarchically.
An integrated top-down model is designed to leverage these ordinal relations in the learning process.
The proposed method significantly outperforms state-of-the-art methods on publicly available multi-person 3D pose datasets.
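The ordinal-relation idea can be made concrete with a small sketch (hypothetical, not the paper's implementation): pairwise depth orderings, rather than exact depths, serve as the supervisory signal, penalized with a hinge loss. The function names and the `margin` parameter are illustrative assumptions.

```python
# Toy ordinal depth supervision in the spirit of HMOR.

def ordinal_relations(depths):
    """Pairwise ordinal depth relations: -1 closer, 0 equal, +1 farther."""
    n = len(depths)
    return [[(depths[i] > depths[j]) - (depths[i] < depths[j])
             for j in range(n)] for i in range(n)]

def ordinal_loss(pred_depths, gt_depths, margin=0.1):
    """Hinge penalty on predicted pairs whose ordering contradicts ground truth."""
    loss = 0.0
    n = len(pred_depths)
    for i in range(n):
        for j in range(n):
            r = (gt_depths[i] > gt_depths[j]) - (gt_depths[i] < gt_depths[j])
            if r != 0:
                # penalize when the signed predicted difference disagrees with r
                loss += max(0.0, margin - r * (pred_depths[i] - pred_depths[j]))
    return loss
```

A correctly ordered prediction incurs zero loss even if its absolute depths are wrong, which is exactly why ordinal supervision is cheaper to obtain than metric depth.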
arXiv Detail & Related papers (2020-08-01T07:53:27Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.