Related papers: SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment

URL: http://arxiv.org/abs/2603.00443v1
Date: Sat, 28 Feb 2026 03:51:51 GMT
Title: SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment
Authors: Zhuoran Zhao, Xianghao Kong, Linlin Yang, Zheng Wei, Pan Hui, Anyi Rao
Abstract summary: Generative models are promising alternatives to generate diverse hand images, but still suffer from misalignment issues. We present SesaHand, which enhances controllable hand image generation from both semantic and structural alignment perspectives. Experiments demonstrate that our method not only outperforms prior work in generation performance but also improves 3D hand reconstruction with the generated hand images.
Score: 38.103458669002684
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent studies on 3D hand reconstruction have demonstrated the effectiveness of synthetic training data to improve estimation performance. However, most methods rely on game engines to synthesize hand images, which often lack diversity in textures and environments, and fail to include crucial components like arms or interacting objects. Generative models are promising alternatives to generate diverse hand images, but still suffer from misalignment issues. In this paper, we present SesaHand, which enhances controllable hand image generation from both semantic and structural alignment perspectives for 3D hand reconstruction. Specifically, for semantic alignment, we propose a pipeline with Chain-of-Thought inference to extract human behavior semantics from image captions generated by the Vision-Language Model. This semantics suppresses human-irrelevant environmental details and ensures sufficient human-centric contexts for hand image generation. For structural alignment, we introduce hierarchical structural fusion to integrate structural information with different granularity for feature refinement to better align the hand and the overall human body in generated images. We further propose a hand structure attention enhancement method to efficiently enhance the model's attention on hand regions. Experiments demonstrate that our method not only outperforms prior work in generation performance but also improves 3D hand reconstruction with the generated hand images.

Related papers

HumanCrafter: Synergizing Generalizable Human Reconstruction and Semantic 3D Segmentation [51.27178551863772]
We propose a unified framework that enables the joint modeling of appearance and human-part semantics from a single image. HumanCrafter surpasses existing state-of-the-art methods in both 3D human-part segmentation and 3D human reconstruction from a single image.
arXiv Detail & Related papers (2025-11-01T09:29:36Z)
EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild [79.71523320368388]
Our work aims to reconstruct hand-object interactions from a single-view image. We first design a novel pipeline to estimate the underlying hand pose and object shape. With the initial reconstruction, we employ a prior-guided optimization scheme.
arXiv Detail & Related papers (2024-11-21T16:33:35Z)
RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance [41.213241942526935]
RHanDS is a conditional diffusion-based framework designed to refine malformed hands. The hand mesh reconstructed from the malformed hand offers structure guidance for correcting the structure of the hand. The malformed hand itself provides style guidance for preserving the style of the hand.
arXiv Detail & Related papers (2024-04-22T08:44:34Z)
HandBooster: Boosting 3D Hand-Mesh Reconstruction by Conditional Synthesis and Sampling of Hand-Object Interactions [68.28684509445529]
We present HandBooster, a new approach to uplift the data diversity and boost the 3D hand-mesh reconstruction performance. First, we construct versatile content-aware conditions to guide a diffusion model to produce realistic images with diverse hand appearances, poses, views, and backgrounds. Then, we design a novel condition creator based on our similarity-aware distribution sampling strategies to deliberately find novel and realistic interaction poses that are distinctive from the training set.
arXiv Detail & Related papers (2024-03-27T13:56:08Z)
HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances [34.50137847908887]
Text-to-image generative models can generate high-quality humans, but realism is lost when generating hands. Common artifacts include irregular hand poses, shapes, incorrect numbers of fingers, and physically implausible finger orientations. We propose a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings in the generative process.
arXiv Detail & Related papers (2024-03-04T03:00:22Z)
3D Points Splatting for Real-Time Dynamic Hand Reconstruction [13.392046706568275]
3D Points Splatting Hand Reconstruction (3D-PSHR) is a real-time and photo-realistic hand reconstruction approach. We propose a self-adaptive canonical points upsampling strategy to achieve high-resolution hand geometry representation. To model texture, we disentangle the appearance color into the intrinsic albedo and pose-aware shading.
arXiv Detail & Related papers (2023-12-21T11:50:49Z)
HandNeRF: Learning to Reconstruct Hand-Object Interaction Scene from a Single RGB Image [41.580285338167315]
This paper presents a method to learn hand-object interaction prior for reconstructing a 3D hand-object scene from a single RGB image. We use the hand shape to constrain the possible relative configuration of the hand and object geometry. We show that HandNeRF is able to reconstruct hand-object scenes of novel grasp configurations more accurately than comparable methods.
arXiv Detail & Related papers (2023-09-14T17:42:08Z)
HiFiHR: Enhancing 3D Hand Reconstruction from a Single Image via High-Fidelity Texture [40.012406098563204]
We present HiFiHR, a high-fidelity hand reconstruction approach that utilizes render-and-compare in the learning-based framework from a single image. Experimental results on public benchmarks including FreiHAND and HO-3D demonstrate that our method outperforms the state-of-the-art hand reconstruction methods in texture reconstruction quality.
arXiv Detail & Related papers (2023-08-25T18:48:40Z)
gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction [94.46581592405066]
We exploit the hand structure and use it as guidance for SDF-based shape reconstruction. We predict kinematic chains of pose transformations and align SDFs with highly-articulated hand poses.
arXiv Detail & Related papers (2023-04-24T10:05:48Z)
Joint Hand-object 3D Reconstruction from a Single Image with Cross-branch Feature Fusion [78.98074380040838]
We propose to consider hand and object jointly in feature space and explore the reciprocity of the two branches. We employ an auxiliary depth estimation module to augment the input RGB image with the estimated depth map. Our approach significantly outperforms existing approaches in terms of the reconstruction accuracy of objects.
arXiv Detail & Related papers (2020-06-28T09:50:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
