AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild
- URL: http://arxiv.org/abs/2407.18034v1
- Date: Thu, 25 Jul 2024 13:29:32 GMT
- Title: AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild
- Authors: Junho Park, Kyeongbo Kong, Suk-Ju Kang
- Abstract summary: AttentionHand is a novel method for text-driven controllable hand image generation.
It can generate diverse and numerous in-the-wild hand images that are well aligned with 3D hand labels.
It achieves state-of-the-art performance among text-to-hand image generation models.
- Score: 18.351368674337134
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, a significant amount of research has been conducted on 3D hand reconstruction for use in various forms of human-computer interaction. However, 3D hand reconstruction in the wild is challenging due to the extreme scarcity of in-the-wild 3D hand datasets. In particular, when hands are in complex poses, such as two interacting hands, problems like appearance similarity, self-occlusion, and depth ambiguity make reconstruction even more difficult. To overcome these issues, we propose AttentionHand, a novel method for text-driven controllable hand image generation. Since AttentionHand can generate diverse and numerous in-the-wild hand images that are well aligned with 3D hand labels, we can acquire a new 3D hand dataset and reduce the domain gap between indoor and outdoor scenes. Our method takes four easy-to-obtain modalities (i.e., an RGB image, a hand mesh image rendered from the 3D label, a bounding box, and a text prompt). These modalities are embedded into the latent space in the encoding phase. Then, in the text attention stage, hand-related tokens from the given text prompt are attended to highlight hand-related regions of the latent embedding. The highlighted embedding is then fed to the visual attention stage, where hand-related regions are attended by conditioning on global and local hand mesh images within the diffusion-based pipeline. In the decoding phase, the final feature is decoded into new hand images that are well aligned with the given hand mesh image and text prompt. As a result, AttentionHand achieves state-of-the-art performance among text-to-hand image generation models, and 3D hand mesh reconstruction improves when the reconstruction network is additionally trained with hand images generated by AttentionHand.
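To make the text attention stage above concrete, here is a minimal, hypothetical sketch: cross-attention between a flattened latent embedding and text-prompt tokens, with the attention logits of hand-related tokens boosted before normalization so that hand regions of the latent are highlighted. All function and variable names are illustrative assumptions, not the AttentionHand implementation.
```python
# Illustrative sketch (not the AttentionHand codebase): cross-attention that
# emphasizes hand-related text tokens to highlight hand regions in the latent.
import torch

def text_attention(latent, text_tokens, hand_token_mask, boost=2.0):
    """
    latent:          (B, N, C) flattened latent embedding (N spatial positions)
    text_tokens:     (B, T, C) encoded text-prompt tokens
    hand_token_mask: (B, T)    float mask, 1.0 for hand-related tokens
    boost:           scalar    how strongly to emphasize hand-related tokens
    """
    scale = latent.shape[-1] ** -0.5
    # Attention logits between every latent position and every text token.
    attn = torch.einsum("bnc,btc->bnt", latent, text_tokens) * scale  # (B, N, T)
    # Boost hand-related tokens before normalizing, so they dominate softmax.
    attn = attn + boost * hand_token_mask[:, None, :]
    attn = attn.softmax(dim=-1)
    # Aggregate text features into each latent position.
    out = torch.einsum("bnt,btc->bnc", attn, text_tokens)
    return latent + out  # residual: latent with hand regions highlighted
```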
Related papers
- Hand1000: Generating Realistic Hands from Text with Only 1,000 Images [29.562925199318197]
We propose a novel approach named Hand1000 that enables the generation of realistic hand images with the target gesture.
The training of Hand1000 is divided into three stages, with the first stage aiming to enhance the model's understanding of hand anatomy.
We construct the first publicly available dataset specifically designed for text-to-hand image generation.
arXiv Detail & Related papers (2024-08-28T00:54:51Z)
- HandGCAT: Occlusion-Robust 3D Hand Mesh Reconstruction from Monocular Images [9.554136347258057]
We propose a robust and accurate method for reconstructing 3D hand meshes from monocular images.
HandGCAT fully exploits the hand prior as compensation information to enhance features in occluded regions.
arXiv Detail & Related papers (2024-02-27T03:40:43Z)
- HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting [72.95232302438207]
Diffusion models have achieved remarkable success in generating realistic images.
However, they often fail to generate accurate human hands, producing artifacts such as incorrect finger counts or irregular shapes.
This paper introduces a lightweight post-processing solution called HandRefiner.
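As a hedged illustration of the generic diffusion-based inpainting that such a post-processing approach builds on (not HandRefiner's actual pipeline, which uses richer hand conditioning), one could mask the malformed hand region and regenerate it with an off-the-shelf inpainting model:
```python
# Illustrative only: generic diffusion inpainting over a hand mask with the
# Hugging Face diffusers library. File names are hypothetical placeholders.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("generated.png").convert("RGB")  # image with a malformed hand
mask = Image.open("hand_mask.png").convert("L")     # white where the hand is repainted
fixed = pipe(prompt="a realistic human hand", image=image, mask_image=mask).images[0]
fixed.save("refined.png")
```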
arXiv Detail & Related papers (2023-11-29T08:52:08Z)
- HandNeRF: Learning to Reconstruct Hand-Object Interaction Scene from a Single RGB Image [41.580285338167315]
This paper presents a method to learn a hand-object interaction prior for reconstructing a 3D hand-object scene from a single RGB image.
We use the hand shape to constrain the possible relative configuration of the hand and object geometry.
We show that HandNeRF is able to reconstruct hand-object scenes of novel grasp configurations more accurately than comparable methods.
arXiv Detail & Related papers (2023-09-14T17:42:08Z)
- Recovering 3D Hand Mesh Sequence from a Single Blurry Image: A New Dataset and Temporal Unfolding [54.49373038369293]
We first present BlurHand, a novel dataset containing blurry hand images with 3D ground truths.
BlurHand is constructed by synthesizing motion blur from sequential sharp hand images, imitating realistic and natural motion blur.
In addition to the new dataset, we propose BlurHandNet, a baseline network for accurate 3D hand mesh recovery from a blurry hand image.
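A minimal sketch of one standard way to synthesize such motion blur, assuming simple temporal averaging of consecutive sharp frames (BlurHand's exact synthesis procedure may differ):
```python
# Illustrative sketch: approximate motion blur by averaging a short run of
# consecutive sharp frames from a video of a moving hand.
import numpy as np

def synthesize_motion_blur(frames):
    """frames: list of HxWx3 uint8 sharp images from consecutive time steps."""
    stack = np.stack([f.astype(np.float32) for f in frames], axis=0)
    return stack.mean(axis=0).round().astype(np.uint8)
```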
arXiv Detail & Related papers (2023-03-27T17:40:29Z)
- HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network [57.206129938611454]
We propose HandOccNet, a novel 3D hand mesh estimation network.
By injecting hand information into the occluded region, HandOccNet reaches state-of-the-art performance on 3D hand mesh benchmarks.
arXiv Detail & Related papers (2022-03-28T08:12:16Z)
- Consistent 3D Hand Reconstruction in Video via self-supervised Learning [67.55449194046996]
We present a method for reconstructing accurate and consistent 3D hands from a monocular video.
Detected 2D hand keypoints and the image texture provide important cues about the geometry and texture of the 3D hand.
We propose S2HAND, a self-supervised 3D hand reconstruction model.
arXiv Detail & Related papers (2022-01-24T09:44:11Z)
- Model-based 3D Hand Reconstruction via Self-Supervised Learning [72.0817813032385]
Reconstructing a 3D hand from a single-view RGB image is challenging due to various hand configurations and depth ambiguity.
We propose S2HAND, a self-supervised 3D hand reconstruction network that can jointly estimate pose, shape, texture, and the camera viewpoint.
For the first time, we demonstrate the feasibility of training an accurate 3D hand reconstruction network without relying on manual annotations.
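Self-supervision of this kind typically hinges on a 2D reprojection loss: the predicted 3D joints are projected with the estimated camera and penalized against detected 2D keypoints. The sketch below illustrates that idea under a weak-perspective camera assumption; it is not S2HAND's exact loss.
```python
# Illustrative sketch of a 2D-keypoint reprojection loss for self-supervised
# hand reconstruction (weak-perspective camera assumed).
import torch

def reprojection_loss(joints_3d, cam, keypoints_2d, conf):
    """
    joints_3d:    (B, J, 3) predicted 3D hand joints
    cam:          (B, 3)    weak-perspective camera (scale s, translation tx, ty)
    keypoints_2d: (B, J, 2) detected 2D keypoints (e.g., from an off-the-shelf detector)
    conf:         (B, J)    per-keypoint detection confidence
    """
    s = cam[:, :1].unsqueeze(-1)        # (B, 1, 1)
    t = cam[:, 1:].unsqueeze(1)         # (B, 1, 2)
    proj = s * joints_3d[..., :2] + t   # weak-perspective projection to 2D
    err = ((proj - keypoints_2d) ** 2).sum(dim=-1)  # squared pixel error, (B, J)
    return (conf * err).mean()          # confidence-weighted mean loss
```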
arXiv Detail & Related papers (2021-03-22T10:12:43Z)
- MM-Hand: 3D-Aware Multi-Modal Guided Hand Generative Network for 3D Hand Pose Synthesis [81.40640219844197]
Estimating the 3D hand pose from a monocular RGB image is important but challenging.
One solution is to train on large-scale RGB hand image datasets with accurate 3D hand keypoint annotations.
We have developed a learning-based approach to synthesize realistic, diverse, and 3D pose-preserving hand images.
arXiv Detail & Related papers (2020-10-02T18:27:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.