MGHanD: Multi-modal Guidance for authentic Hand Diffusion
- URL: http://arxiv.org/abs/2503.08133v1
- Date: Tue, 11 Mar 2025 07:51:47 GMT
- Title: MGHanD: Multi-modal Guidance for authentic Hand Diffusion
- Authors: Taehyeon Eum, Jieun Choi, Tae-Kyun Kim
- Abstract summary: MGHanD addresses persistent challenges in generating realistic human hands. We employ a discriminator trained on a dataset comprising paired real and generated images with captions. We also employ textual guidance with a LoRA adapter, which learns the direction from `hands' towards more detailed prompts.
- Score: 25.887930576638293
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion-based methods have achieved significant success in text-to-image (T2I) generation, producing realistic images from text prompts. Despite their capabilities, these models face persistent challenges in generating realistic human hands, often producing images with incorrect finger counts and structurally deformed hands. MGHanD addresses this challenge by applying multi-modal guidance during the inference process. For visual guidance, we employ a discriminator trained on a dataset comprising paired real and generated images with captions, derived from various hand-in-the-wild datasets. We also employ textual guidance with a LoRA adapter, which learns the direction from `hands' towards more detailed prompts such as `natural hands' and `anatomically correct fingers' at the latent level. A cumulative hand mask, which is gradually enlarged over the assigned time steps, is applied to the added guidance, allowing the hands to be refined while maintaining the rich generative capabilities of the pre-trained model. In our experiments, the method achieves superior hand generation quality without any specific conditions or priors. We carry out quantitative and qualitative evaluations, along with user studies, to showcase the benefits of our approach in producing high-quality hand images.
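The guidance mechanism described in the abstract can be summarized in a short sketch. Below is a minimal, hypothetical PyTorch illustration of classifier-guidance-style sampling with a discriminator gradient restricted to a hand mask that grows across denoising steps; `unet`, `discriminator`, and all function names are placeholders, not the authors' released code.

```python
# Minimal sketch of masked guidance at inference time (an assumption-laden
# illustration, not MGHanD's released implementation).
import torch
import torch.nn.functional as F

def cumulative_mask(base_mask, step, total_steps, max_dilate=8):
    """Gradually enlarge a binary hand mask as denoising progresses."""
    k = 1 + 2 * int(max_dilate * step / total_steps)  # odd dilation kernel
    return F.max_pool2d(base_mask, k, stride=1, padding=k // 2)

def guided_noise(latent, t, unet, discriminator, hand_mask, scale=2.0):
    """Predict noise, then steer it with a discriminator gradient inside the mask."""
    latent = latent.detach().requires_grad_(True)
    eps = unet(latent, t)                       # base noise prediction
    realism = discriminator(latent, t).sum()    # scalar realism score
    grad = torch.autograd.grad(realism, latent)[0]
    # Apply the realism gradient only inside the (growing) hand mask, so the
    # pre-trained model's output elsewhere is left untouched.
    return (eps - scale * hand_mask * grad).detach()
```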
Related papers
- FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation [11.843140646170458]
We present FoundHand, a large-scale domain-specific diffusion model for single and dual hand images.
We use FoundHand-10M, a large-scale hand dataset with 2D keypoints and segmentation mask annotations.
Our model's core capabilities include reposing hands, transferring hand appearance, and even synthesizing novel views.
arXiv Detail & Related papers (2024-12-03T18:58:19Z)
- MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts [61.274246025372044]
We study human-centric text-to-image generation in the context of faces and hands.
We propose a method called Mixture of Low-rank Experts (MoLE), which treats low-rank modules trained on close-up hand and face images, respectively, as experts.
This concept draws inspiration from our observation of low-rank refinement: a low-rank module trained on a customized close-up dataset can enhance the corresponding image part when applied at an appropriate scale (a toy sketch of this idea follows below).
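As a loose illustration of the mixture idea, here is a toy PyTorch sketch in which each expert is a low-rank weight update gated by a soft weight; the class names, gating scheme, and low-rank parameterization are assumptions for illustration, not MoLE's actual architecture.

```python
# Toy mixture of low-rank experts over a base linear layer (hypothetical).
import torch

class LowRankExpert(torch.nn.Module):
    def __init__(self, dim, rank=4):
        super().__init__()
        self.A = torch.nn.Parameter(torch.randn(dim, rank) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(rank, dim))

    def delta(self):
        return self.A @ self.B  # low-rank weight update

class MoLELinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, experts, gate_dim):
        super().__init__()
        self.base, self.experts = base, torch.nn.ModuleList(experts)
        self.gate = torch.nn.Linear(gate_dim, len(experts))  # soft assignment

    def forward(self, x):                                    # x: (B, seq, dim)
        w = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)  # (B, n_experts)
        y = self.base(x)
        for i, e in enumerate(self.experts):
            # Each expert's low-rank update is applied at a learned soft scale.
            y = y + w[:, i, None, None] * (x @ e.delta())
        return y
```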
arXiv Detail & Related papers (2024-10-30T17:59:57Z)
- Hand1000: Generating Realistic Hands from Text with Only 1,000 Images [29.562925199318197]
We propose a novel approach named Hand1000 that enables the generation of realistic hand images with a target gesture.
The training of Hand1000 is divided into three stages, with the first stage aiming to enhance the model's understanding of hand anatomy.
We construct the first publicly available dataset specifically designed for text-to-hand image generation.
arXiv Detail & Related papers (2024-08-28T00:54:51Z)
- RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance [41.213241942526935]
RHanDS is a conditional diffusion-based framework designed to refine malformed hands.
The hand mesh reconstructed from the malformed hand offers structure guidance for correcting the hand's structure.
The malformed hand itself provides style guidance for preserving its style (see the sketch below).
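A schematic sketch of decoupled conditioning appears below; the encoders and the way the two signals would be injected into a diffusion denoiser are placeholders, not RHanDS's actual design.

```python
# Schematic sketch of decoupled structure/style conditioning (hypothetical).
import torch

class DecoupledGuidance(torch.nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.structure_enc = torch.nn.Conv2d(3, dim, 4, stride=4)  # mesh render
        self.style_enc = torch.nn.Conv2d(3, dim, 4, stride=4)      # hand crop

    def forward(self, mesh_render, hand_crop):
        structure = self.structure_enc(mesh_render)     # where the fingers go
        style = self.style_enc(hand_crop).mean((2, 3))  # global style vector
        # A real system would inject these into a diffusion U-Net, e.g. via
        # cross-attention or feature addition; here we just return them.
        return structure, style
```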
arXiv Detail & Related papers (2024-04-22T08:44:34Z)
- Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation [29.79050316749927]
We introduce a novel approach to pose-conditioned human image generation, dividing the process into two stages: hand generation and subsequent body outpainting around the hands.
A novel blending technique that combines the results of both stages in a coherent way is introduced to preserve hand details during the second stage (a toy blending sketch follows below).
Our approach not only enhances the quality of the generated hands but also offers improved control over hand pose, advancing the capabilities of pose-conditioned human image generation.
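As a rough illustration of stage blending, the NumPy/SciPy snippet below feather-blends a stage-1 hand image into a stage-2 body image; the function and its parameters are hypothetical stand-ins, not the paper's blending method.

```python
# Toy feathered compositing of two stage outputs (hypothetical).
# Images are float arrays of shape (H, W, 3); hand_mask is (H, W) in {0, 1}.
import numpy as np
from scipy.ndimage import gaussian_filter

def blend_stages(hand_img, body_img, hand_mask, feather=5.0):
    """Composite stage-1 hands over the stage-2 body with a softened mask."""
    # Soften the mask edge so the seam between the two stages is not visible.
    alpha = gaussian_filter(hand_mask.astype(np.float32), sigma=feather)
    alpha = np.clip(alpha, 0.0, 1.0)[..., None]   # (H, W, 1) for broadcasting
    return alpha * hand_img + (1.0 - alpha) * body_img
```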
arXiv Detail & Related papers (2024-03-15T23:31:41Z)
- HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances [34.50137847908887]
Text-to-image generative models can generate high-quality humans, but realism is lost when generating hands.
Common artifacts include irregular hand poses and shapes, incorrect finger counts, and physically implausible finger orientations.
We propose a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings in the generative process.
arXiv Detail & Related papers (2024-03-04T03:00:22Z)
- Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS).
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z)
- HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting [72.95232302438207]
Diffusion models have achieved remarkable success in generating realistic images.
But they struggle to generate accurate human hands, producing artifacts such as incorrect finger counts or irregular shapes.
This paper introduces a lightweight post-processing solution called HandRefiner (a loose illustration of the masked-inpainting idea follows below).
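The masked-regeneration idea can be illustrated with the public diffusers inpainting API; the model ID, file names, and prompt below are illustrative, and this generic call is not HandRefiner's actual conditional pipeline.

```python
# Generic masked inpainting over a hand region (illustrative only).
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("generated.png").convert("RGB")  # image with malformed hands
mask = Image.open("hand_mask.png").convert("L")     # white = region to redraw
result = pipe(prompt="a person with natural, anatomically correct hands",
              image=image, mask_image=mask).images[0]
result.save("refined.png")
```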
arXiv Detail & Related papers (2023-11-29T08:52:08Z)
- HandNeRF: Neural Radiance Fields for Animatable Interacting Hands [122.32855646927013]
We propose a novel framework to reconstruct accurate appearance and geometry with neural radiance fields (NeRF) for interacting hands.
We conduct extensive experiments to verify the merits of our proposed HandNeRF and report a series of state-of-the-art results.
arXiv Detail & Related papers (2023-03-24T06:19:19Z)
- Im2Hands: Learning Attentive Implicit Representation of Interacting Two-Hand Shapes [58.551154822792284]
Implicit Two Hands (Im2Hands) is the first neural implicit representation of two interacting hands.
Im2Hands can produce fine-grained geometry of two hands with high hand-to-hand and hand-to-image coherency.
We experimentally demonstrate the effectiveness of Im2Hands on two-hand reconstruction in comparison to related methods.
arXiv Detail & Related papers (2023-02-28T06:38:25Z)
- MM-Hand: 3D-Aware Multi-Modal Guided Hand Generative Network for 3D Hand Pose Synthesis [81.40640219844197]
Estimating the 3D hand pose from a monocular RGB image is important but challenging.
One solution is to train on large-scale RGB hand images with accurate 3D hand keypoint annotations.
We have developed a learning-based approach to synthesize realistic, diverse, and 3D pose-preserving hand images (a conceptual sketch follows below).
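Conceptually, pose-preserving synthesis can be sketched as conditioning a generator on keypoint heatmaps; everything below (the heatmap rendering and the tiny generator) is a simplified assumption, not MM-Hand's architecture.

```python
# Conceptual sketch of pose-conditioned hand synthesis (hypothetical).
import torch

def keypoints_to_heatmaps(kpts_2d, size=64, sigma=2.0):
    """Render (N, 2) pixel keypoints into (N, size, size) Gaussian heatmaps."""
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()              # (size, size, 2)
    d2 = ((grid[None] - kpts_2d[:, None, None, :]) ** 2).sum(-1)
    return torch.exp(-d2 / (2 * sigma ** 2))                  # (N, size, size)

class Generator(torch.nn.Module):
    """Toy image generator conditioned on stacked keypoint heatmaps."""
    def __init__(self, n_kpts=21):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(n_kpts, 64, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(64, 3, 3, padding=1), torch.nn.Tanh())

    def forward(self, heatmaps):                               # (B, n_kpts, H, W)
        return self.net(heatmaps)                              # (B, 3, H, W)
```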
arXiv Detail & Related papers (2020-10-02T18:27:34Z)