FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models
- URL: http://arxiv.org/abs/2507.02714v1
- Date: Thu, 03 Jul 2025 15:27:45 GMT
- Title: FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models
- Authors: Yuxuan Wang, Tianwei Cao, Huayu Zhang, Zhongjiang He, Kongming Liang, Zhanyu Ma
- Abstract summary: We propose a multi-objective fine-tuning approach designed to enhance both global and local generation quality fairly. We derive the optimal parameter updating strategy under the guidance of the Minimum Potential Delay (MPD) criterion. Our proposed method can achieve significant improvements in generating challenging local details while maintaining overall quality.
- Score: 21.03185704537153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image generation has achieved remarkable progress with the development of large-scale text-to-image models, especially diffusion-based models. However, generating human images with plausible details, such as faces or hands, remains challenging due to insufficient supervision of local regions during training. To address this issue, we propose FairHuman, a multi-objective fine-tuning approach designed to enhance both global and local generation quality fairly. Specifically, we first construct three learning objectives: a global objective derived from the default diffusion objective function and two local objectives for hands and faces based on pre-annotated positional priors. Subsequently, we derive the optimal parameter updating strategy under the guidance of the Minimum Potential Delay (MPD) criterion, thereby attaining fairness-aware optimization for this multi-objective problem. Based on this, our proposed method achieves significant improvements in generating challenging local details while maintaining overall quality. Extensive experiments showcase the effectiveness of our method in improving the performance of human image generation under different scenarios.
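To make the three-objective construction concrete, here is a minimal sketch (not the authors' released code) of a diffusion training loss that combines the default global denoising term with masked local terms for faces and hands derived from bounding-box positional priors. The fixed scalar weights stand in for the MPD-derived weighting, and all names and signatures are illustrative assumptions.

```python
# Minimal sketch: global + local (face, hand) diffusion objectives with placeholder weights.
# This is an assumption-laden illustration, not FairHuman's implementation; in the paper,
# the per-objective weights follow from the Minimum Potential Delay (MPD) criterion.
import torch


def region_mask(boxes: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """Rasterize per-sample (x1, y1, x2, y2) boxes into a binary [B, 1, H, W] mask."""
    mask = torch.zeros(boxes.shape[0], 1, height, width, device=boxes.device)
    for b, (x1, y1, x2, y2) in enumerate(boxes.long().tolist()):
        mask[b, :, y1:y2, x1:x2] = 1.0
    return mask


def fair_human_style_loss(eps_pred, eps_true, face_boxes, hand_boxes,
                          weights=(1.0, 1.0, 1.0)):
    """Combine the default diffusion objective with masked face/hand objectives."""
    _, _, h, w = eps_pred.shape
    sq_err = ((eps_pred - eps_true) ** 2).mean(dim=1, keepdim=True)  # per-pixel error

    global_loss = sq_err.mean()  # default denoising objective over the whole image

    face_mask = region_mask(face_boxes, h, w)
    hand_mask = region_mask(hand_boxes, h, w)
    face_loss = (sq_err * face_mask).sum() / face_mask.sum().clamp(min=1.0)
    hand_loss = (sq_err * hand_mask).sum() / hand_mask.sum().clamp(min=1.0)

    w_global, w_face, w_hand = weights  # placeholders for MPD-derived fairness weights
    return w_global * global_loss + w_face * face_loss + w_hand * hand_loss
```

In FairHuman the weights are not fixed constants as above; they follow from the MPD criterion, which balances the three objectives fairly rather than letting the global term dominate.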
Related papers
- Optimizing Human Pose Estimation Through Focused Human and Joint Regions [18.37601213802529]
Human pose estimation has given rise to a broad spectrum of novel and compelling applications, including action recognition, sports analysis, and surveillance. Previous methods learn motion clues from all pixels rather than focusing on the target human body, making them easily misled and disrupted by unimportant information such as background changes or movements of other people. We propose a bilayer Human-Keypoint Mask module that performs coarse-to-fine visual token refinement, gradually zooming in on the target human body and keypoints while masking out unimportant figure regions. Our method achieves state-of-the-art performance on three large-scale benchmarks.
arXiv Detail & Related papers (2025-01-24T12:17:47Z) - Reward Incremental Learning in Text-to-Image Generation [26.64026346266299]
We present Reward Incremental Distillation (RID), a method that mitigates forgetting with minimal computational overhead.
The experimental results demonstrate the efficacy of RID in achieving consistent, high-quality generation in RIL scenarios.
arXiv Detail & Related papers (2024-11-26T10:54:33Z) - MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts [61.274246025372044]
We study human-centric text-to-image generation in the context of faces and hands.
We propose a method called Mixture of Low-rank Experts (MoLE) by considering low-rank modules trained on close-up hand and face images respectively as experts.
This concept draws inspiration from our observation of low-rank refinement, where a low-rank module trained on a customized close-up dataset has the potential to enhance the corresponding image part when applied at an appropriate scale.
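As a rough illustration of the low-rank refinement observation above, the sketch below adds a LoRA-style low-rank update to a frozen linear layer and blends it in at an adjustable scale. It is not MoLE's code; the class, rank, and scale values are assumptions.

```python
# Illustrative sketch of a low-rank (LoRA-style) expert applied at an adjustable scale.
# Layer shapes, rank, and the `scale` knob are assumptions, not MoLE's implementation.
import torch
import torch.nn as nn


class LowRankExpert(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base                      # frozen pre-trained projection
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)        # start as an identity-preserving update
        self.scale = scale                    # how strongly the expert is blended in

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))


# Example: a hand expert and a face expert sharing one frozen base layer.
base = nn.Linear(320, 320)
hand_expert = LowRankExpert(base, rank=4, scale=0.8)
face_expert = LowRankExpert(base, rank=4, scale=0.8)
```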
arXiv Detail & Related papers (2024-10-30T17:59:57Z) - High Quality Human Image Animation using Regional Supervision and Motion Blur Condition [97.97432499053966]
First, we leverage regional supervision for detailed regions to enhance face and hand faithfulness.
Second, we model the motion blur explicitly to further improve the appearance quality.
Third, we explore novel training strategies for high-resolution human animation to improve the overall fidelity.
arXiv Detail & Related papers (2024-09-29T06:46:31Z) - Boost Your Human Image Generation Model via Direct Preference Optimization [5.9726297901501475]
Human image generation is a key focus in image synthesis due to its broad applications, but even slight inaccuracies in anatomy, pose, or details can compromise realism. We explore Direct Preference Optimization (DPO), which trains models to generate preferred (winning) images while diverging from non-preferred (losing) ones. We propose an enhanced DPO approach that incorporates high-quality real images as winning images, encouraging outputs to resemble real images rather than generated ones.
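For reference, the standard DPO objective this entry builds on can be sketched as below; the variable names are assumptions, and for diffusion models the log-probability terms are in practice approximated (e.g., via per-timestep denoising errors, as in Diffusion-DPO) rather than computed exactly.

```python
# Illustrative sketch of a DPO-style preference loss (not this paper's exact objective).
# The log-probabilities of winning/losing images under the trained and reference models
# are taken as given tensors here.
import torch
import torch.nn.functional as F


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta: float = 0.1):
    """Prefer 'winning' samples over 'losing' ones, relative to a frozen reference model."""
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()
```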
arXiv Detail & Related papers (2024-05-30T16:18:05Z) - Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models, and the preference prediction accuracy of VP-Score is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z) - Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance [48.986552871497]
We introduce a novel two-stage framework that employs scene affordance as an intermediate representation.
By leveraging scene affordance maps, our method overcomes the difficulty in generating human motion under multimodal condition signals.
Our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE.
arXiv Detail & Related papers (2024-03-26T18:41:07Z) - Multi-Human Mesh Recovery with Transformers [5.420974192779563]
We introduce a new model with a streamlined transformer-based design, featuring three critical design choices: multi-scale feature incorporation, focused attention mechanisms, and relative joint supervision.
Our proposed model demonstrates a significant performance improvement, surpassing state-of-the-art region-based and whole-image-based methods on various benchmarks involving multiple individuals.
arXiv Detail & Related papers (2024-02-26T18:28:05Z) - Progressive Multi-view Human Mesh Recovery with Self-Supervision [68.60019434498703]
Existing solutions typically suffer from poor generalization performance to new settings.
We propose a novel simulation-based training pipeline for multi-view human mesh recovery.
arXiv Detail & Related papers (2022-12-10T06:28:29Z) - A Generic Approach for Enhancing GANs by Regularized Latent Optimization [79.00740660219256]
We introduce a generic framework called generative-model inference that is capable of enhancing pre-trained GANs effectively and seamlessly.
Our basic idea is to efficiently infer the optimal latent distribution for the given requirements using Wasserstein gradient flow techniques.
arXiv Detail & Related papers (2021-12-07T05:22:50Z) - Optimization-Inspired Learning with Architecture Augmentations and Control Mechanisms for Low-Level Vision [74.9260745577362]
This paper proposes a unified optimization-inspired learning framework to aggregate Generative, Discriminative, and Corrective (GDC) principles.
We construct three propagative modules to effectively solve the optimization models with flexible combinations.
Experiments across varied low-level vision tasks validate the efficacy and adaptability of GDC.
arXiv Detail & Related papers (2020-12-10T03:24:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.