Fair Text-to-Image Diffusion via Fair Mapping
- URL: http://arxiv.org/abs/2311.17695v2
- Date: Wed, 6 Mar 2024 11:13:43 GMT
- Title: Fair Text-to-Image Diffusion via Fair Mapping
- Authors: Jia Li, Lijie Hu, Jingfeng Zhang, Tianhang Zheng, Hua Zhang, Di Wang
- Abstract summary: We propose a flexible, model-agnostic, and lightweight approach that modifies a pre-trained text-to-image diffusion model.
By effectively addressing the issue of implicit language bias, our method produces more fair and diverse image outputs.
- Score: 32.02815667307623
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we address the limitations of existing text-to-image diffusion
models in generating demographically fair results when given human-related
descriptions. These models often struggle to disentangle the target language
context from sociocultural biases, resulting in biased image generation. To
overcome this challenge, we propose Fair Mapping, a flexible, model-agnostic,
and lightweight approach that modifies a pre-trained text-to-image diffusion
model by controlling the prompt to achieve fair image generation. One key
advantage of our approach is its high efficiency. It only requires updating an
additional linear network with few parameters at a low computational cost. By
developing a linear network that maps conditioning embeddings into a debiased
space, we enable the generation of relatively balanced demographic results
based on the specified text condition. With comprehensive experiments on face
image generation, we show that our method significantly improves image
generation fairness with almost the same image quality compared to conventional
diffusion models when prompted with descriptions related to humans. By
effectively addressing the issue of implicit language bias, our method produces
more fair and diverse image outputs.
Related papers
- ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models [52.23899502520261]
We introduce a new framework named ARTIST to focus on the learning of text structures.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
arXiv Detail & Related papers (2024-06-17T19:31:24Z) - Improving face generation quality and prompt following with synthetic captions [57.47448046728439]
We introduce a training-free pipeline designed to generate accurate appearance descriptions from images of people.
We then use these synthetic captions to fine-tune a text-to-image diffusion model.
Our results demonstrate that this approach significantly improves the model's ability to generate high-quality, realistic human faces.
arXiv Detail & Related papers (2024-05-17T15:50:53Z) - FairRAG: Fair Human Generation via Fair Retrieval Augmentation [27.069276012884398]
We introduce Fair Retrieval Augmented Generation (FairRAG), a novel framework that conditions pre-trained generative models on reference images retrieved from an external image database to improve fairness in human generation.
To enhance fairness, FairRAG applies simple-yet-effective debiasing strategies, providing images from diverse demographic groups during the generative process.
arXiv Detail & Related papers (2024-03-29T03:56:19Z) - UDiffText: A Unified Framework for High-quality Text Synthesis in
Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z) - MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z) - DiffDis: Empowering Generative Diffusion Model with Cross-Modal
Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z) - Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z) - SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with
Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.