LightFair: Towards an Efficient Alternative for Fair T2I Diffusion via Debiasing Pre-trained Text Encoders
- URL: http://arxiv.org/abs/2509.23639v1
- Date: Sun, 28 Sep 2025 04:46:39 GMT
- Title: LightFair: Towards an Efficient Alternative for Fair T2I Diffusion via Debiasing Pre-trained Text Encoders
- Authors: Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Kangli Zi, Qingming Huang,
- Abstract summary: This paper explores a novel lightweight approach to achieve fair text-to-image diffusion models (T2I DMs) by addressing the adverse effects of the text encoder.<n>Since T2I DMs consist of multiple components, with the text encoder being the most fine-tunable and front-end module, this paper focuses on mitigating bias by fine-tuning text embeddings.<n>Our method achieves SOTA debiasing at just $1/4$ of the training burden, with virtually no increase in sampling burden.
- Score: 84.39846443122853
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores a novel lightweight approach LightFair to achieve fair text-to-image diffusion models (T2I DMs) by addressing the adverse effects of the text encoder. Most existing methods either couple different parts of the diffusion model for full-parameter training or rely on auxiliary networks for correction. They incur heavy training or sampling burden and unsatisfactory performance. Since T2I DMs consist of multiple components, with the text encoder being the most fine-tunable and front-end module, this paper focuses on mitigating bias by fine-tuning text embeddings. To validate feasibility, we observe that the text encoder's neutral embedding output shows substantial skewness across image embeddings of various attributes in the CLIP space. More importantly, the noise prediction network further amplifies this imbalance. To finetune the text embedding, we propose a collaborative distance-constrained debiasing strategy that balances embedding distances to improve fairness without auxiliary references. However, mitigating bias can compromise the original generation quality. To address this, we introduce a two-stage text-guided sampling strategy to limit when the debiased text encoder intervenes. Extensive experiments demonstrate that LightFair is effective and efficient. Notably, on Stable Diffusion v1.5, our method achieves SOTA debiasing at just $1/4$ of the training burden, with virtually no increase in sampling burden. The code is available at https://github.com/boyuh/LightFair.
Related papers
- InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models [27.206678799411645]
InfSplign is a training-free inference-time method for text-to-image models.<n>It improves spatial alignment by adjusting the noise through a compound loss in every denoising step.<n>It achieves substantial performance gains over the strongest existing inference-time baselines.
arXiv Detail & Related papers (2025-12-19T17:52:43Z) - Flexible-length Text Infilling for Discrete Diffusion Models [0.8595835526753521]
We introduce textbfDDOT (textbfDiscrete textbfDiffusion with textbfOptimal textbfTransport Position Coupling), the first discrete diffusion model to overcome this challenge.<n>DDOT jointly denoises token values and token positions, employing a novel sample-level Optimal Transport (OT) coupling.<n>Experiments on text infilling benchmarks such as One-Billion-Word and Yelp demonstrate that DDOT outperforms naive diffusion baselines.
arXiv Detail & Related papers (2025-06-16T15:02:12Z) - Aligning Text to Image in Diffusion Models is Easier Than You Think [47.623236425067326]
SoftREPA is a lightweight contrastive fine-tuning strategy that leverages soft text tokens for representation alignment.<n>Our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency.
arXiv Detail & Related papers (2025-03-11T10:14:22Z) - UDiffText: A Unified Framework for High-quality Text Synthesis in
Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z) - Enhancing Diffusion Models with Text-Encoder Reinforcement Learning [63.41513909279474]
Text-to-image diffusion models are typically trained to optimize the log-likelihood objective.
Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation.
We demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results.
arXiv Detail & Related papers (2023-11-27T09:39:45Z) - Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be faster adapted to downstream visual perception tasks using the proposed VPD.
arXiv Detail & Related papers (2023-03-03T18:59:47Z) - FairFil: Contrastive Neural Debiasing Method for Pretrained Text
Encoders [68.8687509471322]
We propose the first neural debiasing method for a pretrained sentence encoder, which transforms the pretrained encoder outputs into debiased representations via a fair filter network.
On real-world datasets, our FairFil effectively reduces the bias degree of pretrained text encoders, while continuously showing desirable performance on downstream tasks.
arXiv Detail & Related papers (2021-03-11T02:01:14Z) - ContourNet: Taking a Further Step toward Accurate Arbitrary-shaped Scene
Text Detection [147.10751375922035]
We propose the ContourNet, which effectively handles false positives and large scale variance of scene texts.
Our method effectively suppresses these false positives by only outputting predictions with high response value in both directions.
arXiv Detail & Related papers (2020-04-10T08:15:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.