SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation
- URL: http://arxiv.org/abs/2401.08053v1
- Date: Tue, 16 Jan 2024 02:10:13 GMT
- Title: SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation
- Authors: Zhixuan Liu, Peter Schaldenbrand, Beverley-Claire Okogwu, Wenxuan
Peng, Youngsik Yun, Andrew Hundt, Jihie Kim, Jean Oh
- Abstract summary: We propose a novel Self-Contrastive Fine-Tuning (SCoFT) method that leverages the model's known biases to self-improve.
SCoFT is designed to prevent overfitting on small datasets, encode only high-level information from the data, and shift the generated distribution away from misrepresentations encoded in a pretrained model.
- Score: 15.02702600793921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate representation in media is known to improve the well-being of the
people who consume it. Generative image models trained on large web-crawled
datasets such as LAION are known to produce images with harmful stereotypes and
misrepresentations of cultures. We improve inclusive representation in
generated images by (1) engaging with communities to collect a culturally
representative dataset that we call the Cross-Cultural Understanding Benchmark
(CCUB) and (2) proposing a novel Self-Contrastive Fine-Tuning (SCoFT) method
that leverages the model's known biases to self-improve. SCoFT is designed to
prevent overfitting on small datasets, encode only high-level information from
the data, and shift the generated distribution away from misrepresentations
encoded in a pretrained model. Our user study, conducted with 51 participants from
5 different countries based on their self-selected national cultural
affiliation, shows that fine-tuning on CCUB consistently generates images with
higher cultural relevance and fewer stereotypes than the Stable Diffusion
baseline, and that our SCoFT technique further improves these results.
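A minimal PyTorch sketch of the self-contrastive idea described in the abstract, assuming a triplet-style objective in a perceptual feature space: generations from the model being fine-tuned are pulled toward curated CCUB images and pushed away from outputs of the frozen pretrained baseline. The encoder, loss form, and margin below are illustrative placeholders, not the paper's implementation.

```python
# Hypothetical sketch in the spirit of SCoFT: images from the *frozen* pretrained
# model act as negatives, curated CCUB images act as positives, and distances are
# measured in a high-level feature space rather than pixel space. Names and the
# exact loss form are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def self_contrastive_loss(
    feat_extractor: nn.Module,   # perceptual encoder (stand-in for a real backbone)
    generated: torch.Tensor,     # images from the model being fine-tuned
    positives: torch.Tensor,     # culturally curated reference images (CCUB)
    negatives: torch.Tensor,     # images from the frozen pretrained baseline
    margin: float = 0.5,
) -> torch.Tensor:
    """Pull generations toward curated data, push them away from the baseline."""
    with torch.no_grad():
        pos_feat = feat_extractor(positives)
        neg_feat = feat_extractor(negatives)
    gen_feat = feat_extractor(generated)

    d_pos = F.mse_loss(gen_feat, pos_feat)   # distance to curated data
    d_neg = F.mse_loss(gen_feat, neg_feat)   # distance to the biased baseline
    # Triplet-style objective: stay close to positives, keep a margin from negatives.
    return F.relu(d_pos - d_neg + margin)


if __name__ == "__main__":
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # toy stand-in
    gen = torch.rand(4, 3, 64, 64, requires_grad=True)
    ccub = torch.rand(4, 3, 64, 64)
    baseline = torch.rand(4, 3, 64, 64)
    loss = self_contrastive_loss(encoder, gen, ccub, baseline)
    loss.backward()
    print(float(loss))
```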
Related papers
- Removing Distributional Discrepancies in Captions Improves Image-Text Alignment [76.31530836622694]
We introduce a model designed to improve the prediction of image-text alignment.
Our approach focuses on generating high-quality training datasets for the alignment task.
We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment.
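As a hedged illustration of the ranking use case mentioned above, the sketch below reranks candidate generations with a pluggable alignment scorer; the scorer is a stand-in for whatever learned alignment predictor is available, not the paper's model.

```python
# Illustrative sketch: rank candidate images from a text-to-image model with an
# image-text alignment scorer supplied by the caller.
from typing import Callable, List, Tuple


def rank_by_alignment(
    prompt: str,
    images: List[object],
    alignment_score: Callable[[str, object], float],
) -> List[Tuple[float, object]]:
    """Return (score, image) pairs sorted from best- to worst-aligned."""
    scored = [(alignment_score(prompt, img), img) for img in images]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Toy scorer: word overlap between the prompt and a textual "image" placeholder.
    toy_score = lambda prompt, img: float(len(set(prompt.split()) & set(img.split())))
    candidates = ["a cat on a mat", "a dog", "a cat"]
    print(rank_by_alignment("a cat on a mat", candidates, toy_score))
```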
arXiv Detail & Related papers (2024-10-01T17:50:17Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions from unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
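A rough sketch of how such self-constructed preference data might be assembled, under the assumption that the preferred description comes from the intact image and the dispreferred one from a corrupted or misleadingly prompted input; the `describe` callable is a placeholder for the vision-language model, not STIC's actual pipeline.

```python
# Rough sketch of self-constructed preference pairs for image descriptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    image_id: str
    preferred: str      # description generated from the intact image
    dispreferred: str   # description generated from a corrupted/misleading input


def build_preference_data(
    image_ids: List[str],
    describe: Callable[[str, bool], str],
) -> List[PreferencePair]:
    pairs = []
    for image_id in image_ids:
        good = describe(image_id, False)   # normal prompt + image
        bad = describe(image_id, True)     # corrupted image or misleading prompt
        pairs.append(PreferencePair(image_id, good, bad))
    return pairs


if __name__ == "__main__":
    toy_describe = lambda img, corrupt: f"{'vague' if corrupt else 'detailed'} caption for {img}"
    print(build_preference_data(["img_001", "img_002"], toy_describe))
```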
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- FairRAG: Fair Human Generation via Fair Retrieval Augmentation [27.069276012884398]
We introduce Fair Retrieval Augmented Generation (FairRAG), a novel framework that conditions pre-trained generative models on reference images retrieved from an external image database to improve fairness in human generation.
To enhance fairness, FairRAG applies simple-yet-effective debiasing strategies, providing images from diverse demographic groups during the generative process.
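A minimal sketch of the debiasing step described above, assuming retrieved references carry demographic group labels and are sampled round-robin across groups instead of being taken purely by similarity; the group labels and index format are illustrative assumptions, not FairRAG's actual retrieval system.

```python
# Minimal sketch: pick reference images uniformly across demographic groups.
import random
from collections import defaultdict
from typing import Dict, List


def balanced_retrieval(
    candidates: List[Dict],      # each: {"path": str, "group": str, "similarity": float}
    k: int,
    rng: random.Random = random.Random(0),
) -> List[Dict]:
    """Pick k references, cycling over demographic groups in round-robin order."""
    by_group = defaultdict(list)
    for item in sorted(candidates, key=lambda c: c["similarity"], reverse=True):
        by_group[item["group"]].append(item)

    groups = sorted(by_group)
    rng.shuffle(groups)
    picked, i = [], 0
    while len(picked) < k and any(by_group.values()):
        group = groups[i % len(groups)]
        if by_group[group]:
            picked.append(by_group[group].pop(0))
        i += 1
    return picked


if __name__ == "__main__":
    pool = [{"path": f"img{i}", "group": g, "similarity": random.random()}
            for i, g in enumerate(["A", "A", "A", "B", "C", "B"])]
    print([r["group"] for r in balanced_retrieval(pool, k=3)])
```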
arXiv Detail & Related papers (2024-03-29T03:56:19Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy for bolstering image classification performance is to augment the training set with synthetic images generated by text-to-image (T2I) models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
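A short sketch of the inter-class augmentation idea, under the assumption that a translated image receives a soft label that linearly mixes the source and target classes; the image-to-image `translate` call and the linear label mix are illustrative placeholders, not the Diff-Mix implementation.

```python
# Illustrative sketch: translate an image toward another class and soften its label.
from typing import Callable

import torch


def diff_mix_sample(
    image: torch.Tensor,
    src_class: int,
    tgt_class: int,
    num_classes: int,
    strength: float,
    translate: Callable[[torch.Tensor, int, float], torch.Tensor],
):
    """Return (augmented image, soft label) for one inter-class translation."""
    mixed_image = translate(image, tgt_class, strength)
    label = torch.zeros(num_classes)
    label[src_class] = 1.0 - strength
    label[tgt_class] = strength
    return mixed_image, label


if __name__ == "__main__":
    identity_translate = lambda img, cls, s: img  # placeholder "translator"
    img = torch.rand(3, 32, 32)
    _, soft = diff_mix_sample(img, src_class=2, tgt_class=7, num_classes=10,
                              strength=0.3, translate=identity_translate)
    print(soft)
```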
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- On quantifying and improving realism of images generated with diffusion [50.37578424163951]
We propose a metric, called Image Realism Score (IRS), computed from five statistical measures of a given image.
IRS can be used directly as a measure to classify a given image as real or fake.
We experimentally establish the model- and data-agnostic nature of the proposed IRS by successfully detecting fake images generated by Stable Diffusion Model (SDM), Dalle2, Midjourney and BigGAN.
Our efforts have also led to the Gen-100 dataset, which provides 1,000 samples for 100 classes generated by four high-quality models.
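Since the abstract does not list the five statistical measures, the sketch below only illustrates the overall shape of an IRS-style score: compute several simple image statistics, combine them into one scalar, and threshold it. The specific measures, weighting, and threshold are placeholders, not the actual IRS.

```python
# Hypothetical IRS-style score: combine a few image statistics and threshold.
import numpy as np


def realism_score(image: np.ndarray) -> float:
    """image: float array in [0, 1], shape (H, W, 3). Higher = more 'real-looking'."""
    gray = image.mean(axis=2)
    gx, gy = np.gradient(gray)

    measures = [
        gray.std(),                                # global contrast
        np.abs(gx).mean() + np.abs(gy).mean(),     # edge/gradient energy
        np.histogram(gray, bins=32, range=(0, 1))[0].std() / gray.size,  # histogram spread
        image.std(axis=(0, 1)).mean(),             # per-channel variation
        1.0 - np.abs(gray - gray.mean()).mean(),   # deviation from flatness
    ]
    return float(np.mean(measures))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((64, 64, 3))
    score = realism_score(img)
    print("IRS-like score:", score, "-> real" if score > 0.5 else "-> fake")
```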
arXiv Detail & Related papers (2023-09-26T08:32:55Z)
- On the Cultural Gap in Text-to-Image Generation [75.69755281031951]
One challenge in text-to-image (T2I) generation is the inadvertent reflection of culture gaps present in the training data.
There is no benchmark to systematically evaluate a T2I model's ability to generate cross-cultural images.
We propose a Challenging Cross-Cultural (C3) benchmark with comprehensive evaluation criteria, which can assess how well-suited a model is to a target culture.
arXiv Detail & Related papers (2023-07-06T13:17:55Z)
- Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness [15.059419033330126]
We present a novel strategy, called Fair Diffusion, to attenuate biases after the deployment of generative text-to-image models.
Specifically, we demonstrate shifting a bias, based on human instructions, in any direction, yielding arbitrary new proportions for, e.g., identity groups.
This control enables instructing generative image models on fairness, with no data filtering or additional training required.
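A sketch of how such deployment-time control might look, assuming each generation is paired with an identity instruction drawn so that outputs match user-specified target proportions; the guided-generation call is a placeholder for however the model consumes the instruction, not the paper's mechanism.

```python
# Sketch: draw per-sample identity instructions to hit target proportions.
import random
from typing import Callable, Dict


def generate_with_target_proportions(
    prompt: str,
    proportions: Dict[str, float],     # e.g. {"female": 0.5, "male": 0.5}
    n_images: int,
    guided_generate: Callable[[str, str], object],
    rng: random.Random = random.Random(0),
):
    """Draw an identity instruction per sample according to the target proportions."""
    groups = list(proportions)
    weights = [proportions[g] for g in groups]
    outputs = []
    for _ in range(n_images):
        instruction = rng.choices(groups, weights=weights, k=1)[0]
        outputs.append((instruction, guided_generate(prompt, instruction)))
    return outputs


if __name__ == "__main__":
    fake_generate = lambda prompt, edit: f"<image of {prompt}, edited toward '{edit}'>"
    for edit, img in generate_with_target_proportions(
            "a photo of a firefighter", {"female": 0.5, "male": 0.5}, 4, fake_generate):
        print(edit, img)
```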
arXiv Detail & Related papers (2023-02-07T18:25:28Z)
- Towards Equitable Representation in Text-to-Image Synthesis Models with the Cross-Cultural Understanding Benchmark (CCUB) Dataset [8.006068032606182]
We propose a culturally-aware priming approach for text-to-image synthesis using a small but culturally curated dataset.
Our experiments indicate that priming using both text and image is effective in improving the cultural relevance and decreasing the offensiveness of generated images.
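As a rough illustration of text priming with a small curated dataset, the sketch below augments a user prompt with a culture tag and a matching curated caption before it is passed to the text-to-image model; the keyword-matching retrieval and the phrasing of the combined prompt are assumptions, and image priming is not shown.

```python
# Rough sketch of prompt priming with culturally curated captions.
from typing import Dict, List


def prime_prompt(prompt: str, culture: str, curated_captions: Dict[str, List[str]]) -> str:
    """Append a culture tag and a curated caption that shares a keyword with the prompt."""
    candidates = curated_captions.get(culture, [])
    keywords = set(prompt.lower().split())
    match = next((c for c in candidates if keywords & set(c.lower().split())), None)
    primed = f"{prompt}, in {culture}"
    if match:
        primed += f", {match}"
    return primed


if __name__ == "__main__":
    captions = {"Korea": ["a traditional hanok house with a tiled roof",
                          "people sharing banchan at dinner"]}
    print(prime_prompt("a photo of a house", "Korea", captions))
```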
arXiv Detail & Related papers (2023-01-28T03:10:33Z)
- CAGAN: Text-To-Image Generation with Combined Attention GANs [70.3497683558609]
We propose the Combined Attention Generative Adversarial Network (CAGAN) to generate photo-realistic images according to textual descriptions.
The proposed CAGAN uses two attention models: word attention to draw different sub-regions conditioned on related words; and squeeze-and-excitation attention to capture non-linear interaction among channels.
With spectral normalisation to stabilise training, our proposed CAGAN improves the state of the art in IS and FID on the CUB dataset and in FID on the more challenging COCO dataset.
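A minimal PyTorch sketch of the squeeze-and-excitation style channel attention the abstract mentions: global-average-pool the feature map, pass it through a small bottleneck MLP, and rescale channels. The reduction ratio and layer sizes are conventional choices, not taken from the paper.

```python
# Minimal squeeze-and-excitation channel attention block.
import torch
import torch.nn as nn


class SqueezeExcite(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))      # squeeze: (B, C)
        return x * weights.view(b, c, 1, 1)        # excite: channel-wise rescale


if __name__ == "__main__":
    block = SqueezeExcite(channels=64)
    feats = torch.rand(2, 64, 16, 16)
    print(block(feats).shape)  # torch.Size([2, 64, 16, 16])
```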
arXiv Detail & Related papers (2021-04-26T15:46:40Z)