Towards Equitable Representation in Text-to-Image Synthesis Models with
the Cross-Cultural Understanding Benchmark (CCUB) Dataset
- URL: http://arxiv.org/abs/2301.12073v2
- Date: Wed, 26 Apr 2023 15:41:05 GMT
- Title: Towards Equitable Representation in Text-to-Image Synthesis Models with
the Cross-Cultural Understanding Benchmark (CCUB) Dataset
- Authors: Zhixuan Liu, Youeun Shin, Beverley-Claire Okogwu, Youngsik Yun, Lia
Coleman, Peter Schaldenbrand, Jihie Kim, Jean Oh
- Abstract summary: We propose a culturally-aware priming approach for text-to-image synthesis using a small but culturally curated dataset.
Our experiments indicate that priming using both text and image is effective in improving the cultural relevance and decreasing the offensiveness of generated images.
- Score: 8.006068032606182
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: It has been shown that accurate representation in media improves the
well-being of the people who consume it. By contrast, inaccurate
representations can negatively affect viewers and lead to harmful perceptions
of other cultures. To achieve inclusive representation in generated images, we
propose a culturally-aware priming approach for text-to-image synthesis using a
small but culturally curated dataset that we collected, known here as the
Cross-Cultural Understanding Benchmark (CCUB) dataset, to counter the bias
prevalent in giant datasets. Our proposed approach comprises two fine-tuning
techniques: (1) adding visual context by fine-tuning a pre-trained
text-to-image synthesis model, Stable Diffusion, on the CCUB text-image pairs,
and (2) adding semantic context via automated prompt engineering using a large
language model, GPT-3, fine-tuned on our CCUB culturally-aware text data. The
CCUB dataset is curated, and our approach evaluated, by people who have a
personal relationship with the particular culture. Our experiments
indicate that priming using both text and image is effective in improving the
cultural relevance and decreasing the offensiveness of generated images while
maintaining quality.
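As a rough illustration of the two priming steps described in the abstract, the sketch below fine-tunes the Stable Diffusion UNet on a handful of curated text-image pairs and then rewrites user prompts with a language model. It assumes the Hugging Face diffusers/transformers APIs; the checkpoint id, hyperparameters, prompt template, and the `rewrite` helper are illustrative placeholders, not the authors' released code.

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "CompVis/stable-diffusion-v1-4"  # illustrative checkpoint choice

# (1) Visual context: load the pipeline and pull out its components; only the
# UNet is fine-tuned on the curated pairs, the VAE and text encoder stay frozen.
pipe = StableDiffusionPipeline.from_pretrained(model_id).to(device)
unet, vae, text_encoder, tokenizer = pipe.unet, pipe.vae, pipe.text_encoder, pipe.tokenizer
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

def training_step(pixel_values, captions):
    """One denoising-objective step on a batch of curated (image, caption) pairs."""
    with torch.no_grad():
        # Encode images into the VAE latent space and captions with the frozen text encoder.
        latents = vae.encode(pixel_values.to(device)).latent_dist.sample()
        latents = latents * vae.config.scaling_factor
        tokens = tokenizer(list(captions), padding="max_length", truncation=True,
                           max_length=tokenizer.model_max_length, return_tensors="pt").to(device)
        text_emb = text_encoder(tokens.input_ids)[0]
    # Standard noise-prediction objective on the small curated set.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# (2) Semantic context: rewrite a plain user prompt with a language model adapted
# to the curated cultural captions (GPT-3 in the paper; `rewrite` is a placeholder
# for whatever text-generation call is available).
def prime_prompt(user_prompt, culture, rewrite):
    return rewrite(f"Rewrite this prompt with culturally specific details about {culture}: {user_prompt}")
```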
Related papers
- SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation [15.02702600793921]
We propose a novel Self-Contrastive Fine-Tuning (SCoFT) method that leverages the model's known biases to self-improve.
SCoFT is designed to prevent overfitting on small datasets, encode only high-level information from the data, and shift the generated distribution away from misrepresentations encoded in a pretrained model.
arXiv Detail & Related papers (2024-01-16T02:10:13Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information in unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency [47.3163261953469]
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities.
We introduce ITIT: a training paradigm grounded in the concept of cycle consistency, which allows vision-language training on unpaired image and text data.
ITIT comprises a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework.
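A minimal, abstract sketch of how such a cycle-consistency objective over unpaired data could look, based only on the description above; `encode`, the decoders, and the distance functions are placeholders, not ITIT's actual components.

```python
def cycle_consistency_losses(unpaired_image, unpaired_text,
                             encode, decode_text, decode_image,
                             image_dist, text_dist):
    # Image -> text -> image: caption an unpaired image, regenerate it, and
    # penalize the reconstruction error against the original image.
    pseudo_caption = decode_text(encode(image=unpaired_image))
    recon_image = decode_image(encode(text=pseudo_caption))
    loss_i2t2i = image_dist(recon_image, unpaired_image)
    # Text -> image -> text: synthesize an image for unpaired text, re-caption
    # it, and penalize the mismatch with the original text.
    pseudo_image = decode_image(encode(text=unpaired_text))
    recon_text = decode_text(encode(image=pseudo_image))
    loss_t2i2t = text_dist(recon_text, unpaired_text)
    return loss_i2t2i + loss_t2i2t
```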
arXiv Detail & Related papers (2023-10-05T17:55:19Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- PLIP: Language-Image Pre-training for Person Representation Learning [51.348303233290025]
We propose a novel language-image pre-training framework for person representation learning, termed PLIP.
To implement our framework, we construct a large-scale person dataset with image-text pairs named SYNTH-PEDES.
PLIP not only significantly improves existing methods on all these tasks, but also shows great ability in the zero-shot and domain generalization settings.
arXiv Detail & Related papers (2023-05-15T06:49:00Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- Improving Text-to-Image Synthesis Using Contrastive Learning [4.850820365312369]
We propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images.
We evaluate our approach on two popular text-to-image synthesis models, AttnGAN and DM-GAN, using the CUB and COCO datasets.
arXiv Detail & Related papers (2021-07-06T06:43:31Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
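A minimal sketch of the dual-encoder contrastive objective described above, i.e. a symmetric cross-entropy over in-batch image-text similarities; the encoders that produce the embeddings and the temperature value are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def dual_encoder_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize both sets of embeddings and compare every image against
    # every text in the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    # Matching alt-text sits on the diagonal; all other pairs act as negatives.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```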
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches rely only on individual image-text instance pairs to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.