On the Cultural Gap in Text-to-Image Generation
- URL: http://arxiv.org/abs/2307.02971v1
- Date: Thu, 6 Jul 2023 13:17:55 GMT
- Title: On the Cultural Gap in Text-to-Image Generation
- Authors: Bingshuai Liu, Longyue Wang, Chenyang Lyu, Yong Zhang, Jinsong Su,
Shuming Shi, Zhaopeng Tu
- Abstract summary: One challenge in text-to-image (T2I) generation is the inadvertent reflection of culture gaps present in the training data.
There is no benchmark to systematically evaluate a T2I model's ability to generate cross-cultural images.
We propose a Challenging Cross-Cultural (C3) benchmark with comprehensive evaluation criteria, which can assess how well-suited a model is to a target culture.
- Score: 75.69755281031951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One challenge in text-to-image (T2I) generation is the inadvertent
reflection of culture gaps present in the training data: generated image
quality degrades when the cultural elements of the input text are rare in the
training set. Although various T2I models have shown impressive but anecdotal
examples, there is no benchmark to systematically
evaluate a T2I model's ability to generate cross-cultural images. To bridge the
gap, we propose a Challenging Cross-Cultural (C3) benchmark with comprehensive
evaluation criteria, which can assess how well-suited a model is to a target
culture. By analyzing the flawed images generated by the Stable Diffusion model
on the C3 benchmark, we find that the model often fails to generate certain
cultural objects. Accordingly, we propose a novel multi-modal metric that
considers object-text alignment to filter the fine-tuning data in the target
culture, which is used to fine-tune a T2I model to improve cross-cultural
generation. Experimental results show that our multi-modal metric selects
fine-tuning data more effectively on the C3 benchmark than existing metrics,
with object-text alignment proving crucial. We release the benchmark, data,
code, and generated images to facilitate future research on culturally diverse
T2I generation (https://github.com/longyuewangdcu/C3-Bench).
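The abstract names the key ingredient of the method, object-text alignment for filtering fine-tuning data, but not its implementation. Below is a minimal, hypothetical sketch of such a filter, assuming CLIP (openai/clip-vit-base-patch32) as the image-text scorer and spaCy noun chunks as stand-ins for cultural object mentions; the paper's actual metric may differ, and the linked repository is the authoritative reference.

```python
# Hypothetical sketch of an object-text alignment filter for fine-tuning data.
# Assumptions not taken from the paper: CLIP ViT-B/32 as the scorer and spaCy
# noun chunks as the "cultural objects" mentioned in a caption.
import spacy
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

nlp = spacy.load("en_core_web_sm")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_similarity(image: Image.Image, texts: list[str]) -> torch.Tensor:
    """Cosine similarity between one image and each candidate text."""
    inputs = processor(text=texts, images=image, return_tensors="pt",
                       padding=True, truncation=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(0)  # shape: (len(texts),)

def object_text_alignment(image: Image.Image, caption: str) -> float:
    """Global caption-image similarity, gated by the worst-aligned object.

    Penalizing the weakest object mention reflects the paper's observation
    that flawed generations tend to miss specific cultural objects.
    """
    objects = [c.text for c in nlp(caption).noun_chunks] or [caption]
    global_sim = clip_similarity(image, [caption]).item()
    weakest_object = clip_similarity(image, objects).min().item()
    return min(global_sim, weakest_object)

def filter_pairs(pairs, threshold=0.25):
    """Keep (image, caption) pairs whose mentioned objects all align with the image."""
    return [(im, cap) for im, cap in pairs
            if object_text_alignment(im, cap) >= threshold]
```

The threshold and the min-over-objects aggregation are illustrative choices, not details from the paper.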
Related papers
- Beyond Aesthetics: Cultural Competence in Text-to-Image Models [34.98692829036475]
CUBE is a first-of-its-kind benchmark for evaluating the cultural competence of Text-to-Image models.
CUBE covers cultural artifacts associated with 8 countries across different geo-cultural regions.
CUBE-CSpace is a larger dataset of cultural artifacts that serves as grounding to evaluate cultural diversity.
arXiv Detail & Related papers (2024-07-09T13:50:43Z)
- PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models [50.33699462106502]
Text-to-image (T2I) models frequently fail to produce images consistent with physical commonsense.
Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal knowledge.
We introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties.
arXiv Detail & Related papers (2024-06-17T17:49:01Z)
- Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation [5.55027585813848]
The capability to generate visual text is crucial: it is of academic interest and has a wide range of practical applications.
We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text.
We demonstrate notable improvements across a range of evaluation metrics, including CLIPScore (sketched after this list), OCR precision, recall, F1 score, accuracy, and edit distance.
arXiv Detail & Related papers (2024-03-25T04:54:49Z)
- Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts [107.32683485639654]
Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set.
One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages.
We find that this benchmark contains translation errors of varying severity in Spanish, Japanese, and Chinese.
arXiv Detail & Related papers (2024-03-17T05:05:11Z)
- SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z)
- SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation [15.02702600793921]
We propose a novel Self-Contrastive Fine-Tuning (SCoFT) method that leverages the model's known biases to self-improve.
SCoFT is designed to prevent overfitting on small datasets, encode only high-level information from the data, and shift the generated distribution away from misrepresentations encoded in a pretrained model.
arXiv Detail & Related papers (2024-01-16T02:10:13Z)
- Paragraph-to-Image Generation with Information-Enriched Diffusion Model [67.9265336953134]
ParaDiffusion is an information-enriched diffusion model for the paragraph-to-image generation task.
It transfers the extensive semantic comprehension capabilities of large language models to image generation.
The code and dataset will be released to foster community research on long-text alignment.
arXiv Detail & Related papers (2023-11-24T05:17:01Z)
- Navigating Cultural Chasms: Exploring and Unlocking the Cultural POV of Text-To-Image Models [32.99865895211158]
We explore the cultural perception embedded in Text-To-Image (TTI) models by characterizing culture across three tiers.
We propose a comprehensive suite of evaluation techniques, including intrinsic evaluations using the CLIP space.
To bolster our research, we introduce the CulText2I dataset, derived from six diverse TTI models and spanning ten languages.
arXiv Detail & Related papers (2023-10-03T10:13:36Z)
- Visual Programming for Text-to-Image Generation and Evaluation [73.12069620086311]
We propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation.
First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation.
Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming.
arXiv Detail & Related papers (2023-05-24T16:42:17Z)
- Towards Equitable Representation in Text-to-Image Synthesis Models with the Cross-Cultural Understanding Benchmark (CCUB) Dataset [8.006068032606182]
We propose a culturally-aware priming approach for text-to-image synthesis using a small but culturally curated dataset.
Our experiments indicate that priming using both text and image is effective in improving the cultural relevance and decreasing the offensiveness of generated images.
arXiv Detail & Related papers (2023-01-28T03:10:33Z)
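Several entries above score generations with CLIP-based alignment (LenCom-Eval reports CLIPScore; CulText2I uses intrinsic evaluations in the CLIP space). For reference, here is a minimal sketch of CLIPScore (Hessel et al., 2021), assuming the openai/clip-vit-base-patch32 checkpoint and the standard 2.5 rescaling; each benchmark's exact backbone and configuration may differ.

```python
# Hedged sketch of CLIPScore, the reference-free image-text metric named in
# the LenCom-Eval entry above. The backbone choice here is an assumption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    """CLIPScore = max(0, w * cos(image_emb, text_emb)), with w = 2.5."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    return max(0.0, 2.5 * torch.nn.functional.cosine_similarity(img, txt).item())
```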
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.