Holistic Evaluation of Text-To-Image Models
- URL: http://arxiv.org/abs/2311.04287v1
- Date: Tue, 7 Nov 2023 19:00:56 GMT
- Title: Holistic Evaluation of Text-To-Image Models
- Authors: Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park,
Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Benita Teufel, Marco
Bellagente, Minguk Kang, Taesung Park, Jure Leskovec, Jun-Yan Zhu, Li
Fei-Fei, Jiajun Wu, Stefano Ermon, Percy Liang
- Abstract summary: We introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM).
We identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency.
Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths.
- Score: 153.47415461488097
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The stunning qualitative improvement of recent text-to-image models has led
to their widespread attention and adoption. However, we lack a comprehensive
quantitative understanding of their capabilities and risks. To fill this gap,
we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models
(HEIM). Whereas previous evaluations focus mostly on text-image alignment and
image quality, we identify 12 aspects, including text-image alignment, image
quality, aesthetics, originality, reasoning, knowledge, bias, toxicity,
fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios
encompassing these aspects and evaluate 26 state-of-the-art text-to-image
models on this benchmark. Our results reveal that no single model excels in all
aspects, with different models demonstrating different strengths. We release
the generated images and human evaluation results for full transparency at
https://crfm.stanford.edu/heim/v1.1.0 and the code at
https://github.com/stanford-crfm/helm, which is integrated with the HELM
codebase.
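As a rough illustration of how such multi-aspect results can be read, the sketch below tabulates per-aspect leaders from a toy score table. The model names, aspect names, and numbers are invented for illustration; this is not the HELM/HEIM codebase API, whose actual results are published at the URLs above.

```python
# Hypothetical sketch (not the HELM/HEIM API): given per-model, per-aspect
# scores, find the best model for each aspect and check whether any single
# model leads across the board.

# Illustrative numbers only; real HEIM results live at the URLs above.
scores = {
    "model_a": {"alignment": 0.81, "quality": 0.62, "originality": 0.40},
    "model_b": {"alignment": 0.74, "quality": 0.71, "originality": 0.55},
    "model_c": {"alignment": 0.69, "quality": 0.58, "originality": 0.63},
}

leaders = {}
for model, per_aspect in scores.items():
    for aspect, value in per_aspect.items():
        best = leaders.get(aspect)
        if best is None or value > best[1]:
            leaders[aspect] = (model, value)

for aspect, (model, value) in sorted(leaders.items()):
    print(f"{aspect:12s} best: {model} ({value:.2f})")

# If more than one model appears among the per-aspect leaders, no single
# model excels in every aspect -- the pattern HEIM reports across its 12 aspects.
distinct_leaders = {model for model, _ in leaders.values()}
print("single overall winner" if len(distinct_leaders) == 1 else "no single overall winner")
```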
Related papers
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch with prompt-based techniques, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- IQAGPT: Image Quality Assessment with Vision-language and ChatGPT Models [23.99102775778499]
This paper introduces IQAGPT, an innovative image quality assessment system integrating an image quality captioning VLM with ChatGPT.
We build a CT-IQA dataset for training and evaluation, comprising 1,000 professionally annotated CT slices with diverse quality levels.
To better leverage the capabilities of LLMs, we convert annotated quality scores into semantically rich text descriptions using a prompt template.
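As a rough illustration of that idea (the exact IQAGPT prompt template is not reproduced here), a minimal sketch might map a numeric quality score onto a descriptive sentence; the score range, thresholds, and wording below are assumptions.

```python
# Hypothetical sketch of the general idea (not IQAGPT's actual template):
# map an annotated numeric quality score onto a descriptive sentence that an
# LLM can consume alongside other image information.
def score_to_description(score: float, max_score: float = 4.0) -> str:
    """Convert a quality score into a short, semantically rich description."""
    levels = [
        (0.25, "severely degraded, with artifacts that obscure diagnostic detail"),
        (0.50, "noticeably noisy, though major structures remain visible"),
        (0.75, "good overall, with only minor noise or blur"),
        (1.01, "excellent, with clean and well-defined structures"),
    ]
    ratio = score / max_score
    for threshold, phrase in levels:
        if ratio < threshold:
            return f"The image quality is {phrase} (score {score:.1f}/{max_score:.0f})."
    return f"The image quality is excellent (score {score:.1f}/{max_score:.0f})."

print(score_to_description(1.5))  # e.g. "The image quality is noticeably noisy..."
```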
arXiv Detail & Related papers (2023-12-25T09:13:18Z)
- Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment [48.835298314274254]
We propose to evaluate text-to-image generation performance by directly estimating the likelihood of the generated images.
A higher likelihood indicates better perceptual quality and better text-image alignment.
It can successfully assess the generation ability of text-to-image models with as few as a hundred samples.
arXiv Detail & Related papers (2023-08-16T17:26:47Z)
- X-IQE: eXplainable Image Quality Evaluation for Text-to-Image Generation with Visual Large Language Models [17.67105465600566]
This paper introduces a novel explainable image quality evaluation approach called X-IQE.
X-IQE uses visual large language models (LLMs) to evaluate text-to-image generation methods by generating textual explanations.
It offers several advantages, including the ability to distinguish between real and generated images, evaluate text-image alignment, and assess image aesthetics without requiring model training or fine-tuning.
arXiv Detail & Related papers (2023-05-18T09:56:44Z)
- TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering [86.38098280689027]
We introduce an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA).
We present a comprehensive evaluation of existing text-to-image models using a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.).
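As a rough illustration of the VQA-based idea (not the authors' implementation), the sketch below scores an image by the fraction of prompt-derived questions a VQA model answers as expected; the function name, signature, and stand-in VQA model are hypothetical.

```python
# Hypothetical sketch of VQA-based faithfulness scoring in the spirit of TIFA:
# given question/answer pairs derived from the text prompt, ask a VQA model
# about the generated image and report the fraction of answers that match.
from typing import Callable, List, Tuple

def faithfulness_score(
    image,                                    # generated image, in whatever form the VQA model accepts
    qa_pairs: List[Tuple[str, str]],          # (question, expected answer) pairs derived from the prompt
    vqa_model: Callable[[object, str], str],  # vqa_model(image, question) -> answer
) -> float:
    """Fraction of prompt-derived questions the VQA model answers as expected."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        vqa_model(image, q).strip().lower() == a.strip().lower()
        for q, a in qa_pairs
    )
    return correct / len(qa_pairs)

# Toy usage with a stand-in VQA model that always answers "yes".
dummy_vqa = lambda image, question: "yes"
qa = [("Is there a dog?", "yes"), ("How many dogs are there?", "two")]
print(faithfulness_score(None, qa, dummy_vqa))  # 0.5
```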
arXiv Detail & Related papers (2023-03-21T14:41:02Z)
- Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions are of substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.