X-IQE: eXplainable Image Quality Evaluation for Text-to-Image Generation
with Visual Large Language Models
- URL: http://arxiv.org/abs/2305.10843v2
- Date: Fri, 26 May 2023 02:01:54 GMT
- Title: X-IQE: eXplainable Image Quality Evaluation for Text-to-Image Generation
with Visual Large Language Models
- Authors: Yixiong Chen, Li Liu, Chris Ding
- Abstract summary: This paper introduces a novel explainable image quality evaluation approach called X-IQE.
X-IQE uses visual large language models (LLMs) to evaluate text-to-image generation methods by generating textual explanations.
It offers several advantages, including the ability to distinguish between real and generated images, evaluate text-image alignment, and assess image aesthetics without requiring model training or fine-tuning.
- Score: 17.67105465600566
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces a novel explainable image quality evaluation approach
called X-IQE, which leverages visual large language models (LLMs) to evaluate
text-to-image generation methods by generating textual explanations. X-IQE
utilizes a hierarchical Chain of Thought (CoT) to enable MiniGPT-4 to produce
self-consistent, unbiased texts that are highly correlated with human
evaluation. It offers several advantages, including the ability to distinguish
between real and generated images, evaluate text-image alignment, and assess
image aesthetics without requiring model training or fine-tuning. X-IQE is more
cost-effective and efficient compared to human evaluation, while significantly
enhancing the transparency and explainability of deep image quality evaluation
models. We validate the effectiveness of our method as a benchmark using images
generated by prevalent diffusion models. X-IQE demonstrates similar performance
to state-of-the-art (SOTA) evaluation methods on COCO Caption, while overcoming
the limitations of previous evaluation models on DrawBench, particularly in
handling ambiguous generation prompts and text recognition in generated images.
Project website:
https://github.com/Schuture/Benchmarking-Awesome-Diffusion-Models
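To make the hierarchical Chain-of-Thought idea concrete, here is a minimal sketch of how such a staged evaluation could be wired around a MiniGPT-4-style visual LLM. The `vlm_chat` interface, the prompt wording, the score parsing, and the three-stage ordering are illustrative assumptions, not the paper's exact prompts or implementation.
```python
import re
from typing import Callable, Dict

# Hypothetical interface to a MiniGPT-4-style visual LLM:
# takes an image path and a text prompt, returns the model's textual reply.
VLMChat = Callable[[str, str], str]

# Three aspects from the abstract, asked in sequence so that each step can
# condition on the reasoning produced by the previous ones (hierarchical CoT).
# The prompt wording is an illustrative assumption.
ASPECTS = [
    ("fidelity", "Does this image look like a real photograph or a generated "
                 "image? Explain your reasoning, then rate fidelity from 1 to 10."),
    ("alignment", "The image was generated for the prompt: '{caption}'. How well "
                  "does it match the prompt? Explain, then rate alignment from 1 to 10."),
    ("aesthetics", "Considering composition, color, and detail, how aesthetically "
                   "pleasing is the image? Explain, then rate aesthetics from 1 to 10."),
]

def parse_score(reply: str) -> float:
    """Pull the last 1-10 integer out of the model's free-form explanation."""
    scores = re.findall(r"\b(10|[1-9])\b", reply)
    return float(scores[-1]) if scores else float("nan")

def x_iqe_style_eval(vlm_chat: VLMChat, image_path: str, caption: str) -> Dict[str, dict]:
    """Run the staged evaluation; returns a score and explanation per aspect."""
    context = ""  # reasoning from earlier stages is fed into later prompts
    results = {}
    for name, template in ASPECTS:
        prompt = context + template.format(caption=caption)
        reply = vlm_chat(image_path, prompt)
        results[name] = {"score": parse_score(reply), "explanation": reply}
        context += f"Earlier analysis ({name}): {reply}\n"
    return results
```
Because each stage sees the explanations produced earlier, the final aesthetics judgment is grounded in the fidelity and alignment reasoning, which is the self-consistency property the abstract highlights.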
Related papers
- TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models [39.06617653124486]
We introduce a new evaluation framework called TypeScore to assess a model's ability to generate images with high-fidelity embedded text.
Our proposed metric demonstrates greater resolution than CLIPScore in differentiating popular image generation models.
arXiv Detail & Related papers (2024-11-02T07:56:54Z)
- A Novel Evaluation Framework for Image2Text Generation [15.10524860121122]
We propose an evaluation framework rooted in a modern large language model (LLM) capable of image generation.
A high similarity score suggests that the image captioning model has accurately generated textual descriptions.
A low similarity score indicates discrepancies, revealing potential shortcomings in the model's performance (a minimal sketch of this regenerate-and-compare idea follows this entry).
arXiv Detail & Related papers (2024-08-03T09:27:57Z)
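One reading of the framework summarized above: regenerate an image from the caption under evaluation and measure how similar it is to the original image. The sketch below is a hedged illustration of that loop; the use of CLIP image embeddings for the similarity and the `generate_image` hook are assumptions, not the paper's actual components.
```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP as an assumed similarity backbone (not necessarily the
# paper's choice of model).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_similarity(original: Image.Image, regenerated: Image.Image) -> float:
    """Cosine similarity between CLIP embeddings of the two images."""
    inputs = processor(images=[original, regenerated], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float((feats[0] * feats[1]).sum())

def caption_quality(original: Image.Image, caption: str, generate_image) -> float:
    """`generate_image(caption) -> PIL.Image` is a hypothetical text-to-image hook."""
    return image_similarity(original, generate_image(caption))
```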
- Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality Assessment [57.07360640784803]
We propose vision-language consistency guided multi-modal prompt learning for blind AI-generated image quality assessment (AGIQA).
Specifically, we introduce learnable textual and visual prompts in language and vision branches of Contrastive Language-Image Pre-training (CLIP) models.
We design a text-to-image alignment quality prediction task, whose learned vision-language consistency knowledge is used to guide the optimization of the above multi-modal prompts.
arXiv Detail & Related papers (2024-06-24T13:45:31Z)
- GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation [103.3465421081531]
VQAScore is a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt (a minimal sketch follows this entry).
Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward.
We release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt.
arXiv Detail & Related papers (2024-06-19T18:00:07Z)
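A minimal sketch of the VQAScore idea from the GenAI-Bench entry above: ask a generative VQA model whether the image shows the prompt and use the probability of a "yes" answer as the score. The question template and the `yes_prob` interface are assumptions for illustration; the paper uses its own trained VQA model rather than the generic hook shown here.
```python
from typing import Callable, Iterable, List

# Hypothetical hook: returns P(answer == "yes") for a yes/no question about an
# image. Any generative VQA model exposing token probabilities could back it.
YesProbability = Callable[[str, str], float]  # (image_path, question) -> prob

def vqa_score(yes_prob: YesProbability, image_path: str, prompt: str) -> float:
    """Score how likely the VQA model is to agree the image depicts the prompt."""
    # The exact question template in the paper may differ; this is an assumption.
    question = f"Does this figure show '{prompt}'? Please answer yes or no."
    return yes_prob(image_path, question)

def rank_candidates(yes_prob: YesProbability,
                    image_paths: Iterable[str], prompt: str) -> List[str]:
    """Rank several generations of the same prompt, best first (GenAI-Rank style)."""
    return sorted(image_paths, key=lambda p: vqa_score(yes_prob, p, prompt), reverse=True)
```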
- GenzIQA: Generalized Image Quality Assessment using Prompt-Guided Latent Diffusion Models [7.291687946822539]
A major drawback of state-of-the-art NR-IQA methods is their limited ability to generalize across diverse IQA settings.
Recent text-to-image generative models generate meaningful visual concepts with fine details related to text concepts.
In this work, we leverage the denoising process of such diffusion models for generalized IQA by understanding the degree of alignment between learnable quality-aware text prompts and images.
arXiv Detail & Related papers (2024-06-07T05:46:39Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- IQAGPT: Image Quality Assessment with Vision-language and ChatGPT Models [23.99102775778499]
This paper introduces IQAGPT, an innovative image quality assessment system integrating an image quality captioning VLM with ChatGPT.
We build a CT-IQA dataset for training and evaluation, comprising 1,000 CT slices with diverse quality levels professionally annotated.
To better leverage the capabilities of LLMs, we convert annotated quality scores into semantically rich text descriptions using a prompt template (a sketch of such a template follows this entry).
arXiv Detail & Related papers (2023-12-25T09:13:18Z)
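A hedged sketch of the score-to-text conversion described in the IQAGPT entry above. The 1-5 scale, the bucket boundaries, and the wording are assumptions for illustration, not the paper's actual prompt template.
```python
def score_to_description(score: float, modality: str = "CT slice") -> str:
    """Map an annotated quality score (assumed 1-5 scale) to a richer text caption.

    Bucket boundaries and wording are illustrative assumptions, not the
    paper's actual prompt template.
    """
    if score >= 4.5:
        level, detail = "excellent", "sharp structures and no visible artifacts"
    elif score >= 3.5:
        level, detail = "good", "minor noise that does not affect interpretation"
    elif score >= 2.5:
        level, detail = "fair", "noticeable noise or blur in some regions"
    elif score >= 1.5:
        level, detail = "poor", "strong artifacts that obscure important details"
    else:
        level, detail = "bad", "severe degradation that makes the image hard to read"
    return f"The quality of this {modality} is {level}, with {detail}."

# Example: build a training caption for a slice annotated with score 3.8.
print(score_to_description(3.8))
```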
- Holistic Evaluation of Text-To-Image Models [153.47415461488097]
We introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM).
We identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency.
Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths.
arXiv Detail & Related papers (2023-11-07T19:00:56Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments (a minimal zero-shot sketch follows this entry).
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
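A minimal zero-shot sketch in the spirit of the CLIP look-and-feel entry above, using antonym prompt pairs and a softmax over CLIP image-text logits. The prompt pairs and the off-the-shelf Hugging Face CLIP checkpoint are assumptions; the paper's actual method includes further modifications not shown here.
```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Antonym prompt pairs: the first probes quality perception ("look"), the
# second abstract perception ("feel"). The exact wording is an assumption.
PROMPT_PAIRS = {
    "look": ("Good photo.", "Bad photo."),
    "feel": ("Happy photo.", "Sad photo."),
}

def zero_shot_scores(image: Image.Image) -> dict:
    """Softmax over each antonym pair yields a 0-1 score per perceptual axis."""
    scores = {}
    for axis, (pos, neg) in PROMPT_PAIRS.items():
        inputs = processor(text=[pos, neg], images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_image  # shape (1, 2)
        scores[axis] = logits.softmax(dim=-1)[0, 0].item()
    return scores

print(zero_shot_scores(Image.open("example.jpg")))
```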