A Survey on Quality Metrics for Text-to-Image Generation
- URL: http://arxiv.org/abs/2403.11821v5
- Date: Wed, 29 Jan 2025 08:48:10 GMT
- Title: A Survey on Quality Metrics for Text-to-Image Generation
- Authors: Sebastian Hartwig, Dominik Engel, Leon Sick, Hannah Kniesel, Tristan Payer, Poonam Poonam, Michael Glöckler, Alex Bäuerle, Timo Ropinski
- Abstract summary: AI-based text-to-image models not only excel at generating realistic images, they also give designers increasingly fine-grained control over the image content. These approaches have gathered increased attention within the computer graphics research community. We provide a comprehensive overview of such text-to-image quality metrics, and propose a taxonomy to categorize these metrics.
- Score: 9.753473063305503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AI-based text-to-image models not only excel at generating realistic images, they also give designers increasingly fine-grained control over the image content. Consequently, these approaches have gathered increased attention within the computer graphics research community, which has historically been devoted to traditional rendering techniques that offer precise control over scene parameters (e.g., objects, materials, and lighting). While the quality of conventionally rendered images is assessed through well-established image quality metrics, such as SSIM or PSNR, the unique challenges of text-to-image generation require other, dedicated quality metrics. These metrics must be able to measure not only overall image quality, but also how well images reflect given text prompts, whereby the control of scene and rendering parameters is interwoven. Within this survey, we provide a comprehensive overview of such text-to-image quality metrics and propose a taxonomy to categorize them. Our taxonomy is grounded in the assumption that there are two main quality criteria, namely compositional quality and general quality, which contribute to the overall image quality. Besides the metrics, this survey covers dedicated text-to-image benchmark datasets, over which the metrics are frequently computed. Finally, we identify limitations and open challenges in the field of text-to-image generation, and derive guidelines for practitioners conducting text-to-image evaluation.
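For context, the two conventional full-reference metrics named above can be computed in a few lines. The following is a minimal sketch using scikit-image; the arrays here are synthetic placeholders rather than actual rendered images.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

rng = np.random.default_rng(0)
reference = rng.random((256, 256))  # stand-in for a rendered reference image
noise = 0.05 * rng.standard_normal(reference.shape)
distorted = np.clip(reference + noise, 0.0, 1.0)  # stand-in for a degraded copy

# Both metrics are full-reference: they need the ground-truth image,
# which text-to-image generation generally does not have.
ssim = structural_similarity(reference, distorted, data_range=1.0)
psnr = peak_signal_noise_ratio(reference, distorted, data_range=1.0)
print(f"SSIM: {ssim:.3f}, PSNR: {psnr:.2f} dB")
```

Both metrics presuppose a pixel-aligned reference image, which is exactly what text-to-image generation lacks; this gap motivates the dedicated metrics surveyed here.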
Related papers
- Language-Guided Visual Perception Disentanglement for Image Quality Assessment and Conditional Image Generation [48.642826318384294]
Contrastive vision-language models, such as CLIP, have demonstrated excellent zero-shot capability across semantic recognition tasks.
This paper presents a new multimodal disentangled representation learning framework, which leverages disentangled text to guide image disentanglement.
arXiv Detail & Related papers (2025-03-04T02:36:48Z)
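CLIP's joint text-image embedding space also underlies some of the most widely used alignment scores. The following is a hedged sketch of plain CLIP similarity scoring (in the spirit of CLIPScore, not the disentanglement framework above); the model name is a common public checkpoint, and the image path and prompt are hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")  # hypothetical path to a generated image
prompt = "a red car parked next to a blue bicycle"  # hypothetical prompt

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# The returned embeddings are already L2-normalized, so their dot
# product is the cosine similarity between prompt and image.
score = (outputs.image_embeds @ outputs.text_embeds.T).item()
print(f"CLIP text-image similarity: {score:.3f}")
```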
- Visual question answering based evaluation metrics for text-to-image generation [7.105786967332924]
This paper proposes new evaluation metrics that assess the alignment between input text and generated images for every individual object.
Experimental results show that our proposed approach is superior to existing metrics, simultaneously assessing fine-grained text-image alignment and image quality.
arXiv Detail & Related papers (2024-11-15T13:32:23Z)
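The general recipe behind such VQA-based metrics can be sketched as follows. This is an illustrative approximation, not the paper's exact pipeline: BLIP serves as a stand-in VQA model, and the object list is assumed to have been parsed from the prompt.

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("generated.png")    # hypothetical generated image
objects = ["red car", "blue bicycle"]  # hypothetically parsed from the prompt

scores = {}
for obj in objects:
    # One yes/no question per prompt object; the answer becomes a 0/1 score.
    question = f"Is there a {obj} in the image?"
    inputs = processor(image, question, return_tensors="pt")
    with torch.no_grad():
        answer_ids = model.generate(**inputs)
    answer = processor.decode(answer_ids[0], skip_special_tokens=True)
    scores[obj] = 1.0 if answer.strip().lower() == "yes" else 0.0

print(scores)
print(f"Mean per-object alignment: {sum(scores.values()) / len(scores):.2f}")
```

Binary answers are the simplest aggregation; one could also score the model's answer probabilities for a softer metric.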
- KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities [93.74881034001312]
We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
arXiv Detail & Related papers (2024-10-15T17:50:37Z)
- Rank-based No-reference Quality Assessment for Face Swapping [88.53827937914038]
In most face swapping methods, quality is measured via several distances between the manipulated images and the source image.
We present a novel no-reference image quality assessment (NR-IQA) method specifically designed for face swapping.
arXiv Detail & Related papers (2024-06-04T01:36:29Z)
- Information Theoretic Text-to-Image Alignment [49.396917351264655]
We present a novel method that relies on an information-theoretic alignment measure to steer image generation.
Our method is on par with or superior to the state of the art, yet requires nothing but a pre-trained denoising network to estimate mutual information (MI).
arXiv Detail & Related papers (2024-05-31T12:20:02Z)
- QUASAR: QUality and Aesthetics Scoring with Advanced Representations [20.194917729936357]
This paper introduces a new data-driven, non-parametric method for image quality and aesthetics assessment.
We eliminate the need for expressive textual embeddings by proposing efficient image anchors in the data.
arXiv Detail & Related papers (2024-03-11T16:21:50Z)
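A generic sketch of anchor-based, non-parametric scoring in this spirit is shown below; it is an assumption-laden illustration, not the paper's formulation. An image embedding is compared against anchor embeddings drawn from high- and low-quality image sets, replacing hand-written text prompts such as "a good photo".

```python
import torch
import torch.nn.functional as F

def anchor_quality_score(img_emb, high_anchors, low_anchors):
    """All inputs are L2-normalized embeddings; anchors have shape (N, D)."""
    sim_high = (img_emb @ high_anchors.T).mean()
    sim_low = (img_emb @ low_anchors.T).mean()
    # A softmax over the two mean similarities maps the result to (0, 1).
    return torch.softmax(torch.stack([sim_high, sim_low]), dim=0)[0].item()

# Toy usage with random vectors standing in for real (e.g., CLIP) features.
d = 512
img = F.normalize(torch.randn(1, d), dim=-1)
high = F.normalize(torch.randn(32, d), dim=-1)  # anchors from high-quality images
low = F.normalize(torch.randn(32, d), dim=-1)   # anchors from low-quality images
print(f"Quality score: {anchor_quality_score(img, high, low):.3f}")
```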
- Advancing Generative Model Evaluation: A Novel Algorithm for Realistic Image Synthesis and Comparison in OCR System [1.2289361708127877]
This research addresses a critical challenge in the field of generative models, particularly in the generation and evaluation of synthetic images.
We introduce a pioneering algorithm to objectively assess the realism of synthetic images.
Our algorithm is particularly tailored to address the challenges in generating and evaluating realistic images of Arabic handwritten digits.
arXiv Detail & Related papers (2024-02-27T04:53:53Z)
- TIER: Text-Image Encoder-based Regression for AIGC Image Quality Assessment [2.59079758388817]
In AI-generated content image quality assessment (AIGCIQA) tasks, images are typically generated by generative models using text prompts.
Most existing AIGCIQA methods regress predicted scores directly from individual generated images.
We propose a text-image encoder-based regression (TIER) framework to address this issue.
arXiv Detail & Related papers (2024-01-08T12:35:15Z)
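The core idea, regressing a quality score from a joint text-image representation, can be sketched as a small PyTorch head over precomputed embeddings. Dimensions, layer sizes, and the fusion-by-concatenation choice are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TIERHead(nn.Module):
    """Regress a scalar quality score from a (text, image) embedding pair."""

    def __init__(self, text_dim: int = 512, image_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, text_emb, image_emb):
        # Conditioning on the prompt lets the regressor judge the image
        # relative to the text it was generated from, not in isolation.
        return self.mlp(torch.cat([text_emb, image_emb], dim=-1)).squeeze(-1)

head = TIERHead()
scores = head(torch.randn(4, 512), torch.randn(4, 512))  # batch of 4 pairs
print(scores.shape)  # torch.Size([4])
```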
- Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that take a single image of an individual as input and ground the generation process in it, together with text describing the desired visual context.
We introduce a standardized dataset (Stellar) of personalized prompts coupled with images of individuals; it is an order of magnitude larger than existing relevant datasets and provides rich semantic ground-truth annotations.
We derive a simple yet efficient, personalized text-to-image baseline that does not require test-time fine-tuning for each subject and sets a new SoTA, both quantitatively and in human trials.
arXiv Detail & Related papers (2023-12-11T04:47:39Z)
- Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions are of substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)
- Image Quality Assessment in the Modern Age [53.19271326110551]
This tutorial provides the audience with the basic theories, methodologies, and current progress of image quality assessment (IQA).
We will first revisit several subjective quality assessment methodologies, with emphasis on how to properly select visual stimuli.
Both hand-engineered and (deep) learning-based methods will be covered.
arXiv Detail & Related papers (2021-10-19T02:38:46Z)
- Cross-Quality LFW: A Database for Analyzing Cross-Resolution Image Face Recognition in Unconstrained Environments [8.368543987898732]
Real-world face recognition applications often deal with suboptimal image quality or resolution due to different capturing conditions.
Recent cross-resolution face recognition approaches have used simple, arbitrary, and unrealistic down- and up-scaling techniques to measure robustness against real-world edge cases in image quality.
We propose a new standardized benchmark dataset and evaluation protocol derived from the well-known Labeled Faces in the Wild (LFW) dataset.
arXiv Detail & Related papers (2021-08-23T17:04:32Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when faced with semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
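The general idea behind such learned caption metrics, scoring semantic similarity in an embedding space rather than via n-gram overlap, can be illustrated as follows. Sentence-Transformers is used here as a stand-in encoder; it is not the I2CE model itself, and the captions are made up.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in sentence encoder

reference = "a man riding a horse on the beach"
candidates = [
    "a person rides a horse along the shore",  # semantically similar
    "a dog sleeping on a couch",               # unrelated
]

ref_emb = model.encode(reference, convert_to_tensor=True)
for cand in candidates:
    cand_emb = model.encode(cand, convert_to_tensor=True)
    # Cosine similarity rewards paraphrases that n-gram metrics would miss.
    print(f"{util.cos_sim(ref_emb, cand_emb).item():.3f}  {cand}")
```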
This list is automatically generated from the titles and abstracts of the papers on this site.