KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities
- URL: http://arxiv.org/abs/2410.11824v1
- Date: Tue, 15 Oct 2024 17:50:37 GMT
- Title: KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities
- Authors: Hsin-Ping Huang, Xinyi Wang, Yonatan Bitton, Hagai Taitelbaum, Gaurav Singh Tomar, Ming-Wei Chang, Xuhui Jia, Kelvin C. K. Chan, Hexiang Hu, Yu-Chuan Su, Ming-Hsuan Yang
- Abstract summary: We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
- Score: 93.74881034001312
- Abstract: Recent advancements in text-to-image generation have significantly enhanced the quality of synthesized images. Despite this progress, evaluations predominantly focus on aesthetic appeal or alignment with text prompts. Consequently, there is limited understanding of whether these models can accurately represent a wide variety of realistic visual entities - a task requiring real-world knowledge. To address this gap, we propose a benchmark focused on evaluating Knowledge-InTensive image generaTion on real-world ENtities (i.e., KITTEN). Using KITTEN, we conduct a systematic study on the fidelity of entities in text-to-image generation models, focusing on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals. We evaluate the latest text-to-image models and retrieval-augmented customization models using both automatic metrics and carefully designed human evaluations, with an emphasis on the fidelity of entities in the generated images. Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details. Although retrieval-augmented models can enhance entity fidelity by incorporating reference images during testing, they often over-rely on these references and struggle to produce novel configurations of the entity as requested in creative text prompts.
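The abstract does not spell out the automatic metrics. Purely as an illustration, one plausible entity-fidelity check scores a generated image by its CLIP image-embedding similarity to reference photos of the entity; KITTEN's actual metrics may differ.
```python
# Hypothetical sketch: score entity fidelity as CLIP image-image similarity
# between a generated image and reference photos of the same entity.
# This is an illustration only, not the benchmark's actual metric.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def entity_fidelity(generated: Image.Image, references: list[Image.Image]) -> float:
    """Mean cosine similarity between the generated image and reference images."""
    inputs = processor(images=[generated] + references, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalise embeddings
    sims = emb[0] @ emb[1:].T                   # cosine similarity to each reference
    return sims.mean().item()
```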
Related papers
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present the ImageRepainter framework to enhance the quality of generated images.
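A minimal sketch of the regeneration loop described above, assuming the OpenAI chat API as the vision model and Stable Diffusion as a stand-in text-to-image model; the authors' exact prompts and models are not given here.
```python
# Illustrative regeneration loop (not the authors' code): a vision-language
# model describes the reference image, and a T2I model redraws it from that
# description. Model choices below are stand-ins, not the paper's setup.
import base64
from openai import OpenAI
from diffusers import StableDiffusionPipeline

client = OpenAI()  # assumes OPENAI_API_KEY is set
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def describe(image_path: str) -> str:
    """Ask a VLM for a text-to-image prompt describing the reference image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in for the GPT4V model named in the summary
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Write a text-to-image prompt that reproduces this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content

regenerated = pipe(describe("reference.png")).images[0]
# Evaluation would then compare `regenerated` against the reference image.
```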
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models [39.06617653124486]
We introduce a new evaluation framework called TypeScore to assess a model's ability to generate images with high-fidelity embedded text.
Our proposed metric demonstrates greater resolution than CLIPScore in differentiating popular image generation models.
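TypeScore's exact formulation is not reproduced here; as a crude, hypothetical stand-in for an embedded-text fidelity check, one can OCR the generated image and compare the result with the string the prompt requested:
```python
# Crude stand-in for an embedded-text fidelity check (not the TypeScore formula):
# OCR the generated image and compare against the text the prompt asked for.
from difflib import SequenceMatcher
from PIL import Image
import pytesseract  # requires the Tesseract binary to be installed

def text_fidelity(image_path: str, intended_text: str) -> float:
    """Similarity in [0, 1] between OCR output and the intended embedded text."""
    ocr_text = pytesseract.image_to_string(Image.open(image_path))
    return SequenceMatcher(None, ocr_text.strip().lower(),
                           intended_text.strip().lower()).ratio()
```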
arXiv Detail & Related papers (2024-11-02T07:56:54Z)
- Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models [20.19571676239579]
We introduce a novel diffusion-based framework to enhance the alignment of generated images with their corresponding descriptions.
Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image.
We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt.
arXiv Detail & Related papers (2024-06-24T06:12:16Z)
- Harnessing the Power of Large Vision Language Models for Synthetic Image Detection [14.448350657613364]
This study investigates the effectiveness of using advanced vision-language models (VLMs) for synthetic image identification.
By harnessing the robust understanding capabilities of large VLMs, the aim is to distinguish authentic images from synthetic images produced by diffusion-based models.
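The paper prompts large VLMs; as a far simpler illustration of the idea, a zero-shot contrastive baseline with CLIP compares an image against "authentic" and "synthetic" descriptions (not the authors' setup):
```python
# Minimal zero-shot baseline (not the paper's exact method): ask CLIP whether
# an image looks more like a real photograph or an AI-generated picture.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
labels = ["an authentic photograph", "an AI-generated synthetic image"]

def p_synthetic(image: Image.Image) -> float:
    """Probability mass CLIP assigns to the 'synthetic' label."""
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
    return probs[0, 1].item()
```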
arXiv Detail & Related papers (2024-04-03T13:27:54Z)
- A Survey on Quality Metrics for Text-to-Image Models [9.753473063305503]
We provide an overview of existing text-to-image quality metrics addressing their nuances and the need for alignment with human preferences.
We propose a new taxonomy for categorizing these metrics, which is grounded in the assumption that there are two main quality criteria, namely compositionality and generality.
We derive guidelines for practitioners conducting text-to-image evaluation, discuss open challenges of evaluation mechanisms, and surface limitations of current metrics.
arXiv Detail & Related papers (2024-03-18T14:24:20Z)
- RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model [93.8067369210696]
Text-to-image generation (TTI) refers to models that take text input and generate high-fidelity images from the descriptions.
Diffusion models are one prominent type of generative model; they produce images by learning to reverse a noising process that corrupts data over repeated steps.
In the era of large models, scaling up model size and integrating large language models have further improved the performance of TTI models.
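Concretely, the forward (noising) process the summary alludes to is the standard DDPM corruption x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps; generation learns to reverse it. A minimal sketch:
```python
# Standard DDPM forward (noising) process: each step mixes a bit more Gaussian
# noise into the image; a learned model generates by reversing this chain.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) for a batch of images x0 scaled to [-1, 1]."""
    eps = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps
```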
arXiv Detail & Related papers (2023-09-02T03:27:20Z)
- Language Does More Than Describe: On The Lack Of Figurative Speech in Text-To-Image Models [63.545146807810305]
Text-to-image diffusion models can generate high-quality pictures from textual input prompts.
These models have been trained using text data collected from content-based labelling protocols.
We characterise the sentimentality, objectiveness and degree of abstraction of publicly available text data used to train current text-to-image diffusion models.
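The authors' exact measures are not given here; as a rough illustration of profiling caption corpora along two of these axes, TextBlob's built-in analyzer scores both sentiment polarity and subjectivity:
```python
# Rough illustration (not the paper's methodology): profile captions for
# sentiment polarity and subjectivity using TextBlob's built-in analyzer.
from textblob import TextBlob

captions = [
    "a dog sitting on a wooden bench",                 # descriptive, objective
    "the most heartbreakingly beautiful sunset ever",  # sentimental, subjective
]
for c in captions:
    s = TextBlob(c).sentiment
    print(f"{c!r}: polarity={s.polarity:+.2f}, subjectivity={s.subjectivity:.2f}")
```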
arXiv Detail & Related papers (2022-10-19T14:20:05Z)
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
Re-Imagen retrieves image-text pairs relevant to the prompt from an external knowledge base and conditions generation on them, improving fidelity for rare or unseen entities.
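A hedged sketch of the retrieval half of such a system, using CLIP text-to-image similarity over a pre-embedded knowledge base; Re-Imagen's actual retriever and conditioning mechanism are not reproduced here:
```python
# Sketch of the retrieval step in a retrieval-augmented generator: fetch the
# k knowledge-base images closest to the prompt in CLIP space, then pass them
# to the generator as extra conditioning (conditioning itself not shown).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve(prompt: str, kb_image_embs: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Return indices of the k KB images most similar to the prompt.
    kb_image_embs: pre-computed, L2-normalised CLIP image embeddings (N, D)."""
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    return (q @ kb_image_embs.T).topk(k).indices[0]
```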
arXiv Detail & Related papers (2022-09-29T00:57:28Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
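A hedged sketch of the dual-learning idea, assuming placeholder `generator`, `captioner`, and `caption_loss` modules rather than the authors' architecture: generated frames are re-captioned by an inverse model, and a caption-reconstruction term is added to the training loss.
```python
# Sketch of dual learning for story visualization: pair generation with its
# inverse task (captioning) and penalise drift from the input captions.
# `generator`, `captioner`, and `caption_loss` are placeholders, not the
# authors' modules.
import torch

def dual_learning_step(captions_batch, generator, captioner, caption_loss,
                       lambda_dual: float = 0.1) -> torch.Tensor:
    frames = generator(captions_batch)                 # captions -> image sequence
    recon = captioner(frames)                          # images -> captions (dual task)
    primary = generator.loss(frames, captions_batch)   # e.g. an adversarial loss
    dual = caption_loss(recon, captions_batch)         # caption-reconstruction term
    return primary + lambda_dual * dual
```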
arXiv Detail & Related papers (2021-05-20T20:42:42Z)