Related papers: Holistic Evaluation for Interleaved Text-and-Image Generation

URL: http://arxiv.org/abs/2406.14643v1
Date: Thu, 20 Jun 2024 18:07:19 GMT
Title: Holistic Evaluation for Interleaved Text-and-Image Generation
Authors: Minqian Liu, Zhiyang Xu, Zihao Lin, Trevor Ashby, Joy Rimchala, Jiaxin Zhang, Lifu Huang
Abstract summary: We introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation. In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o to deliver accurate and explainable evaluation.
Score: 19.041251355695973
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Interleaved text-and-image generation has been an intriguing research direction, where the models are required to generate both images and text pieces in an arbitrary order. Despite the emerging advancements in interleaved generation, the progress in its evaluation still significantly lags behind. Existing evaluation benchmarks do not support arbitrarily interleaved images and text for both inputs and outputs, and they only cover a limited number of domains and use cases. Also, current works predominantly use similarity-based metrics which fall short in assessing the quality in open-ended scenarios. To this end, we introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation. InterleavedBench features a rich array of tasks to cover diverse real-world use cases. In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o to deliver accurate and explainable evaluation. We carefully define five essential evaluation aspects for InterleavedEval, including text quality, perceptual quality, image coherence, text-image coherence, and helpfulness, to ensure a comprehensive and fine-grained assessment. Through extensive experiments and rigorous human evaluation, we show that our benchmark and metric can effectively evaluate the existing models with a strong correlation with human judgments surpassing previous reference-based metrics. We also provide substantial findings and insights to foster future research in interleaved generation and its evaluation.
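
The abstract reports that the metric correlates strongly with human judgments. As a rough illustration of how such agreement is commonly measured (not the authors' code), the sketch below computes a Spearman rank correlation between metric scores and human ratings for one aspect. The function names and the sample scores are hypothetical; the abstract does not state which correlation statistic or scoring scale the paper uses.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for a single aspect (e.g. helpfulness) on six outputs.
metric_scores = [4, 2, 5, 3, 1, 4]
human_ratings = [5, 2, 4, 3, 1, 3]
print(f"Spearman correlation: {spearman(metric_scores, human_ratings):.3f}")
```

The same computation would be repeated per aspect (text quality, perceptual quality, image coherence, text-image coherence, helpfulness) to compare a metric against human ratings on each one.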

Related papers

Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models [44.05134959039957]
Text-to-image models often struggle to generate images that precisely match textual prompts. Existing evaluations primarily focus on agreement with human assessments. We propose recommendations for improving image-text alignment evaluation.
arXiv Detail & Related papers (2025-06-10T06:11:36Z)
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation [27.336251972097077]
RefVNLI is a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. It outperforms or matches existing baselines across multiple benchmarks and subject categories. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.
arXiv Detail & Related papers (2025-04-24T12:44:51Z)
Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance [8.216807467478281]
Evaluating text-to-image synthesis is challenging due to misalignment between established metrics and human preferences. We propose cFreD, a metric that accounts for both visual fidelity and text-prompt alignment. Our findings validate cFreD as a robust, future-proof metric for the systematic evaluation of text-to-image models.
arXiv Detail & Related papers (2025-03-27T17:35:14Z)
Evaluating Image Caption via Cycle-consistent Text-to-Image Generation [24.455344211552692]
We propose CAMScore, a reference-free automatic evaluation metric for image captioning models. To circumvent the modality gap between images and text, CAMScore uses a text-to-image model to generate images from captions and then evaluates these generated images against the original images. Experimental results show that CAMScore achieves a superior correlation with human judgments compared to existing reference-based and reference-free metrics.
arXiv Detail & Related papers (2025-01-07T06:35:34Z)
Towards Automatic Evaluation for Image Transcreation [52.71090829502756]
We propose a suite of automatic evaluation metrics inspired by machine translation (MT) metrics. We identify three critical dimensions of image transcreation: cultural relevance, semantic equivalence and visual similarity. Our results show that proprietary VLMs best identify cultural relevance and semantic equivalence, while vision-encoder representations are adept at measuring visual similarity.
arXiv Detail & Related papers (2024-12-18T10:55:58Z)
TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models [39.06617653124486]
We introduce a new evaluation framework called TypeScore to assess a model's ability to generate images with high-fidelity embedded text. Our proposed metric demonstrates greater resolution than CLIPScore to differentiate popular image generation models.
arXiv Detail & Related papers (2024-11-02T07:56:54Z)
A Novel Evaluation Framework for Image2Text Generation [15.10524860121122]
We propose an evaluation framework rooted in a modern large language model (LLM) capable of image generation. A high similarity score suggests that the image captioning model has accurately generated textual descriptions. A low similarity score indicates discrepancies, revealing potential shortcomings in the model's performance.
arXiv Detail & Related papers (2024-08-03T09:27:57Z)
CoheSentia: A Novel Benchmark of Incremental versus Holistic Assessment of Coherence in Generated Texts [15.866519123942457]
We introduce CoheSentia, a novel benchmark of human-perceived coherence of automatically generated texts. Our benchmark contains 500 automatically-generated and human-annotated paragraphs, each annotated in both methods. Our analysis shows that the inter-annotator agreement in the incremental mode is higher than in the holistic alternative.
arXiv Detail & Related papers (2023-10-25T03:21:20Z)
Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment [48.835298314274254]
We propose to evaluate text-to-image generation performance by directly estimating the likelihood of the generated images. A higher likelihood indicates better perceptual quality and better text-image alignment. It can successfully assess the generation ability of these models with as few as a hundred samples.
arXiv Detail & Related papers (2023-08-16T17:26:47Z)
Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal. Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions. We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
TeTIm-Eval: a novel curated evaluation data set for comparing text-to-image models [1.1252184947601962]
Evaluating and comparing text-to-image models is a challenging problem. In this paper, a novel evaluation approach is tested, based on: (i) a curated data set, divided into ten categories; (ii) a quantitative metric, the CLIP-score; (iii) a human evaluation task to distinguish, for a given text, the real and the generated images. Early experimental results show that the accuracy of the human judgement is fully coherent with the CLIP-score.
arXiv Detail & Related papers (2022-12-15T13:52:03Z)
Evaluating and Improving Factuality in Multimodal Abstractive Summarization [91.46015013816083]
We propose CLIPBERTScore to leverage the robustness and strong factuality detection performance between image-summary and document-summary. We show that this simple combination of two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization. Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
arXiv Detail & Related papers (2022-11-04T16:50:40Z)
Towards a Unified Multi-Dimensional Evaluator for Text Generation [101.47008809623202]
We propose a unified multi-dimensional evaluator, UniEval, for Natural Language Generation (NLG). We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions. Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics.
arXiv Detail & Related papers (2022-10-13T17:17:03Z)
Just Rank: Rethinking Evaluation with Word and Sentence Similarities [105.5541653811528]
Intrinsic evaluation for embeddings lags far behind, and there has been no significant update in the past decade. This paper first points out the problems with using semantic similarity as the gold standard for word and sentence embedding evaluations. We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
arXiv Detail & Related papers (2022-03-05T08:40:05Z)
Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models. We show that human-generated captions are of substantially higher quality than machine-generated ones. We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.