VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation
- URL: http://arxiv.org/abs/2312.14867v2
- Date: Mon, 3 Jun 2024 16:59:20 GMT
- Title: VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation
- Authors: Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, Wenhu Chen
- Abstract summary: VIEScore is a Visual Instruction-guided Explainable metric for evaluating any conditional image generation tasks.
We evaluate VIEScore on seven prominent conditional image generation tasks and find that VIEScore (GPT-4o) achieves a high Spearman correlation of 0.4 with human evaluations, while the human-to-human correlation is 0.45.
VIEScore backed by open-source MLLMs is significantly weaker than with GPT-4o and GPT-4v at evaluating synthetic images.
- Score: 39.88401703956412
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the rapidly advancing field of conditional image generation research, challenges such as limited explainability complicate the effective evaluation of the performance and capabilities of various models. This paper introduces VIEScore, a Visual Instruction-guided Explainable metric for evaluating any conditional image generation task. VIEScore leverages the general knowledge of Multimodal Large Language Models (MLLMs) as its backbone and requires no training or fine-tuning. We evaluate VIEScore on seven prominent conditional image generation tasks and find that: (1) VIEScore (GPT-4o) achieves a high Spearman correlation of 0.4 with human evaluations, while the human-to-human correlation is 0.45. (2) VIEScore with open-source MLLMs is significantly weaker than with GPT-4o and GPT-4v at evaluating synthetic images. (3) VIEScore achieves a correlation on par with human ratings on generation tasks but struggles on editing tasks. Given these results, we believe VIEScore shows great potential to replace human judges in evaluating image synthesis tasks.
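The abstract describes VIEScore as an MLLM-as-judge metric: the backbone model is prompted to rate a synthesized image and explain its rating, and the metric's scores are compared to human ratings via Spearman correlation. Below is a minimal sketch of that workflow, assuming a hypothetical `query_mllm` callable and an illustrative 0-10 JSON rating prompt; the paper's actual prompts and score aggregation may differ.

```python
# Sketch of an MLLM-as-judge metric in the spirit of VIEScore.
# `query_mllm`, the prompt wording, and the score scale are illustrative
# assumptions, not the paper's exact implementation.
import json
from scipy.stats import spearmanr

RATING_PROMPT = (
    "You are given a text instruction and a synthesized image. "
    "Rate how well the image follows the instruction on a 0-10 scale and "
    "explain your reasoning. Reply as JSON: "
    '{"score": <int>, "rationale": "<one sentence>"}'
)

def rate_image(query_mllm, instruction: str, image_path: str) -> dict:
    """Ask a multimodal LLM for an explainable 0-10 rating of one image.

    `query_mllm` is a user-supplied callable (hypothetical) that sends a
    text prompt plus an image to the backbone MLLM and returns its reply.
    """
    reply = query_mllm(prompt=f"{RATING_PROMPT}\nInstruction: {instruction}",
                       image=image_path)
    return json.loads(reply)  # {"score": ..., "rationale": ...}

def correlate_with_humans(metric_scores, human_scores) -> float:
    """Spearman rank correlation between metric and human ratings,
    the statistic reported in the abstract (0.4 vs. 0.45 human-to-human)."""
    rho, _ = spearmanr(metric_scores, human_scores)
    return rho
```

Because the backbone is used zero-shot, swapping GPT-4o for an open-source MLLM only changes the `query_mllm` client, which is how the paper compares closed- and open-source backbones.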
Related papers
- LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization [9.364214238045317]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various software engineering tasks.
In this study, we investigate whether LLMs can evaluate bug report summarization effectively.
arXiv Detail & Related papers (2024-09-01T06:30:39Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- Evaluating Image Review Ability of Vision Language Models [25.846728716526766]
This paper explores the use of large-scale vision language models (LVLMs) to generate review texts for images.
The ability of LVLMs to review images is not fully understood, highlighting the need for a methodical evaluation of their review abilities.
arXiv Detail & Related papers (2024-02-19T13:16:10Z)
- Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z)
- GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translation and multi-image-to-text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z)
- Collaborative Evaluation: Exploring the Synergy of Large Language Models and Humans for Open-ended Generation Evaluation [71.76872586182981]
Large language models (LLMs) have emerged as a scalable and cost-effective alternative to human evaluations.
We propose a Collaborative Evaluation pipeline CoEval, involving the design of a checklist of task-specific criteria and the detailed evaluation of texts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with humans on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)