VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation
- URL: http://arxiv.org/abs/2312.14867v2
- Date: Mon, 3 Jun 2024 16:59:20 GMT
- Title: VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation
- Authors: Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, Wenhu Chen
- Abstract summary: VIEScore is a Visual Instruction-guided Explainable metric for evaluating any conditional image generation tasks.
We evaluate VIEScore on seven prominent conditional image generation tasks and find that VIEScore (GPT-4o) achieves a high Spearman correlation of 0.4 with human evaluations, while the human-to-human correlation is 0.45.
VIEScore backed by open-source MLLMs is significantly weaker than with GPT-4o and GPT-4v at evaluating synthetic images.
- Score: 39.88401703956412
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the rapidly advancing field of conditional image generation research, challenges such as limited explainability complicate the effective evaluation of the performance and capabilities of various models. This paper introduces VIEScore, a Visual Instruction-guided Explainable metric for evaluating any conditional image generation task. VIEScore leverages the general knowledge of Multimodal Large Language Models (MLLMs) as its backbone and requires no training or fine-tuning. We evaluate VIEScore on seven prominent conditional image generation tasks and find that: (1) VIEScore (GPT-4o) achieves a high Spearman correlation of 0.4 with human evaluations, while the human-to-human correlation is 0.45. (2) VIEScore with open-source MLLMs is significantly weaker than with GPT-4o and GPT-4v at evaluating synthetic images. (3) VIEScore achieves a correlation on par with human ratings on generation tasks but struggles on editing tasks. Given these results, we believe VIEScore shows great potential to replace human judges in evaluating image synthesis tasks.
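The abstract describes VIEScore as an MLLM-as-judge metric: the backbone model is prompted to rate a synthesized image and explain its rating, and the metric's scores are compared to human ratings via Spearman correlation. Below is a minimal sketch of that workflow, assuming a hypothetical `query_mllm` callable and an illustrative 0-10 JSON rating prompt; the paper's actual prompts and score aggregation may differ.

```python
# Sketch of an MLLM-as-judge metric in the spirit of VIEScore.
# `query_mllm`, the prompt wording, and the score scale are illustrative
# assumptions, not the paper's exact implementation.
import json
from scipy.stats import spearmanr

RATING_PROMPT = (
    "You are given a text instruction and a synthesized image. "
    "Rate how well the image follows the instruction on a 0-10 scale and "
    "explain your reasoning. Reply as JSON: "
    '{"score": <int>, "rationale": "<one sentence>"}'
)

def rate_image(query_mllm, instruction: str, image_path: str) -> dict:
    """Ask a multimodal LLM for an explainable 0-10 rating of one image.

    `query_mllm` is a user-supplied callable (hypothetical) that sends a
    text prompt plus an image to the backbone MLLM and returns its reply.
    """
    reply = query_mllm(prompt=f"{RATING_PROMPT}\nInstruction: {instruction}",
                       image=image_path)
    return json.loads(reply)  # {"score": ..., "rationale": ...}

def correlate_with_humans(metric_scores, human_scores) -> float:
    """Spearman rank correlation between metric and human ratings,
    the statistic reported in the abstract (0.4 vs. 0.45 human-to-human)."""
    rho, _ = spearmanr(metric_scores, human_scores)
    return rho
```

Because the backbone is used zero-shot, swapping GPT-4o for an open-source MLLM only changes the `query_mllm` client, which is how the paper compares closed- and open-source backbones.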
Related papers
- LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization [9.364214238045317]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various software engineering tasks.
In this study, we investigate whether LLMs can evaluate bug report summarization effectively.
arXiv Detail & Related papers (2024-09-01T06:30:39Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- Evaluating Image Review Ability of Vision Language Models [25.846728716526766]
This paper explores the use of large-scale vision language models (LVLMs) to generate review texts for images.
The ability of LVLMs to review images is not fully understood, highlighting the need for a methodical evaluation of their review abilities.
arXiv Detail & Related papers (2024-02-19T13:16:10Z)
- Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z)
- GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translation and multi-image-to-text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z)
- Collaborative Evaluation: Exploring the Synergy of Large Language Models and Humans for Open-ended Generation Evaluation [71.76872586182981]
Large language models (LLMs) have emerged as a scalable and cost-effective alternative to human evaluations.
We propose a Collaborative Evaluation pipeline CoEval, involving the design of a checklist of task-specific criteria and the detailed evaluation of texts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with humans on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)