A Unified Agentic Framework for Evaluating Conditional Image Generation
- URL: http://arxiv.org/abs/2504.07046v1
- Date: Wed, 09 Apr 2025 17:04:14 GMT
- Title: A Unified Agentic Framework for Evaluating Conditional Image Generation
- Authors: Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
- Abstract summary: Conditional image generation has gained significant attention for its ability to personalize content. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks.
- Score: 66.25099219134441
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Moreover, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. Case studies on GPT-4o image generation highlight CIGEval's capability in identifying subtle issues related to subject consistency and adherence to control guidance, indicating its great potential for automating evaluation of image generation tasks with human-level reliability.
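To make the agentic setup concrete, here is a minimal sketch of how such a tool-using evaluation loop could be organized: an LMM repeatedly selects a tool, reads its output, and finally emits a fine-grained score. The tool names, prompt format, and 0-10 scale are illustrative assumptions, not CIGEval's actual interface.

```python
# Hypothetical sketch of an agentic evaluation loop in the spirit of CIGEval.
# Tool names, prompts, and the 0-10 scale are illustrative assumptions,
# not taken from the paper's implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    condition: str        # e.g. a text prompt or a reference subject image path
    generated_image: str  # path to the image under evaluation

# A multi-functional toolbox: each tool returns a textual observation
# the LMM can reason over (assumed tools, for illustration only).
TOOLBOX: dict[str, Callable[[EvalTask], str]] = {
    "highlight_difference": lambda t: f"diff map between condition and {t.generated_image}",
    "crop_subject":         lambda t: f"crop of the main subject in {t.generated_image}",
    "zoom_in":              lambda t: f"zoomed view of {t.generated_image}",
}

def lmm(prompt: str) -> str:
    """Placeholder for a large multimodal model call (e.g. GPT-4o or a 7B LMM).
    A real implementation would query the model; this stub ends the tool loop
    immediately and returns a fixed score so the sketch runs end to end."""
    return "DONE" if "Pick one tool" in prompt else "7"

def evaluate(task: EvalTask, max_steps: int = 4) -> float:
    """Let the LMM pick tools, observe their outputs, then emit a fine-grained score."""
    observations: list[str] = []
    for _ in range(max_steps):
        choice = lmm(
            f"Condition: {task.condition}\nObservations: {observations}\n"
            f"Pick one tool from {list(TOOLBOX)} or answer DONE."
        )
        if choice == "DONE":
            break
        observations.append(TOOLBOX[choice](task))
    # Final judgment grounded in the collected tool outputs.
    verdict = lmm(f"Given observations {observations}, score the image from 0 to 10.")
    return float(verdict)

print(evaluate(EvalTask("a corgi wearing sunglasses", "out.png")))  # -> 7.0 from the stub
```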
Related papers
- ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing [23.512687688393346]
ICE-Bench is a comprehensive benchmark designed to rigorously assess image generation models. The evaluation framework assesses image generation capabilities across 6 dimensions. We conduct a thorough analysis of existing generation models, revealing both the challenging nature of our benchmark and the gap between current model capabilities and real-world generation requirements.
arXiv Detail & Related papers (2025-03-18T17:53:29Z)
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [77.86514804787622]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks.
We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation.
We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
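As an illustration of reward-model-guided generation, the sketch below applies best-of-n selection at every decoding step; `generate_step` and `reward` are hypothetical stand-ins rather than PARM's real interface.

```python
# Minimal sketch of reward-guided, step-by-step image generation in the spirit
# of verify-and-reinforce decoding. `generate_step` and `reward` are
# hypothetical stubs, not PARM's actual interface.
import random

def generate_step(partial: list[int]) -> list[int]:
    """Stub: propose the next block of image tokens for an autoregressive model."""
    return partial + [random.randrange(1024)]

def reward(partial: list[int]) -> float:
    """Stub: a PARM-style potential score estimating how promising this prefix is."""
    return random.random()

def best_of_n_decode(steps: int = 8, n: int = 4) -> list[int]:
    """At each step, sample n candidate continuations and keep the highest-reward one."""
    tokens: list[int] = []
    for _ in range(steps):
        candidates = [generate_step(tokens) for _ in range(n)]
        tokens = max(candidates, key=reward)
    return tokens

print(best_of_n_decode())
```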
arXiv Detail & Related papers (2025-01-23T18:59:43Z)
- OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
- Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment [53.45813302866466]
We present ISG, a comprehensive evaluation framework for interleaved text-and-image generation.
ISG evaluates responses on four levels of granularity: holistic, structural, block-level, and image-specific.
In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories.
arXiv Detail & Related papers (2024-11-26T07:55:57Z)
- Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-based baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
- Language Model Preference Evaluation with Multiple Weak Evaluators [78.53743237977677]
GED (Preference Graph Ensemble and Denoise) is a novel approach that leverages multiple model-based evaluators to construct preference graphs. In particular, our method consists of two primary stages: aggregating evaluations into a unified graph and applying a denoising process. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure.
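A toy rendering of the two stages, with an assumed majority-vote aggregation and a simple tie-dropping heuristic standing in for the paper's actual denoising procedure:

```python
# Illustrative sketch of ensembling pairwise preferences into a graph and
# denoising it; the aggregation and tie-dropping heuristics here are
# assumptions for exposition, not GED's actual algorithm.
from collections import defaultdict
from itertools import combinations

def aggregate(preferences: list[list[tuple[str, str]]]) -> dict[tuple[str, str], int]:
    """Each evaluator contributes (winner, loser) edges; sum them into one weighted graph."""
    weights: dict[tuple[str, str], int] = defaultdict(int)
    for evaluator in preferences:
        for winner, loser in evaluator:
            weights[(winner, loser)] += 1
    return weights

def denoise(weights: dict[tuple[str, str], int]) -> set[tuple[str, str]]:
    """Keep, for each pair, only the majority direction; drop ties (a simple heuristic)."""
    kept: set[tuple[str, str]] = set()
    items = {node for edge in weights for node in edge}
    for a, b in combinations(sorted(items), 2):
        if weights.get((a, b), 0) > weights.get((b, a), 0):
            kept.add((a, b))
        elif weights.get((b, a), 0) > weights.get((a, b), 0):
            kept.add((b, a))
    return kept

# Three weak evaluators vote on model outputs A, B, C.
votes = [[("A", "B"), ("B", "C")], [("A", "B"), ("C", "B")], [("A", "C"), ("B", "C")]]
print(denoise(aggregate(votes)))  # majority edges: ("A","B"), ("A","C"), ("B","C")
```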
arXiv Detail & Related papers (2024-10-14T01:57:25Z)
- Global-Local Image Perceptual Score (GLIPS): Evaluating Photorealistic Quality of AI-Generated Images [0.7499722271664147]
The Global-Local Image Perceptual Score (GLIPS) is a metric designed to assess the photorealistic quality of AI-generated images.
Comprehensive tests across various generative models demonstrate that GLIPS consistently outperforms existing metrics like FID, SSIM, and MS-SSIM in terms of correlation with human scores.
arXiv Detail & Related papers (2024-05-15T15:19:23Z)
- GMC-IQA: Exploiting Global-correlation and Mean-opinion Consistency for No-reference Image Quality Assessment [40.33163764161929]
We construct a novel loss function and network to exploit Global-correlation and Mean-opinion Consistency.
We propose a novel GCC loss, defining a pairwise preference-based rank estimation to address the non-differentiability of SROCC.
We also propose a mean-opinion network, which integrates diverse opinion features to alleviate the randomness of weight learning.
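The non-differentiability issue can be made concrete with a generic surrogate: replacing the hard pairwise comparisons inside SROCC with sigmoids yields a smooth objective. The sketch below shows that common trick for illustration; it is not the paper's exact GCC loss.

```python
# A generic differentiable surrogate for rank correlation: replace the hard
# pairwise comparisons inside SROCC with sigmoids so gradients can flow.
# Shown for illustration only; not the paper's exact GCC loss.
import numpy as np

def soft_rank_agreement(pred: np.ndarray, mos: np.ndarray, tau: float = 0.1) -> float:
    """Average soft agreement between predicted-score order and mean-opinion-score order.

    For every pair (i, j), sigmoid((pred_i - pred_j)/tau) approximates the
    indicator [pred_i > pred_j]; matching it against the MOS order yields a
    smooth, maximizable proxy for SROCC (1.0 = perfectly concordant).
    """
    n = len(pred)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            p = 1.0 / (1.0 + np.exp(-(pred[i] - pred[j]) / tau))
            target = 1.0 if mos[i] > mos[j] else 0.0
            total += p * target + (1.0 - p) * (1.0 - target)
            pairs += 1
    return total / pairs

pred = np.array([0.9, 0.4, 0.7])
mos = np.array([4.5, 2.0, 3.8])  # human mean opinion scores
print(round(soft_rank_agreement(pred, mos), 3))  # close to 1.0: the orders agree
```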
arXiv Detail & Related papers (2024-01-19T06:03:01Z)
- VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation [39.88401703956412]
VIEScore is a Visual Instruction-guided Explainable metric for evaluating any conditional image generation tasks.
We evaluate VIEScore on seven prominent conditional image generation tasks and find that VIEScore (GPT-4o) achieves a high Spearman correlation of 0.4 with human evaluations, while the human-to-human correlation is 0.45.
VIEScore (with open-source MLLM) is significantly weaker than GPT-4o and GPT-4v in evaluating synthetic images.
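Correlation figures like 0.4 versus 0.45 come from comparing a metric's scores with human ratings over the same images; a generic illustration with made-up numbers:

```python
# How metric-vs-human agreement numbers like these are typically computed:
# Spearman rank correlation between a metric's scores and human ratings on
# the same images. The scores below are made up for illustration.
from scipy.stats import spearmanr

metric_scores = [7.2, 3.1, 5.5, 8.0, 4.4]   # one score per generated image
human_scores  = [6.5, 2.0, 6.0, 5.5, 3.0]   # averaged annotator ratings

rho, pvalue = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")  # 0.70 here; 1.0 would mean identical rankings
```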
arXiv Detail & Related papers (2023-12-22T17:45:19Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Generalized Visual Quality Assessment of GAN-Generated Face Images [79.47386781978531]
We study subjective and objective quality toward generalized quality assessment of GAN-generated face images (GFIs).
We develop a quality assessment model that is able to deliver accurate quality predictions for GFIs from both available and unseen GAN algorithms.
arXiv Detail & Related papers (2022-01-28T07:54:49Z)