Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content
- URL: http://arxiv.org/abs/2511.16908v1
- Date: Fri, 21 Nov 2025 02:43:17 GMT
- Title: Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content
- Authors: Shushi Wang, Zicheng Zhang, Chunyi Li, Wei Wang, Liya Ma, Fengjiao Chen, Xiaoyu Li, Xuezhi Cao, Guangtao Zhai, Xiaohong Liu
- Abstract summary: We introduce Q-Real, a novel dataset for fine-grained evaluation of realism and plausibility in AI-generated images. Q-Real consists of 3,088 images generated by popular text-to-image models. We construct Q-Real Bench to evaluate multi-modal large language models on two tasks: judgment and grounding with reasoning.
- Score: 71.46991494014382
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Quality assessment of AI-generated content is crucial for evaluating model capability and guiding model optimization. However, most existing quality assessment datasets and models provide only a single quality score, which is too coarse to offer targeted guidance for improving generative models. In current applications of AI-generated images, realism and plausibility are two critical dimensions, and with the emergence of unified generation-understanding models, fine-grained evaluation along these dimensions becomes especially effective for improving generative performance. Therefore, we introduce Q-Real, a novel dataset for fine-grained evaluation of realism and plausibility in AI-generated images. Q-Real consists of 3,088 images generated by popular text-to-image models. For each image, we annotate the locations of major entities and provide a set of judgment questions and attribution descriptions for these entities along the dimensions of realism and plausibility. Considering that recent advances in multi-modal large language models (MLLMs) enable fine-grained evaluation of AI-generated images, we construct Q-Real Bench to evaluate them on two tasks: judgment and grounding with reasoning. Finally, to enhance MLLM capabilities, we design a fine-tuning framework and conduct experiments on multiple MLLMs using our dataset. Experimental results demonstrate the high quality and significance of our dataset and the comprehensiveness of the benchmark. Dataset and code will be released upon publication.
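The abstract does not fix an annotation schema, so the following is only a minimal sketch of what a Q-Real-style record and the judgment task might look like; all field names and the `ask_mllm` callback are hypothetical assumptions, not the released format:

```python
# Hypothetical sketch of a Q-Real-style annotation record and the judgment
# task. Field names and the ask_mllm() callback are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class EntityAnnotation:
    bbox: tuple[float, float, float, float]  # entity location (x1, y1, x2, y2)
    judgments: dict[str, bool]               # question -> ground-truth yes/no label
    attribution: str                         # free-text reason for the judgment


@dataclass
class QRealSample:
    image_path: str
    generator: str                           # text-to-image model that produced the image
    entities: list[EntityAnnotation] = field(default_factory=list)


def judgment_accuracy(samples: list[QRealSample], ask_mllm) -> float:
    """Score an MLLM on the judgment task: the fraction of realism/plausibility
    yes/no questions it answers correctly. ask_mllm(image_path, question)
    is assumed to return a bool."""
    correct = total = 0
    for sample in samples:
        for entity in sample.entities:
            for question, label in entity.judgments.items():
                correct += int(ask_mllm(sample.image_path, question) == label)
                total += 1
    return correct / max(total, 1)
```

The grounding-with-reasoning task would additionally compare predicted entity locations against the annotated `bbox` fields, but the paper's exact matching criterion is not given in the abstract.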
Related papers
- Evaluating and Preserving High-level Fidelity in Super-Resolution [50.65679806442527]
Super-Resolution (SR) models are achieving impressive results in reconstructing details and delivering visually pleasing outputs. However, the overpowering generative ability can sometimes hallucinate and thus change the image content. This type of high-level change can be easily identified by humans, yet it is not well studied by existing low-level image quality metrics.
arXiv Detail & Related papers (2025-12-07T22:53:34Z)
- UniREditBench: A Unified Reasoning-based Image Editing Benchmark [52.54256348710893]
This work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings.
arXiv Detail & Related papers (2025-11-03T07:24:57Z)
- Human-like Content Analysis for Generative AI with Language-Grounded Sparse Encoders [46.13876748421428]
Language-Grounded Sparse Encoders (LanSE) decompose images into interpretable visual patterns with natural language descriptions. Our method discovers more than 5,000 visual patterns with 93% human agreement. Our method's capability to extract language-grounded patterns can be naturally adapted to numerous fields.
arXiv Detail & Related papers (2025-08-20T06:50:15Z)
- Quality Assessment and Distortion-aware Saliency Prediction for AI-Generated Omnidirectional Images [70.49595920462579]
This work studies the quality assessment and distortion-aware saliency prediction problems for AIGODIs. We propose two models with shared encoders based on the BLIP-2 model to evaluate the human visual experience and predict distortion-aware saliency for AI-generated omnidirectional images.
arXiv Detail & Related papers (2025-06-27T05:36:04Z)
- RAISE: Realness Assessment for Image Synthesis and Evaluation [3.7619101673213664]
We develop and train models on RAISE to establish baselines for realness prediction. Our experimental results demonstrate that features derived from deep foundation vision models can effectively capture the subjective realness.
arXiv Detail & Related papers (2025-05-25T17:14:43Z)
- ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing [23.512687688393346]
ICE-Bench is a comprehensive benchmark designed to rigorously assess image generation models. The evaluation framework assesses image generation capabilities across 6 dimensions. We conduct a thorough analysis of existing generation models, revealing both the challenging nature of our benchmark and the gap between current model capabilities and real-world generation requirements.
arXiv Detail & Related papers (2025-03-18T17:53:29Z)
- M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment [65.3860007085689]
M3-AGIQA is a comprehensive framework that enables more human-aligned, holistic evaluation of AI-generated images. By aligning model outputs more closely with human judgment, M3-AGIQA delivers robust and interpretable quality scores.
arXiv Detail & Related papers (2025-02-21T03:05:45Z)
- EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models [16.18275805302776]
We propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity.
We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment.
EvalAlign aligns more closely with human preferences than existing metrics, confirming its effectiveness and utility in model assessment.
arXiv Detail & Related papers (2024-06-24T11:56:15Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate, for example, that leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points relative to the baseline.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- On quantifying and improving realism of images generated with diffusion [50.37578424163951]
We propose a metric, called Image Realism Score (IRS), computed from five statistical measures of a given image.
IRS is easily usable as a measure to classify a given image as real or fake.
We experimentally establish the model- and data-agnostic nature of the proposed IRS by successfully detecting fake images generated by Stable Diffusion Model (SDM), Dalle2, Midjourney and BigGAN.
Our efforts have also led to the Gen-100 dataset, which provides 1,000 samples for 100 classes generated by four high-quality models.
arXiv Detail & Related papers (2023-09-26T08:32:55Z)
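The summary names neither the five statistical measures nor their combination rule, so the sketch below is purely illustrative: it combines five generic per-image statistics with equal weights to show the overall shape of an IRS-style real-versus-fake classifier, not the paper's actual formula:

```python
# Illustrative IRS-style realism score. The five statistics, their
# normalizers, and the equal weights are stand-ins, not the measures
# the paper actually uses.
import numpy as np


def image_realism_score(image: np.ndarray) -> float:
    """Combine five generic statistics of an HxWx3 uint8 image
    into a single score in [0, 1]."""
    gray = image.mean(axis=2)
    gx, gy = np.gradient(gray)
    stats = np.array([
        gray.std() / 128.0,                                 # global contrast
        np.hypot(gx, gy).mean() / 64.0,                     # edge energy
        np.histogram(gray, bins=32)[0].std() / gray.size,   # histogram spread
        np.abs(np.fft.fft2(gray))[1:, 1:].mean() / 1e4,     # spectral energy
        1.0 - abs(gray.mean() - 128.0) / 128.0,             # exposure balance
    ])
    weights = np.full(5, 0.2)  # equal weights, an assumption for the sketch
    return float(np.clip(weights @ np.clip(stats, 0.0, 1.0), 0.0, 1.0))


def classify_real(image: np.ndarray, threshold: float = 0.5) -> bool:
    """Threshold the score to a real/fake decision, mirroring how the
    abstract says IRS is used to classify an image as real or fake."""
    return image_realism_score(image) >= threshold
```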