Text-Visual Semantic Constrained AI-Generated Image Quality Assessment
- URL: http://arxiv.org/abs/2507.10432v3
- Date: Wed, 16 Jul 2025 15:47:15 GMT
- Title: Text-Visual Semantic Constrained AI-Generated Image Quality Assessment
- Authors: Qiang Li, Qingsen Yan, Haojian Huang, Peng Wu, Haokui Zhang, Yanning Zhang,
- Abstract summary: We propose a unified framework to enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images.<n>Our approach integrates key capabilities from multiple models and tackles the aforementioned challenges by introducing two core modules.<n>Tests conducted on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing state-of-the-art methods.
- Score: 47.575342788480505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancements in Artificial Intelligence Generated Image (AGI) technology, the accurate assessment of their quality has become an increasingly vital requirement. Prevailing methods typically rely on cross-modal models like CLIP or BLIP to evaluate text-image alignment and visual quality. However, when applied to AGIs, these methods encounter two primary challenges: semantic misalignment and details perception missing. To address these limitations, we propose Text-Visual Semantic Constrained AI-Generated Image Quality Assessment (SC-AGIQA), a unified framework that leverages text-visual semantic constraints to significantly enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images. Our approach integrates key capabilities from multiple models and tackles the aforementioned challenges by introducing two core modules: the Text-assisted Semantic Alignment Module (TSAM), which leverages Multimodal Large Language Models (MLLMs) to bridge the semantic gap by generating an image description and comparing it against the original prompt for a refined consistency check, and the Frequency-domain Fine-Grained Degradation Perception Module (FFDPM), which draws inspiration from Human Visual System (HVS) properties by employing frequency domain analysis combined with perceptual sensitivity weighting to better quantify subtle visual distortions and enhance the capture of fine-grained visual quality details in images. Extensive experiments conducted on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing state-of-the-art methods. The code is publicly available at https://github.com/mozhu1/SC-AGIQA.
Related papers
- Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring [26.174094671736686]
We propose a novel quality-driven data selection pipeline for visual instruction tuning datasets.<n>It integrates a cross-modality assessment framework that first assigns each data entry to its appropriate vision-language task.<n>It generates general and task-specific captions, and evaluates the alignment, clarity, task rarity, text coherence, and image clarity of each entry.
arXiv Detail & Related papers (2025-06-10T04:04:58Z) - ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning [76.2503352325492]
ControlThinker is a novel framework that employs a "comprehend-then-generate" paradigm.<n>Latent semantics from control images are mined to enrich text prompts.<n>This enriched semantic understanding then seamlessly aids in image generation without the need for additional complex modifications.
arXiv Detail & Related papers (2025-06-04T05:56:19Z) - Language-Guided Visual Perception Disentanglement for Image Quality Assessment and Conditional Image Generation [48.642826318384294]
Contrastive vision-language models, such as CLIP, have demonstrated excellent zero-shot capability across semantic recognition tasks.<n>This paper presents a new multimodal disentangled representation learning framework, which leverages disentangled text to guide image disentanglement.
arXiv Detail & Related papers (2025-03-04T02:36:48Z) - M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment [65.3860007085689]
M3-AGIQA is a comprehensive framework that enables more human-aligned, holistic evaluation of AI-generated images.<n>By aligning model outputs more closely with human judgment, M3-AGIQA delivers robust and interpretable quality scores.
arXiv Detail & Related papers (2025-02-21T03:05:45Z) - D-Judge: How Far Are We? Evaluating the Discrepancies Between AI-synthesized Images and Natural Images through Multimodal Guidance [19.760989919485894]
We introduce an AI-Natural Image Discrepancy accessing benchmark (textitD-Judge)<n>We construct textitD-ANI, a dataset with 5,000 natural images and over 440,000 AIGIs generated by nine models using Text-to-Image (T2I), Image-to-Image (I2I), and Text and Image-to-Image (TI2I) prompts.<n>Our framework evaluates the discrepancy across five dimensions: naive image quality, semantic alignment, aesthetic appeal, downstream applicability, and human validation.
arXiv Detail & Related papers (2024-12-23T15:08:08Z) - Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality Assessment [57.07360640784803]
We propose vision-language consistency guided multi-modal prompt learning for blind image quality assessment (AGIQA)
Specifically, we introduce learnable textual and visual prompts in language and vision branches of Contrastive Language-Image Pre-training (CLIP) models.
We design a text-to-image alignment quality prediction task, whose learned vision-language consistency knowledge is used to guide the optimization of the above multi-modal prompts.
arXiv Detail & Related papers (2024-06-24T13:45:31Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - Improving Scene Text Image Super-resolution via Dual Prior Modulation
Network [20.687100711699788]
Scene text image super-resolution (STISR) aims to simultaneously increase the resolution and legibility of the text images.
Existing approaches neglect the global structure of the text, which bounds the semantic determinism of the scene text.
Our work proposes a plug-and-play module dubbed Dual Prior Modulation Network (DPMN), which leverages dual image-level priors to bring performance gain over existing approaches.
arXiv Detail & Related papers (2023-02-21T02:59:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.