Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment
- URL: http://arxiv.org/abs/2507.19002v1
- Date: Fri, 25 Jul 2025 07:01:50 GMT
- Title: Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment
- Authors: Ying Ba, Tianyu Zhang, Yalong Bai, Wenyi Mo, Tao Liang, Bing Su, Ji-Rong Wen
- Abstract summary: We propose a novel evaluation score, ICT (Image-Contained-Text) score, that achieves and surpasses the objectives of text-image alignment.
We further train an HP (High-Preference) score model using solely the image modality to enhance image aesthetics and detail quality.
- Score: 63.823383517957986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment. However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned from CLIP and BLIP architectures have inherent flaws: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. To address this issue, we design a novel evaluation score, the ICT (Image-Contained-Text) score, which achieves and surpasses the objectives of text-image alignment by assessing the degree to which images represent textual content. Building upon this foundation, we further train an HP (High-Preference) score model using solely the image modality to enhance image aesthetics and detail quality while maintaining text-image alignment. Experiments demonstrate that the proposed evaluation model improves scoring accuracy by over 10% compared to existing methods, and achieves significant results in optimizing state-of-the-art text-to-image models. This research provides theoretical and empirical support for evolving image generation technology toward higher-order human aesthetic preferences. Code is available at https://github.com/BarretBa/ICTHP.
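The abstract describes two components: an ICT score measuring how fully an image contains the prompt's textual content, and an image-only HP score for aesthetics and detail. The authors' trained models live in the linked repository; the sketch below is only a minimal illustration of how such a two-part reward could be wired up on a stock CLIP backbone. The checkpoint choice, the untrained `hp_head`, and the equal weighting are assumptions, not the paper's implementation.

```python
# Illustrative two-part reward: a CLIP-similarity stand-in for the ICT score
# plus an image-only preference head standing in for the HP score.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Stand-in for the trained HP model: an untrained head on CLIP image features.
hp_head = torch.nn.Sequential(
    torch.nn.Linear(512, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1)
)

@torch.no_grad()
def reward(image: Image.Image, prompt: str, w_ict: float = 0.5, w_hp: float = 0.5) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    ict = (img @ txt.T).item()     # ICT-style proxy: does the image contain the text content?
    hp = hp_head(img).item()       # HP-style proxy: image-only aesthetic/detail preference
    return w_ict * ict + w_hp * hp
```

In the paper the HP model is trained on human preference data from images alone; the randomly initialized head here only shows where such a model plugs into the combined reward.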
Related papers
- Scene Perceived Image Perceptual Score (SPIPS): combining global and local perception for image quality assessment [0.0]
We propose a novel IQA approach that bridges the gap between deep learning methods and human perception.
Our model disentangles deep features into high-level semantic information and low-level perceptual details, treating each stream separately.
This hybrid design enables the model to assess both global context and intricate image details, better reflecting the human visual process.
arXiv Detail & Related papers (2025-04-24T04:06:07Z)
- TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models [39.06617653124486]
We introduce a new evaluation framework called TypeScore to assess a model's ability to generate images with high-fidelity embedded text.
Our proposed metric demonstrates finer resolution than CLIPScore in differentiating popular image generation models (a sketch of the standard CLIPScore computation follows this entry).
arXiv Detail & Related papers (2024-11-02T07:56:54Z)
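Since TypeScore is positioned against CLIPScore, a reminder of what CLIPScore computes may help: 2.5 · max(cos(image embedding, text embedding), 0). The sketch below uses the off-the-shelf torchmetrics implementation; the checkpoint name and the random stand-in image are illustrative choices.

```python
# CLIPScore via torchmetrics: 2.5 * max(cosine(image_emb, text_emb), 0).
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
# A random image tensor stands in for a generated sample.
image = torch.randint(255, (3, 224, 224))
score = metric(image, "a photo of a cat")
print(float(score))  # higher = caption better reflected in the image
```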
- KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities [93.74881034001312]
We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
arXiv Detail & Related papers (2024-10-15T17:50:37Z)
- A Survey on Quality Metrics for Text-to-Image Generation [9.753473063305503]
AI-based text-to-image models not only excel at generating realistic images, they also give designers increasingly fine-grained control over the image content.
These approaches have gathered increased attention within the computer graphics research community.
We provide a comprehensive overview of such text-to-image quality metrics, and propose a taxonomy to categorize these metrics.
arXiv Detail & Related papers (2024-03-18T14:24:20Z)
- Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis [21.619269792415903]
We present an empirical study introducing a nuanced evaluation framework for text-to-image (T2I) generative models.
Our framework categorizes evaluations into two distinct groups: first, focusing on image qualities such as aesthetics and realism, and second, examining text conditions through concept coverage and fairness.
arXiv Detail & Related papers (2024-03-08T07:41:47Z)
- ENTED: Enhanced Neural Texture Extraction and Distribution for Reference-based Blind Face Restoration [51.205673783866146]
We present ENTED, a new framework for blind face restoration that aims to restore high-quality and realistic portrait images.
We utilize a texture extraction and distribution framework to transfer high-quality texture features between the degraded input and reference image.
The StyleGAN-like architecture in our framework requires high-quality latent codes to generate realistic images.
arXiv Detail & Related papers (2024-01-13T04:54:59Z)
- Holistic Evaluation of Text-To-Image Models [153.47415461488097]
We introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM)
We identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency.
Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths.
arXiv Detail & Related papers (2023-11-07T19:00:56Z)
- Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack [75.00066365801993]
Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text.
These pre-trained models often face challenges when it comes to generating highly aesthetic images.
We propose quality-tuning to guide a pre-trained model to exclusively generate highly visually appealing images.
arXiv Detail & Related papers (2023-09-27T17:30:19Z)
- Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment [48.835298314274254]
We propose to evaluate text-to-image generation performance by directly estimating the likelihood of the generated images.
A higher likelihood indicates better perceptual quality and better text-image alignment.
The method can reliably assess the generation ability of these models with as few as a hundred samples (a toy illustration of patch-level likelihood scoring follows this entry).
arXiv Detail & Related papers (2023-08-16T17:26:47Z)
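The entry above scores generation by the likelihood a model assigns to generated images, with credit assigned at patch level. As a loud simplification, the toy sketch below replaces the paper's generative likelihood model with a Gaussian fitted to reference-image patches; it only shows the shape of patch-level log-likelihood scoring, not the paper's method.

```python
# Toy patch-level likelihood scoring: a Gaussian over reference patches
# stands in for a real generative model's likelihood.
import numpy as np

def extract_patches(img: np.ndarray, size: int = 8) -> np.ndarray:
    # Non-overlapping size x size patches, flattened; img is a 2-D grayscale array.
    h, w = img.shape
    patches = [img[i:i + size, j:j + size].ravel()
               for i in range(0, h - size + 1, size)
               for j in range(0, w - size + 1, size)]
    return np.stack(patches).astype(np.float64)

def fit_gaussian(reference_imgs):
    # Fit one Gaussian to all reference patches (toy stand-in for a real model).
    x = np.concatenate([extract_patches(im) for im in reference_imgs])
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False) + 1e-3 * np.eye(x.shape[1])  # regularized
    return mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1]

def patch_log_likelihoods(img, mu, cov_inv, logdet):
    x = extract_patches(img) - mu
    d = x.shape[1]
    maha = np.einsum("ij,jk,ik->i", x, cov_inv, x)   # squared Mahalanobis per patch
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

# Example: score a generated image against a small reference set.
refs = [np.random.rand(64, 64) for _ in range(10)]
mu, cov_inv, logdet = fit_gaussian(refs)
scores = patch_log_likelihoods(np.random.rand(64, 64), mu, cov_inv, logdet)
print(scores.mean())   # image-level score; per-patch scores localize failures
```

Averaging the per-patch scores gives an image-level number, while the individual scores indicate which regions depress the likelihood.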
- ALL-E: Aesthetics-guided Low-light Image Enhancement [45.40896781156727]
We propose a new paradigm, i.e. aesthetics-guided low-light image enhancement (ALL-E).
It introduces aesthetic preferences into low-light enhancement (LLE) and drives training with an aesthetic reward in a reinforcement learning framework.
Our results on various benchmarks demonstrate the superiority of ALL-E over state-of-the-art methods; a minimal sketch of the reinforcement-learning setup follows this entry.
arXiv Detail & Related papers (2023-04-28T03:34:10Z)
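ALL-E's key move is treating enhancement as a policy trained with an aesthetic reward. The REINFORCE sketch below is a toy stand-in under strong assumptions: a one-parameter gamma "enhancer" and a hand-written brightness/contrast reward replace the paper's enhancer and aesthetic model.

```python
# Toy REINFORCE loop: a policy picks a gamma correction, an "aesthetic"
# reward scores the result, and the policy gradient reinforces good choices.
import torch

policy = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def aesthetic_reward(img: torch.Tensor) -> torch.Tensor:
    # Toy reward: prefer mid-range brightness plus a little contrast.
    return -(img.mean() - 0.5) ** 2 + 0.1 * img.std()

for step in range(200):
    img = 0.1 * torch.rand(64, 64)                 # synthetic low-light input
    state = img.mean().reshape(1, 1)               # policy observes mean brightness
    mu, raw_sigma = policy(state)[0]
    dist = torch.distributions.Normal(mu, torch.nn.functional.softplus(raw_sigma) + 1e-3)
    action = dist.sample()                         # stochastic action, no gradient
    gamma = torch.sigmoid(action) + 0.2            # gamma in (0.2, 1.2); gamma < 1 brightens
    enhanced = img.clamp(min=1e-6) ** gamma
    reward = aesthetic_reward(enhanced).detach()
    loss = -dist.log_prob(action) * reward         # REINFORCE estimator
    opt.zero_grad()
    loss.backward()
    opt.step()
```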
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.