Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance
- URL: http://arxiv.org/abs/2511.10055v1
- Date: Fri, 14 Nov 2025 01:29:19 GMT
- Title: Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance
- Authors: Zhiyuan Hu, Zheng Sun, Yi Wei, Long Yu,
- Abstract summary: We study the performance of image screening with Multimodal Large Language Models (MLLMs)<n>For data, we collect a comprehensive image screening dataset with over 128k samples, about 640k images.<n>The dataset evaluates the image aesthetic reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension.<n>Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning.
- Score: 17.319552703367567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples, about 640k images. Each sample consists of an original image, four generated images. The dataset evaluates the image aesthetic reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality chains of thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, called HCM-GRPO. This enhanced method demonstrates superior image aesthetic reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the HCM-GRPO, we are able to surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.
Related papers
- More Images, More Problems? A Controlled Analysis of VLM Failure Modes [80.64323947730905]
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored.<n>We introduce MIMIC, a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs.
arXiv Detail & Related papers (2026-01-12T18:45:13Z) - Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs [20.222987035141646]
The study of image screening is rare and its performance with MLLMs is unsatisfactory due to the lack of data.<n>In this work, we propose a complete solution to address these problems in terms of data and methodology.
arXiv Detail & Related papers (2025-05-29T09:14:16Z) - Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents [62.616106562146776]
We propose a textbfVisual-Centric textbfSelection approach via textbfAgents Collaboration (ViSA)<n>Our approach consists of 1) an image information quantification method via visual agents collaboration to select images with rich visual information, and 2) a visual-centric instruction quality assessment method to select high-quality instruction data related to high-quality images.
arXiv Detail & Related papers (2025-02-27T09:37:30Z) - Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [86.69947123512836]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks.<n>We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation.<n>We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z) - Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z) - Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark [63.296342841358815]
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images.<n>The ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering.<n>We introduce MIRAGE, an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU.
arXiv Detail & Related papers (2024-07-18T17:59:30Z) - GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization [21.846935203845728]
We build a local manipulation data generation pipeline that integrates the powerful capabilities of SAM, LLM, and generative models.<n>We propose the GIM dataset, which has the following advantages: 1) Large scale, GIM includes over one million pairs of AI-manipulated images and real images.
arXiv Detail & Related papers (2024-06-24T11:10:41Z) - DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception [66.88792390480343]
We propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder.<n>DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
arXiv Detail & Related papers (2024-05-24T05:46:04Z) - On quantifying and improving realism of images generated with diffusion [50.37578424163951]
We propose a metric, called Image Realism Score (IRS), computed from five statistical measures of a given image.
IRS is easily usable as a measure to classify a given image as real or fake.
We experimentally establish the model- and data-agnostic nature of the proposed IRS by successfully detecting fake images generated by Stable Diffusion Model (SDM), Dalle2, Midjourney and BigGAN.
Our efforts have also led to Gen-100 dataset, which provides 1,000 samples for 100 classes generated by four high-quality models.
arXiv Detail & Related papers (2023-09-26T08:32:55Z) - CHIMLE: Conditional Hierarchical IMLE for Multimodal Conditional Image
Synthesis [5.7789164588489035]
A persistent challenge in conditional image synthesis has been to generate diverse output images from the same input image.
We leverage Implicit Conditional Likelihood Estimation Maximum (IMLE) which can overcome mode collapse.
To generate high-fidelity images, prior IMLE-based methods require a large number of samples, which is expensive.
We show CHIMLE significantly outperforms the prior best IMLE, GAN and diffusion-based methods in terms of image fidelity and mode coverage.
arXiv Detail & Related papers (2022-11-25T18:41:44Z) - Optimizing Hierarchical Image VAEs for Sample Quality [0.0]
hierarchical variational autoencoders (VAEs) have achieved great density estimation on image modeling tasks.
We attribute this to learned representations that over-emphasize compressing imperceptible details of the image.
We introduce a KL-reweighting strategy to control the amount of infor mation in each latent group, and employ a Gaussian output layer to reduce sharpness in the learning objective.
arXiv Detail & Related papers (2022-10-18T23:10:58Z) - Perceptual Image Restoration with High-Quality Priori and Degradation
Learning [28.93489249639681]
We show that our model performs well in measuring the similarity between restored and degraded images.
Our simultaneous restoration and enhancement framework generalizes well to real-world complicated degradation types.
arXiv Detail & Related papers (2021-03-04T13:19:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.