Fine-Tuning a Large Vision-Language Model for Artwork's Scoring and Critique
- URL: http://arxiv.org/abs/2602.13306v1
- Date: Mon, 09 Feb 2026 19:52:16 GMT
- Title: Fine-Tuning a Large Vision-Language Model for Artwork's Scoring and Critique
- Authors: Zhehan Zhang, Meihua Qian, Li Luo, Siyu Huang, Chaoyi Zhou, Ripon Saha, Xinxin Song
- Abstract summary: We propose a framework for automated creativity assessment of human paintings by fine-tuning the vision-language model Qwen2-VL-7B with multi-task learning. Our dataset contains 1,000 human-created paintings scored on a 1-100 scale and paired with short human-written descriptions. Experiments show strong accuracy, achieving Pearson r > 0.97 and an MAE of about 3.95 on the 100-point scale.
- Score: 11.787232686718367
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Assessing artistic creativity is foundational to creativity research and arts education, yet manual scoring (e.g., Torrance Tests of Creative Thinking) is labor-intensive at scale. Prior machine-learning approaches show promise for visual creativity scoring, but many rely mainly on image features and provide limited or no explanatory feedback. We propose a framework for automated creativity assessment of human paintings by fine-tuning the vision-language model Qwen2-VL-7B with multi-task learning. Our dataset contains 1,000 human-created paintings, each scored on a 1-100 scale and paired with a short human-written description (content or artist explanation). Two expert raters evaluated each work using a five-dimension rubric (originality, color, texture, composition, content) and provided written critiques; we use an 80/20 train-test split. We add a lightweight regression head on the visual encoder output so the model can predict a numerical score and generate rubric-aligned feedback in a single forward pass. By embedding the structured rubric and the artwork description in the system prompt, we constrain the generated text to match the quantitative prediction. Experiments show strong accuracy: Pearson r > 0.97 and an MAE of about 3.95 on the 100-point scale. Qualitative evaluation indicates the generated feedback is semantically close to expert critiques (average SBERT cosine similarity = 0.798). The proposed approach bridges computer vision and art assessment and offers a scalable tool for creativity research and classroom feedback.
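The paper does not publish code; the following PyTorch sketch only illustrates the described setup under stated assumptions. The regression head is pooled over the backbone's last hidden states as a stand-in for "visual encoder output", and all names (`score_head`, `alpha`) and the MSE-plus-language-modeling joint loss are assumptions, not details from the paper.

```python
# Minimal sketch of the multi-task setup described in the abstract: a
# lightweight regression head on pooled backbone features, trained jointly
# with the language-modeling (critique) loss. Illustrative only.
import torch
import torch.nn as nn

class ScoredVLM(nn.Module):
    def __init__(self, vlm, hidden_dim: int):
        super().__init__()
        self.vlm = vlm                      # e.g. a Qwen2-VL-7B backbone
        self.score_head = nn.Sequential(    # assumed lightweight head
            nn.Linear(hidden_dim, 256),
            nn.GELU(),
            nn.Linear(256, 1),
        )

    def forward(self, pixel_values, input_ids, attention_mask, labels=None):
        out = self.vlm(input_ids=input_ids,
                       attention_mask=attention_mask,
                       pixel_values=pixel_values,
                       labels=labels,
                       output_hidden_states=True)
        # Mean-pool the last hidden states as a proxy for the encoder output.
        pooled = out.hidden_states[-1].mean(dim=1)
        # Map the raw regression output onto the paper's 1-100 scale.
        score = 1 + 99 * torch.sigmoid(self.score_head(pooled)).squeeze(-1)
        return score, out.loss              # out.loss: critique-generation loss

def multitask_loss(score_pred, score_true, lm_loss, alpha: float = 0.5):
    # Assumed joint objective: weighted score regression + text generation.
    return alpha * nn.functional.mse_loss(score_pred, score_true) + (1 - alpha) * lm_loss
```

The reported metrics are standard and could be reproduced as sketched below with scipy and sentence-transformers; the `all-MiniLM-L6-v2` checkpoint is an assumption, since the abstract does not name the SBERT variant used.

```python
# Sketch of the evaluation protocol: Pearson r and MAE on predicted scores,
# plus average SBERT cosine similarity between generated and expert critiques.
import numpy as np
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer, util

def score_metrics(pred, true):
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    r, _ = pearsonr(pred, true)             # Pearson correlation
    mae = np.abs(pred - true).mean()        # mean absolute error
    return r, mae

def critique_similarity(generated, expert):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
    a = model.encode(generated, convert_to_tensor=True)
    b = model.encode(expert, convert_to_tensor=True)
    # Cosine similarity of each matched critique pair, averaged.
    return util.cos_sim(a, b).diagonal().mean().item()
```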
Related papers
- KidsArtBench: Multi-Dimensional Children's Art Evaluation with Attribute-Aware MLLMs [13.1845557800464]
We introduce KidsArtBench, a new benchmark of over 1k children's artworks (ages 5-15) annotated by 12 expert educators across 9 rubric-aligned dimensions. KidsArtBench targets children's artwork and pairs multi-dimensional annotations with comment supervision to enable both ordinal assessment and formative feedback.
arXiv Detail & Related papers (2025-12-14T00:24:48Z) - Simple Lines, Big Ideas: Towards Interpretable Assessment of Human Creativity from Drawings [18.09092203643732]
We propose a data-driven framework for automatic and interpretable creativity assessment from drawings. Motivated by the cognitive evidence proposed in [6] that creativity can emerge from both what is drawn (content) and how it is drawn (style), we reinterpret the creativity score as a function of these two complementary dimensions.
arXiv Detail & Related papers (2025-11-17T02:16:01Z) - Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment [4.334576480811837]
We propose a novel curiosity-driven LLM-as-a-judge for evaluating creative writing. Our method is especially useful in subjective evaluations where annotators do not all agree with one another.
arXiv Detail & Related papers (2025-10-01T04:29:36Z) - TraitSpaces: Towards Interpretable Visual Creativity for Human-AI Co-Creation [0.0]
Drawing on interviews with practicing artists and theories from psychology, we define 12 traits that capture affective, symbolic, cultural, and ethical dimensions of creativity. Traits such as Environmental Dialogicity and Redemptive Arc are predicted with high reliability. By linking cultural-aesthetic insights with computational modeling, our work aims not to reduce creativity to numbers, but to offer shared language and interpretable tools for artists, researchers, and AI systems to collaborate meaningfully.
arXiv Detail & Related papers (2025-09-29T06:24:18Z) - Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias [52.590072198551944]
The aim of image personalization is to create images based on a user-provided subject. Current methods face challenges in ensuring fidelity to the text prompt. We introduce a novel training pipeline that incorporates an attractor to filter out distractions in training images.
arXiv Detail & Related papers (2025-03-09T14:14:02Z) - APDDv2: Aesthetics of Paintings and Drawings Dataset with Artist Labeled Scores and Comments [45.57709215036539]
We introduce the Aesthetics Paintings and Drawings dataset (APDD), the first comprehensive collection of paintings encompassing 24 distinct artistic categories and 10 aesthetic attributes.
APDDv2 boasts an expanded image corpus and improved annotation quality, featuring detailed language comments.
We also present ArtCLIP, an updated version of the Art Assessment Network for Specific Painting Styles. Experimental validation shows the revised model surpasses its predecessor in aesthetic-evaluation accuracy and efficacy.
arXiv Detail & Related papers (2024-11-13T11:46:42Z) - Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on the LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z) - VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining [53.470662123170555]
We propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations.
Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels.
Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset.
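As a rough illustration of the two objectives this entry names, here is a sketch of a symmetric contrastive (InfoNCE) loss over image/comment embeddings combined with a generative captioning loss; the tensor shapes, temperature, and weighting `lam` are assumptions, not VILA's published values.

```python
# Illustrative combination of contrastive and generative pretraining losses.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature: float = 0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric InfoNCE: match each image to its comment and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def pretrain_loss(img_emb, txt_emb, caption_logits, caption_ids, lam: float = 1.0):
    # Generative objective: cross-entropy over predicted comment tokens.
    gen = F.cross_entropy(caption_logits.transpose(1, 2), caption_ids)
    return contrastive_loss(img_emb, txt_emb) + lam * gen
```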
arXiv Detail & Related papers (2023-03-24T23:57:28Z) - Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
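A minimal sketch of the antonym-prompt idea behind this zero-shot assessment, using the Hugging Face CLIP API; the prompt pair and checkpoint here are illustrative, not necessarily the paper's exact choices.

```python
# Score an image by how much closer its CLIP embedding is to a positive
# prompt than to its antonym, softmaxed into a 0-1 quality score.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def zero_shot_quality(image: Image.Image, pos="Good photo.", neg="Bad photo."):
    inputs = processor(text=[pos, neg], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image      # (1, 2) similarity logits
    # Softmax over the antonym pair; higher means closer to the positive prompt.
    return logits.softmax(dim=-1)[0, 0].item()
```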
arXiv Detail & Related papers (2022-07-25T17:58:16Z) - Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions [138.49522643425334]
Bongard-HOI is a new visual reasoning benchmark that focuses on compositional learning of human-object interactions from natural images.
It is inspired by two desirable characteristics from the classical Bongard problems (BPs): 1) few-shot concept learning, and 2) context-dependent reasoning.
Bongard-HOI presents a substantial challenge to today's visual recognition models.
arXiv Detail & Related papers (2022-05-27T07:36:29Z) - DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models [73.12069620086311]
We investigate the visual reasoning capabilities and social biases of text-to-image models.
First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding.
Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images.
arXiv Detail & Related papers (2022-02-08T18:36:52Z) - ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback [54.142719510638614]
In this paper, we frame the problem of providing feedback as few-shot classification.
From just a few instructor-provided examples, a meta-learner adapts to give feedback on student code for a new programming question.
Our approach was successfully deployed to deliver feedback on 16,000 student exam solutions in a programming course offered by a tier 1 university.
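A minimal sketch of prototype-based few-shot classification, the mechanism underlying this framing of feedback as few-shot classification; the `encoder` is a placeholder for the paper's task-adapted transformer, and all names are illustrative.

```python
# Embed a few instructor-labeled examples per feedback class, average them
# into class prototypes, and label new solutions by the nearest prototype.
import torch

def predict_feedback(encoder, support_x, support_y, query_x, num_classes: int):
    with torch.no_grad():
        s = encoder(support_x)              # (N, D) support embeddings
        q = encoder(query_x)                # (M, D) query embeddings
    # Class prototype = mean embedding of that class's support examples.
    protos = torch.stack([s[support_y == c].mean(0) for c in range(num_classes)])
    dists = torch.cdist(q, protos)          # (M, num_classes) distances
    return dists.argmin(dim=1)              # nearest-prototype feedback label
```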
arXiv Detail & Related papers (2021-07-23T22:41:28Z)