KidsArtBench: Multi-Dimensional Children's Art Evaluation with Attribute-Aware MLLMs
- URL: http://arxiv.org/abs/2512.12503v1
- Date: Sun, 14 Dec 2025 00:24:48 GMT
- Title: KidsArtBench: Multi-Dimensional Children's Art Evaluation with Attribute-Aware MLLMs
- Authors: Mingrui Ye, Chanjin Zheng, Zengyi Yu, Chenyu Xiang, Zhixue Zhao, Zheng Yuan, Helen Yannakoudakis
- Abstract summary: We introduce KidsArtBench, a new benchmark of over 1k children's artworks (ages 5-15) annotated by 12 expert educators across 9 rubric-aligned dimensions. KidsArtBench targets children's artwork and pairs multi-dimensional annotations with comment supervision to enable both ordinal assessment and formative feedback.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) show remarkable progress across many visual-language tasks; however, their capacity to evaluate artistic expression remains limited. Aesthetic concepts are inherently abstract and open-ended, and multimodal artwork annotations are scarce. We introduce KidsArtBench, a new benchmark of over 1k children's artworks (ages 5-15) annotated by 12 expert educators across 9 rubric-aligned dimensions, together with expert comments for feedback. Unlike prior aesthetic datasets that provide single scalar scores on adult imagery, KidsArtBench targets children's artwork and pairs multi-dimensional annotations with comment supervision to enable both ordinal assessment and formative feedback. Building on this resource, we propose an attribute-specific multi-LoRA approach, where each attribute corresponds to a distinct evaluation dimension (e.g., Realism, Imagination) in the scoring rubric, with Regression-Aware Fine-Tuning (RAFT) to align predictions with ordinal scales. On Qwen2.5-VL-7B, our method increases correlation from 0.468 to 0.653, with the largest gains on perceptual dimensions and narrowed gaps on higher-order attributes. These results show that educator-aligned supervision and attribute-aware training yield pedagogically meaningful evaluations and establish a rigorous testbed for sustained progress in educational AI. We release data and code with ethics documentation.
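The abstract describes the method only at a high level. As a rough illustration (not the authors' code), the two ingredients can be sketched as follows: a separate low-rank (LoRA-style) update per rubric attribute, and a regression-aware loss that turns a distribution over discrete score levels into an expected score before comparing it with the educator's ordinal label. All names, shapes, and the 1-5 score scale below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4  # hidden size and LoRA rank (illustrative values)

# One low-rank adapter (A, B) per rubric attribute, e.g. Realism, Imagination.
# B starts at zero, so each adapter initially leaves the base model unchanged.
attributes = ["Realism", "Imagination"]
adapters = {a: (rng.normal(size=(r, d)) * 0.01, np.zeros((d, r))) for a in attributes}
W = rng.normal(size=(d, d)) * 0.1  # shared frozen base weight

def attribute_forward(x, attribute):
    """Apply the base weight plus the attribute-specific low-rank update."""
    A, B = adapters[attribute]
    return x @ (W + B @ A).T

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def regression_aware_loss(level_logits, target_score, levels=np.arange(1, 6)):
    """Probability-weighted expected score over discrete levels, trained with
    a squared error against the ordinal label - the spirit of regression-aware
    fine-tuning on an ordinal scale."""
    p = softmax(level_logits)
    expected = float(p @ levels)
    return (expected - target_score) ** 2, expected

x = rng.normal(size=d)
h = attribute_forward(x, "Realism")  # hidden state routed through one adapter
loss, pred = regression_aware_loss(h[:5], target_score=4.0)
```

The actual system fine-tunes Qwen2.5-VL-7B; this sketch only shows the shape of the per-attribute routing and the ordinal-aware objective.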
Related papers
- Fine-Tuning a Large Vision-Language Model for Artwork's Scoring and Critique [11.787232686718367]
We propose a framework for automated creativity assessment of human paintings by fine-tuning the vision-language model Qwen2-VL-7B with multi-task learning. Our dataset contains 1000 human-created paintings scored on a 1-100 scale and paired with a short human-written description. Experiments show strong accuracy, achieving Pearson r > 0.97 and an MAE of about 3.95 on the 100-point scale.
arXiv Detail & Related papers (2026-02-09T19:52:16Z) - KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old? [79.27736230305516]
We introduce KidVis, a novel benchmark grounded in the theory of human visual development. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity. This study confirms that current MLLMs, despite their reasoning prowess, lack the essential physiological perceptual primitives required for generalized visual intelligence.
arXiv Detail & Related papers (2026-01-13T07:32:50Z) - Simple Lines, Big Ideas: Towards Interpretable Assessment of Human Creativity from Drawings [18.09092203643732]
We propose a data-driven framework for automatic and interpretable creativity assessment from drawings. Motivated by the cognitive evidence proposed in [6] that creativity can emerge from both what is drawn (content) and how it is drawn (style), we reinterpret the creativity score as a function of these two complementary dimensions.
arXiv Detail & Related papers (2025-11-17T02:16:01Z) - TraitSpaces: Towards Interpretable Visual Creativity for Human-AI Co-Creation [0.0]
Drawing on interviews with practicing artists and theories from psychology, we define 12 traits that capture affective, symbolic, cultural, and ethical dimensions of creativity. Traits such as Environmental Dialogicity and Redemptive Arc are predicted with high reliability. By linking cultural-aesthetic insights with computational modeling, our work aims not to reduce creativity to numbers, but to offer a shared language and interpretable tools for artists, researchers, and AI systems to collaborate meaningfully.
arXiv Detail & Related papers (2025-09-29T06:24:18Z) - ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding [32.55711618391249]
ArtiMuse is an innovative MLLM-based IAA model with Joint Scoring and Expert-Level Understanding capabilities. ArtiMuse-10K is the first expert-curated image aesthetic dataset, comprising 10,000 images spanning 5 main categories and 15 subcategories.
arXiv Detail & Related papers (2025-07-19T08:27:21Z) - Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art [61.28133495240179]
We propose a novel task of aesthetics alignment which seeks to align user-specified aesthetics with the T2I generation output. Inspired by how artworks provide an invaluable perspective to approach aesthetics, we codify visual aesthetics using the compositional framework artists employ. We demonstrate that T2I DMs can effectively offer 10 compositional controls through user-specified PoA conditions.
arXiv Detail & Related papers (2025-03-15T06:58:09Z) - Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark [53.61633384281524]
PolyMATH is a benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs.
The best scores achieved on PolyMATH are 41%, 36%, and 27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively.
A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning.
arXiv Detail & Related papers (2024-10-06T20:35:41Z) - AACP: Aesthetics assessment of children's paintings based on self-supervised learning [17.672268781368672]
The Aesthetics Assessment of Children's Paintings (AACP) is an important branch of image aesthetics assessment (IAA).
Previous approaches have relied on training on large datasets and assigning a single aesthetics score to each image.
We construct an aesthetics assessment dataset of children's paintings and a model based on self-supervised learning.
arXiv Detail & Related papers (2024-03-12T12:07:00Z) - AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception [64.25808552299905]
AesBench is an expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs.
We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diversified image contents and high-quality annotations provided by professional aesthetic experts.
We propose a set of integrative criteria to measure the aesthetic perception abilities of MLLMs from four perspectives, including Perception (AesP), Empathy (AesE), Assessment (AesA) and Interpretation (AesI).
arXiv Detail & Related papers (2024-01-16T10:58:07Z) - Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels [95.44077384918725]
We propose to teach large multi-modality models (LMMs) with text-defined rating levels instead of scores.
The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA) and video quality assessment (VQA) tasks.
arXiv Detail & Related papers (2023-12-28T16:10:25Z) - ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter [19.830089364830066]
ArtGPT-4 is a large vision-language model tailored to address the limitations of existing models in artistic comprehension.
It can describe images with artistic understanding and convey the emotions they inspire, mirroring human interpretation.
arXiv Detail & Related papers (2023-05-12T14:04:30Z) - VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining [53.470662123170555]
We propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations.
Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels.
Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset.
arXiv Detail & Related papers (2023-03-24T23:57:28Z)
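Several of the entries above, Q-Align in particular, score images by reading off a distribution over discrete text-defined rating levels rather than regressing a number directly. A minimal sketch of that recipe follows; the level names, numeric anchors, and logits are illustrative assumptions, not values from any of the papers.

```python
import numpy as np

# Text-defined rating levels and the numeric anchors assigned to them
# (the mapping here is made up for illustration).
levels = ["bad", "poor", "fair", "good", "excellent"]
anchors = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

def level_score(level_logits):
    """Convert logits over the level tokens into a scalar quality score
    via the probability-weighted average of the level anchors."""
    z = level_logits - level_logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(p @ anchors)

# A model that puts most mass on "good" yields a score near 4.
logits = np.array([-2.0, -1.0, 0.0, 3.0, 0.5])
score = level_score(logits)
```

Because the score is an expectation over levels, it varies smoothly with the model's confidence instead of jumping between discrete labels.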
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.