Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment
- URL: http://arxiv.org/abs/2512.23413v2
- Date: Mon, 05 Jan 2026 02:31:47 GMT
- Title: Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment
- Authors: Henglin Liu, Nisha Huang, Chang Liu, Jiangpeng Yan, Huijuan Huang, Jixuan Ying, Tong-Yee Lee, Pengfei Wan, Xiangyang Ji,
- Abstract summary: The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. We propose ArtQuant, an aesthetics assessment framework for artistic images which couples isolated aesthetic dimensions through description generation. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs.
- Score: 51.40989269202702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: existing datasets overly focus on visual perception and neglect deeper dimensions due to expensive manual annotation; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoders, while multimodal methods represented by contrastive learning struggle to effectively process long-form textual descriptions. To resolve challenge (1), we first present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset, generated via an iterative pipeline without heavy annotation costs and easy to scale. To address challenge (2), we propose ArtQuant, an aesthetics assessment framework for artistic images which not only couples isolated aesthetic dimensions through joint description generation, but also better models long-text semantics with the help of LLM decoders. Furthermore, theoretical analysis confirms this symbiosis: RAD's semantic adequacy (data) and generation paradigm (model) collectively minimize prediction entropy, providing mathematical grounding for the framework. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. We will release both code and dataset to support future research.
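The abstract describes a pipeline in which an encoder's visual features are turned into a joint multi-dimensional aesthetic description, from which a score is predicted. The toy sketch below illustrates that flow only at the level of this summary; all function names and the scoring rule are illustrative stand-ins, not the paper's actual ArtQuant components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the components the abstract names: a vision
# encoder, a decoder that emits one joint multi-dimensional description
# (coupling the aesthetic dimensions), and a head mapping it to a score.

def vision_encoder(image):
    """Stub: pool an (H, W, 3) image into a short feature sequence."""
    return image.reshape(-1, 3).mean(axis=1)[:16]

def describe(features, dimensions):
    """Stub: one description fragment per aesthetic dimension, generated
    jointly so the dimensions share context."""
    return {d: f"{d}: level {float(f):.2f}" for d, f in zip(dimensions, features)}

def score_from_description(desc):
    """Stub regression head: average the per-dimension levels into [1, 10]."""
    levels = [float(s.split("level ")[1]) for s in desc.values()]
    return 1.0 + 9.0 * float(np.clip(np.mean(levels), 0.0, 1.0))

image = rng.random((32, 32, 3))
dims = ["composition", "color", "emotion", "technique"]
description = describe(vision_encoder(image), dims)
score = score_from_description(description)
```

In the actual framework an LLM decoder generates the description text; here the generation step is reduced to a template so the sketch stays self-contained.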
Related papers
- Through the PRISm: Importance-Aware Scene Graphs for Image Retrieval [6.804414686833417]
PRISm is a multimodal framework that advances image-to-image retrieval through two novel components. The Importance Prediction Module identifies and retains the most critical objects and relational triplets within an image. The Edge-Aware Graph Neural Network explicitly encodes relational structure and integrates global visual features to produce semantically informed image embeddings.
arXiv Detail & Related papers (2025-12-20T15:57:46Z) - Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval [15.126709823382539]
This work advances Contrastive Language-Image Pre-training (CLIP) for person representation learning. We develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs. We introduce the GA-DMS framework, which improves cross-modal alignment by adaptively masking noisy textual tokens.
arXiv Detail & Related papers (2025-09-11T03:06:22Z) - "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
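The "PCA-like structure" mentioned above implies an ordering where the first tokens carry the most variance and later ones add finer detail. The toy sketch below is not the paper's tokenizer; it only demonstrates that ordering property with plain PCA on flattened patches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration (not the paper's method): project flattened image
# patches onto their principal components, so "token 0" carries the most
# variance and later tokens contribute progressively finer detail.
patches = rng.random((256, 48))            # 256 patches, 48-dim each
centered = patches - patches.mean(axis=0)

# SVD of the centered data yields principal directions in vt.
_, s, vt = np.linalg.svd(centered, full_matrices=False)
tokens = centered @ vt.T                   # variance-ordered "token" coords

variance = s**2 / (len(patches) - 1)       # per-component explained variance
```

Because singular values are returned in descending order, the explained variance per token is non-increasing, which is the coarse-to-fine property the summary alludes to.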
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning [14.405750888492735]
Image Aesthetic Assessment (IAA) is a vital and intricate task that entails analyzing and assessing an image's aesthetic values. Traditional methods of IAA often concentrate on a single aesthetic task and suffer from inadequate labeled datasets. We propose a comprehensive aesthetic MLLM capable of nuanced aesthetic insight.
arXiv Detail & Related papers (2024-12-16T16:35:35Z) - Retrieval-guided Cross-view Image Synthesis [3.7477511412024573]
Cross-view image synthesis presents significant challenges in establishing reliable correspondences. We propose a retrieval-guided framework that reimagines how retrieval techniques can facilitate effective cross-view image synthesis. Our work bridges information retrieval and synthesis tasks, offering insights into how retrieval techniques can address complex cross-domain synthesis challenges.
arXiv Detail & Related papers (2024-11-29T07:04:44Z) - Deep ContourFlow: Advancing Active Contours with Deep Learning [3.9948520633731026]
We present a framework for both unsupervised and one-shot approaches for image segmentation.
It is capable of capturing complex object boundaries without the need for extensive labeled training data.
This is particularly required in histology, a field facing a significant shortage of annotations.
arXiv Detail & Related papers (2024-07-15T13:12:34Z) - Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning [35.47078178526536]
Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension.
This paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process.
arXiv Detail & Related papers (2023-11-02T06:21:35Z) - Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning.
It aims to extract both the common information and the complementary information in an adversarial setting.
In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
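The "cross reconstruction loss" named above is only sketched in this summary; one common reading is that each view must be reconstructed from the other view's latent code, forcing the latent space to capture shared information. The toy code below illustrates that reading with linear maps; it is our interpretation, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative cross-reconstruction loss (our reading of the idea, not the
# paper's code): each view is rebuilt from the OTHER view's latent code,
# so the shared latent must carry the common information.
def cross_reconstruction_loss(x_a, x_b, enc_a, enc_b, dec_a, dec_b):
    z_a, z_b = enc_a(x_a), enc_b(x_b)
    rec_a = dec_a(z_b)        # rebuild view A from view B's latent
    rec_b = dec_b(z_a)        # rebuild view B from view A's latent
    return np.mean((x_a - rec_a) ** 2) + np.mean((x_b - rec_b) ** 2)

# Toy linear encoders/decoders for two 8-dim views with a 4-dim latent.
W = rng.standard_normal((8, 4)) * 0.1
enc = lambda x: x @ W
dec = lambda z: z @ W.T
x_a = rng.standard_normal((16, 8))
x_b = rng.standard_normal((16, 8))
loss = cross_reconstruction_loss(x_a, x_b, enc, enc, dec, dec)
```

The adversarial component and label guidance mentioned in the summary are omitted here for brevity.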
arXiv Detail & Related papers (2021-02-15T18:46:44Z) - Deep Partial Multi-View Learning [94.39367390062831]
We propose a novel framework termed Cross Partial Multi-View Networks (CPM-Nets).
We first provide a formal definition of completeness and versatility for multi-view representation.
We then theoretically prove the versatility of the learned latent representations.
arXiv Detail & Related papers (2020-11-12T02:29:29Z) - Learning Deformable Image Registration from Optimization: Perspective, Modules, Bilevel Training and Beyond [62.730497582218284]
We develop a new deep learning based framework to optimize a diffeomorphic model via multi-scale propagation.
We conduct two groups of image registration experiments on 3D volume datasets including image-to-atlas registration on brain MRI data and image-to-image registration on liver CT data.
arXiv Detail & Related papers (2020-04-30T03:23:45Z) - Two-shot Spatially-varying BRDF and Shape Estimation [89.29020624201708]
We propose a novel deep learning architecture with a stage-wise estimation of shape and SVBRDF.
We create a large-scale synthetic training dataset with domain-randomized geometry and realistic materials.
Experiments on both synthetic and real-world datasets show that our network trained on a synthetic dataset can generalize well to real-world images.
arXiv Detail & Related papers (2020-04-01T12:56:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.