Modeling Art Evaluations from Comparative Judgments: A Deep Learning Approach to Predicting Aesthetic Preferences
- URL: http://arxiv.org/abs/2602.00394v1
- Date: Fri, 30 Jan 2026 23:13:06 GMT
- Title: Modeling Art Evaluations from Comparative Judgments: A Deep Learning Approach to Predicting Aesthetic Preferences
- Authors: Manoj Reddy Bethi, Sai Rupa Jhade, Pravallika Yaganti, Monoshiz Mahbub Khan, Zhe Yu
- Abstract summary: The Law of Comparative Judgment posits that relative choices impose less cognitive burden and yield greater consistency than direct scoring. We develop a deep neural network regression model and a dual-branch pairwise comparison model. Human subject experiments reveal that comparative judgments require $60\%$ less annotation time per item.
- Score: 1.839031891198526
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modeling human aesthetic judgments in visual art presents significant challenges due to individual preference variability and the high cost of obtaining labeled data. To reduce the cost of acquiring such labels, we propose to apply a comparative learning framework based on pairwise preference assessments rather than direct ratings. This approach leverages the Law of Comparative Judgment, which posits that relative choices impose less cognitive burden and yield greater consistency than direct scoring. We extract deep convolutional features from painting images using ResNet-50 and develop both a deep neural network regression model and a dual-branch pairwise comparison model. We explore four research questions: (RQ1) How does the proposed deep neural network regression model with CNN features compare to the baseline linear regression model using hand-crafted features? (RQ2) How does pairwise comparative learning compare to regression-based prediction when lacking access to direct rating values? (RQ3) Can we predict individual rater preferences through within-rater and cross-rater analysis? (RQ4) What is the annotation cost trade-off between direct ratings and comparative judgments in terms of human time and effort? Our results show that the deep regression model substantially outperforms the baseline, achieving up to $328\%$ improvement in $R^2$. The comparative model approaches regression performance despite having no access to direct rating values, validating the practical utility of pairwise comparisons. However, predicting individual preferences remains challenging, with both within-rater and cross-rater performance significantly lower than average-rating prediction. Human subject experiments reveal that comparative judgments require $60\%$ less annotation time per item, demonstrating superior annotation efficiency for large-scale preference modeling.
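The pairwise setup described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: random feature vectors stand in for the ResNet-50 embeddings, and a shared linear scorer replaces the dual-branch network; all names and dimensions are illustrative. The key idea is that the model never sees rating values, only which item of each pair is preferred, and it learns a score whose pairwise differences reproduce those preferences (a Bradley-Terry-style formulation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for ResNet-50 features: 8-dim vectors whose
# latent "aesthetic score" is a fixed linear function (assumption).
dim = 8
w_true = rng.normal(size=dim)
X = rng.normal(size=(200, dim))
latent = X @ w_true

# Pairwise labels only: y = 1 if the first item of a pair is preferred.
idx_a = rng.integers(0, 200, size=1000)
idx_b = rng.integers(0, 200, size=1000)
y = (latent[idx_a] > latent[idx_b]).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Shared scorer applied to both branches; the preference probability
# depends only on the score difference, so gradient descent on the
# logistic loss over difference vectors trains it.
w = np.zeros(dim)
lr = 0.1
for _ in range(500):
    diff = (X[idx_a] - X[idx_b]) @ w
    p = sigmoid(diff)
    grad = (X[idx_a] - X[idx_b]).T @ (p - y) / len(y)
    w -= lr * grad

pred = ((X[idx_a] - X[idx_b]) @ w > 0).astype(float)
accuracy = (pred == y).mean()
```

On this synthetic data the learned scorer recovers the preference ordering with high pairwise accuracy, even though no absolute rating values were ever observed, which is the property RQ2 tests at full scale.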
Related papers
- Modeling Image-Caption Rating from Comparative Judgments [8.460083530922931]
We propose a machine learning framework that models such comparative judgments instead of direct ratings.
The model can then be applied to rank unseen image-caption pairs in the same way as a regression model trained on direct ratings.
arXiv Detail & Related papers (2026-01-30T23:00:07Z)
- How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation [0.38991526486631006]
We show that when the preference signal is diffuse across prompts, proportional allocation is minimax-optimal.
Our results show that inconclusive or negative human evaluation outcomes frequently reflect underpowered evaluation rather than model equivalence.
arXiv Detail & Related papers (2026-01-14T02:34:58Z)
- Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models [12.445845925904466]
Language models serve as proxies for human preference judgements in alignment and evaluation.
They exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities.
This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations.
arXiv Detail & Related papers (2025-06-05T17:59:32Z)
- RewardBench 2: Advancing Reward Model Evaluation [71.65938693914153]
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data.
The community has begun establishing best practices for evaluating reward models.
This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark.
arXiv Detail & Related papers (2025-06-02T17:54:04Z)
- Learning Guarantee of Reward Modeling Using Deep Neural Networks [1.1088875073103415]
We study the learning theory of reward modeling with pairwise comparison data using deep neural networks.
We establish a novel non-asymptotic regret bound for deep reward estimators in a non-parametric setting.
arXiv Detail & Related papers (2025-05-10T11:21:29Z)
- Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark [53.876493664396506]
Benchmarks are crucial for evaluating machine learning algorithm performance, facilitating comparison and identifying superior solutions.
This paper addresses the issue of entity bias in relation extraction tasks, where models tend to rely on entity mentions rather than context.
We propose a debiased relation extraction benchmark DREB that breaks the pseudo-correlation between entity mentions and relation types through entity replacement.
To establish a new baseline on DREB, we introduce MixDebias, a debiasing method combining data-level and model training-level techniques.
arXiv Detail & Related papers (2025-01-02T17:01:06Z)
- Linked shrinkage to improve estimation of interaction effects in regression models [0.0]
We develop an estimator that adapts well to two-way interaction terms in a regression model.
We evaluate the potential of the model for inference, which is notoriously hard for selection strategies.
Our models can be highly competitive with a more advanced machine learner, like random forest, even for fairly large sample sizes.
arXiv Detail & Related papers (2023-09-25T10:03:39Z)
- GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
- Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
- Group-aware Contrastive Regression for Action Quality Assessment [85.43203180953076]
We show that the relations among videos can provide important clues for more accurate action quality assessment.
Our approach outperforms previous methods by a large margin and establishes new state-of-the-art on all three benchmarks.
arXiv Detail & Related papers (2021-08-17T17:59:39Z)
- Learning Expectation of Label Distribution for Facial Age and Attractiveness Estimation [65.5880700862751]
We analyze the essential relationship between two state-of-the-art methods (Ranking-CNN and DLDL) and show that the Ranking method is in fact learning label distribution implicitly.
We propose a lightweight network architecture and a unified framework which can jointly learn the facial attribute distribution and regress the attribute value.
Our method achieves new state-of-the-art results using a single model with 36$\times$ fewer parameters and 3$\times$ faster inference speed on facial age/attractiveness estimation.
arXiv Detail & Related papers (2020-07-03T15:46:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.