Related papers: Position: Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered

URL: http://arxiv.org/abs/2603.00643v1
Date: Sat, 28 Feb 2026 13:24:34 GMT
Title: Position: Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered
Authors: Jinfan Hu, Fanghua Yu, Zhiyuan You, Xiang Yin, Hongyu An, Xinqi Lin, Chao Dong, Jinjin Gu
Abstract summary: This position paper argues that the evaluation of modern visual processing systems should no longer be driven primarily by single-metric image quality assessment benchmarks. Rather than rejecting metrics altogether, this paper calls for a rebalancing of evaluation paradigms, advocating a more human-centered, context-aware, and fine-grained approach to assessing the visual models' outcomes.
Score: 34.408989226550176
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This position paper argues that the evaluation of modern visual processing systems should no longer be driven primarily by single-metric image quality assessment benchmarks, particularly in the era of generative and perception-oriented methods. Image restoration exemplifies this divergence: while objective IQA metrics enable reproducible, scalable evaluation, they have increasingly drifted apart from human perception and user preferences. We contend that this mismatch risks constraining innovation and misguiding research progress across visual processing tasks. Rather than rejecting metrics altogether, this paper calls for a rebalancing of evaluation paradigms, advocating a more human-centered, context-aware, and fine-grained approach to assessing the visual models' outcomes.

Related papers

Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach [0.0]
This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. We show how adjusting for rater severity produces corrected estimates of summary quality. This perspective highlights a path toward more robust, interpretable, and construct-aligned practices for AI development and evaluation.
arXiv Detail & Related papers (2026-02-26T03:35:36Z)
Beyond Static Evaluation: Rethinking the Assessment of Personalized Agent Adaptability in Information Retrieval [12.058221341033835]
We propose a conceptual lens for rethinking evaluation in adaptive personalization. We organize this lens around three core components: (1) persona-based user simulation with temporally evolving preference models; (2) structured elicitation protocols inspired by reference interviews to extract preferences in context; and (3) adaptation-aware evaluation mechanisms that measure how agent behavior improves across sessions and tasks.
arXiv Detail & Related papers (2025-10-05T00:35:37Z)
A Picture is Worth a Thousand Prompts? Efficacy of Iterative Human-Driven Prompt Refinement in Image Regeneration Tasks [1.8563642867160601]
The creation of AI-generated images often involves refining the input prompt iteratively to achieve desired visual outcomes. This study focuses on the relatively underexplored concept of image regeneration using AI. We present a structured user study evaluating how iterative prompt refinement affects the similarity of regenerated images relative to their targets.
arXiv Detail & Related papers (2025-04-29T01:21:16Z)
Towards Automatic Evaluation for Image Transcreation [52.71090829502756]
We propose a suite of automatic evaluation metrics inspired by machine translation (MT) metrics. We identify three critical dimensions of image transcreation: cultural relevance, semantic equivalence and visual similarity. Our results show that proprietary VLMs best identify cultural relevance and semantic equivalence, while vision-encoder representations are adept at measuring visual similarity.
arXiv Detail & Related papers (2024-12-18T10:55:58Z)
Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preference according to the quantitative evaluations across different dimensions. PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
TISE: A Toolbox for Text-to-Image Synthesis Evaluation [9.092600296992925]
We conduct a study on state-of-the-art methods for single- and multi-object text-to-image synthesis. We propose a common framework for evaluating these methods.
arXiv Detail & Related papers (2021-12-02T16:39:35Z)
Pros and Cons of GAN Evaluation Measures: New Developments [53.10151901863263]
This work is an update of a previous paper on the same topic published a few years ago. I describe new dimensions that are becoming important in assessing models, and discuss the connection between GAN evaluation and deepfakes.
arXiv Detail & Related papers (2021-03-17T01:48:34Z)
Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.