Human Preference-Aligned Concept Customization Benchmark via Decomposed Evaluation
- URL: http://arxiv.org/abs/2509.03385v1
- Date: Wed, 03 Sep 2025 15:02:40 GMT
- Title: Human Preference-Aligned Concept Customization Benchmark via Decomposed Evaluation
- Authors: Reina Ishikawa, Ryo Fujii, Hideo Saito, Ryo Hachiuma
- Abstract summary: We propose Decomposed GPT Score (D-GPTScore), a novel human-aligned evaluation method. We release the Human Preference-Aligned Concept Customization Benchmark (CC-AlignBench), a benchmark dataset. Our method significantly outperforms existing approaches on this benchmark, exhibiting higher correlation with human preferences.
- Score: 19.889844251026542
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Evaluating concept customization is challenging, as it requires a comprehensive assessment of fidelity to generative prompts and concept images. Moreover, evaluating multiple concepts is considerably more difficult than evaluating a single concept, as it demands detailed assessment not only for each individual concept but also for the interactions among concepts. While humans can intuitively assess generated images, existing metrics often provide either overly narrow or overly generalized evaluations, resulting in misalignment with human preference. To address this, we propose Decomposed GPT Score (D-GPTScore), a novel human-aligned evaluation method that decomposes evaluation criteria into finer aspects and incorporates aspect-wise assessments using Multimodal Large Language Model (MLLM). Additionally, we release Human Preference-Aligned Concept Customization Benchmark (CC-AlignBench), a benchmark dataset containing both single- and multi-concept tasks, enabling stage-wise evaluation across a wide difficulty range -- from individual actions to multi-person interactions. Our method significantly outperforms existing approaches on this benchmark, exhibiting higher correlation with human preferences. This work establishes a new standard for evaluating concept customization and highlights key challenges for future research. The benchmark and associated materials are available at https://github.com/ReinaIshikawa/D-GPTScore.
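As a rough illustration of the aspect-wise decomposed scoring described above, the Python sketch below averages per-aspect MLLM judgments for one generated image. The aspect names, prompt wording, and uniform averaging are assumptions made for illustration only; the paper's actual criteria, prompts, and aggregation are defined in the official repository.

```python
# Minimal sketch of aspect-wise decomposed scoring in the spirit of D-GPTScore.
# The aspect list, prompt wording, and uniform averaging are illustrative
# assumptions only; the paper's exact criteria live in the official repository.
from statistics import mean
from typing import Callable, Sequence

ASPECTS = [  # hypothetical aspect names, not the paper's official list
    "concept identity fidelity",
    "prompt (action/scene) fidelity",
    "inter-concept interaction plausibility",
    "overall image quality",
]

def decomposed_score(
    generated_image: bytes,
    concept_images: Sequence[bytes],
    prompt: str,
    ask_mllm: Callable[[str, bytes, Sequence[bytes]], float],
) -> float:
    """Average per-aspect MLLM judgments (each on a 0-10 scale) for one image."""
    per_aspect = []
    for aspect in ASPECTS:
        question = (
            f"On a 0-10 scale, rate how well the generated image satisfies "
            f"'{aspect}' given the prompt: {prompt!r}. Reply with a number only."
        )
        per_aspect.append(ask_mllm(question, generated_image, concept_images))
    return mean(per_aspect)  # uniform weights assumed; the paper may weight aspects differently
```

The MLLM call is abstracted as the `ask_mllm` parameter so the sketch stays independent of any particular model API.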
Related papers
- InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem [87.30601926271864]
InnoEval is a deep innovation evaluation framework designed to emulate human-level idea assessment.
We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources.
We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval.
arXiv Detail & Related papers (2026-02-16T00:40:31Z) - A Theoretical Framework for Adaptive Utility-Weighted Benchmarking [0.0]
This paper introduces a theoretical framework that reconceptualizes benchmarking as a multilayer, adaptive network linking evaluation metrics, model components, and stakeholder groups through weighted interactions.
Using conjoint-derived utilities and a human-in-the-loop update rule, we formalize how human tradeoffs can be embedded into benchmark structure and how benchmarks can evolve dynamically while preserving stability and interpretability.
arXiv Detail & Related papers (2026-02-12T19:33:47Z) - Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models [118.44328586173556]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks.
Human-MME is a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding.
Our benchmark extends single-target understanding to multi-person and multi-image mutual understanding.
arXiv Detail & Related papers (2025-09-30T12:20:57Z) - From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback [36.68929551237421]
We introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results.
Our project homepage and dataset are available at https://liudan193.io/Feedbacker.
arXiv Detail & Related papers (2025-05-10T16:52:40Z) - Bayesian Active Learning for Multi-Criteria Comparative Judgement in Educational Assessment [2.443343861973814]
Comparative Judgement (CJ) provides an alternative assessment approach by evaluating work holistically rather than breaking it into discrete criteria.
This method leverages the human ability to make nuanced comparisons, yielding more reliable and valid assessments.
However, rubrics remain widely used in education, offering structured criteria for grading and detailed feedback.
This creates a gap between CJ's holistic ranking and the need for criterion-based performance breakdowns.
arXiv Detail & Related papers (2025-03-01T13:12:41Z) - SedarEval: Automated Evaluation using Self-Adaptive Rubrics [4.97150240417381]
We propose a new evaluation paradigm based on self-adaptive rubrics.
SedarEval consists of 1,000 meticulously crafted questions, each with its own self-adaptive rubric.
We train a specialized evaluator language model (evaluator LM) to supplant human graders.
arXiv Detail & Related papers (2025-01-26T16:45:09Z) - HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF).
In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.
We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z) - CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - What Are We Optimizing For? A Human-centric Evaluation of Deep Learning-based Movie Recommenders [12.132920692489911]
We conduct a human-centric evaluation case study of four leading DL-RecSys models in the movie domain.
We test how different DL-RecSys models perform in personalized recommendation generation by conducting a survey study with 445 real, active users.
We find some DL-RecSys models to be superior in recommending novel and unexpected items but weaker in diversity, trustworthiness, transparency, accuracy, and overall user satisfaction.
arXiv Detail & Related papers (2024-01-21T23:56:57Z) - Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS), which fits human preference based on quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
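The Preference Score itself is a learned metric whose details are in the paper; purely to make "fitting human preference from dimension-wise evaluations" concrete, here is a toy least-squares fit with hypothetical dimension names and made-up numbers, standing in for whatever model the paper actually trains.

```python
# Toy sketch: fit a mapping from per-dimension scores to human preference.
# A plain least-squares fit stands in for the paper's learned metric; the
# dimension names and all numbers below are hypothetical.
import numpy as np

# Rows: generated results; columns: hypothetical per-dimension scores
# (e.g., lip sync, identity preservation, expression naturalness).
dim_scores = np.array([
    [0.9, 0.7, 0.8],
    [0.4, 0.6, 0.5],
    [0.8, 0.9, 0.7],
    [0.3, 0.2, 0.4],
])
human_pref = np.array([0.85, 0.50, 0.80, 0.25])  # toy preference labels

# Fit weights mapping dimension scores to preference (bias column appended).
X = np.hstack([dim_scores, np.ones((len(dim_scores), 1))])
w, *_ = np.linalg.lstsq(X, human_pref, rcond=None)

def preference_score(scores: np.ndarray) -> float:
    """Predict a preference score for a new item's dimension-wise scores."""
    return float(np.append(scores, 1.0) @ w)

print(preference_score(np.array([0.7, 0.8, 0.6])))
```

Once fitted, such a mapping can score new outputs without further human annotation, which is the property the abstract highlights.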
arXiv Detail & Related papers (2023-07-20T07:04:16Z) - MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
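To make the ACU idea concrete, the sketch below scores a summary as the fraction of reference atomic content units judged present in it. The matcher is abstracted as a callable (in practice a human annotator or an entailment model); the exact RoSE protocol and annotation details are defined in the paper, so this is only an assumed simplification.

```python
# Hedged sketch of ACU-style scoring: the fraction of reference atomic content
# units (ACUs) that a matcher judges to be supported by the system summary.
from typing import Callable, Sequence

def acu_score(
    summary: str,
    acus: Sequence[str],
    is_supported: Callable[[str, str], bool],
) -> float:
    """Return the fraction of atomic content units judged present in the summary."""
    if not acus:
        return 0.0
    hits = sum(is_supported(summary, acu) for acu in acus)
    return hits / len(acus)

# Toy usage with a naive word-overlap matcher (illustration only).
toy = acu_score(
    "The bill passed the senate on Tuesday.",
    ["the bill passed", "the vote happened on tuesday"],
    lambda s, u: all(w in s.lower() for w in u.split()),
)
print(toy)  # 0.5 with this naive matcher
```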
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - Enriching ImageNet with Human Similarity Judgments and Psychological Embeddings [7.6146285961466]
We introduce a dataset that embodies the task-general capabilities of human perception and reasoning.
The Human Similarity Judgments extension to ImageNet (ImageNet-HSJ) is composed of human similarity judgments.
The new dataset supports a range of task and performance metrics, including the evaluation of unsupervised learning algorithms.
arXiv Detail & Related papers (2020-11-22T13:41:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.