Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap
- URL: http://arxiv.org/abs/2508.18646v1
- Date: Tue, 26 Aug 2025 03:43:05 GMT
- Title: Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap
- Authors: Jun Wang, Ninglun Gu, Kailai Zhang, Zijiao Zhang, Yelun Bao, Jin Yang, Xu Yin, Liwei Liu, Yihuan Liu, Pengyong Li, Gary G. Yen, Junchi Yan
- Abstract summary: This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability.
- Score: 44.608160256874726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For Large Language Models (LLMs), a disconnect persists between benchmark performance and real-world utility. Current evaluation frameworks remain fragmented, prioritizing technical metrics while neglecting holistic assessment for deployment. This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence, proposing a novel three-dimensional taxonomy: Intelligence Quotient (IQ)-General Intelligence for foundational capacity, Emotional Quotient (EQ)-Alignment Ability for value-based interactions, and Professional Quotient (PQ)-Professional Expertise for specialized proficiency. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability. Our modular architecture integrates six components with an implementation roadmap. Through analysis of 200+ benchmarks, we identify key challenges including dynamic assessment needs and interpretability gaps. The survey provides actionable guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound. We maintain a curated repository of open-source evaluation resources at: https://github.com/onejune2018/Awesome-LLM-Eval.
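As a reading aid, the three-dimensional taxonomy and the VQ framework can be pictured as a small scoring schema. The Python sketch below is a hypothetical illustration by this summary, not the survey's implementation: every class name, field, and weight is an assumption.

```python
# Hypothetical schema for the IQ/EQ/PQ capability taxonomy and the four
# VQ dimensions described in the abstract. Names and weights are
# illustrative assumptions, not values from the survey.
from dataclasses import dataclass

@dataclass
class CapabilityScores:
    iq: float  # General Intelligence: foundational capacity
    eq: float  # Alignment Ability: value-based interactions
    pq: float  # Professional Expertise: specialized proficiency

@dataclass
class ValueScores:
    economic_viability: float
    social_impact: float
    ethical_alignment: float
    environmental_sustainability: float

def deployment_score(cap: CapabilityScores, val: ValueScores,
                     w_cap: float = 0.6, w_val: float = 0.4) -> float:
    """Fold capability and value dimensions into one number.

    The equal per-dimension averaging and the 0.6/0.4 split are
    placeholder choices, not weights proposed by the paper.
    """
    cap_mean = (cap.iq + cap.eq + cap.pq) / 3
    val_mean = (val.economic_viability + val.social_impact +
                val.ethical_alignment + val.environmental_sustainability) / 4
    return w_cap * cap_mean + w_val * val_mean

if __name__ == "__main__":
    cap = CapabilityScores(iq=0.82, eq=0.74, pq=0.68)
    val = ValueScores(0.70, 0.65, 0.80, 0.60)
    print(f"deployment score: {deployment_score(cap, val):.3f}")
```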
Related papers
- HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs [20.794341575633503]
HeartBench is a framework designed to evaluate the integrated emotional, cultural, and ethical dimensions of Chinese Large Language Models (LLMs). Even leading models achieve only 60% of the expert-defined ideal score. Analysis using a difficulty-stratified "Hard Set" reveals a significant performance decay in scenarios involving subtle emotional subtexts and complex ethical trade-offs.
arXiv Detail & Related papers (2025-12-26T03:54:56Z) - AgenticIQA: An Agentic Framework for Adaptive and Interpretable Image Quality Assessment [69.06977852423564]
Image quality assessment (IQA) reflects both the quantification and interpretation of perceptual quality rooted in the human visual system. AgenticIQA decomposes IQA into four subtasks: distortion detection, distortion analysis, tool selection, and tool execution. To support training and evaluation, we introduce AgenticIQA-200K, a large-scale instruction dataset tailored for IQA agents, and AgenticIQA-Eval, the first benchmark for assessing the planning, execution, and summarization capabilities of VLM-based IQA agents.
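The four-subtask decomposition named above maps naturally onto a staged pipeline. The sketch below is only a schematic of that control flow under this summary's assumptions: the handlers are stubs, and only the subtask names and their ordering come from the abstract.

```python
# Schematic of the four-stage decomposition named in the AgenticIQA
# abstract. Handlers are stubs; in the actual system each stage would
# invoke a VLM agent or an external IQA tool.
from enum import Enum, auto

class Subtask(Enum):
    DISTORTION_DETECTION = auto()
    DISTORTION_ANALYSIS = auto()
    TOOL_SELECTION = auto()
    TOOL_EXECUTION = auto()

def run_pipeline(image_path: str) -> dict:
    """Run the four subtasks in order, threading a shared state dict."""
    state = {"image": image_path}
    for subtask in Subtask:  # Enum iteration preserves definition order
        state[subtask.name.lower()] = f"<stub result of {subtask.name}>"
    return state

if __name__ == "__main__":
    print(run_pipeline("example.png"))
```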
arXiv Detail & Related papers (2025-09-30T09:37:01Z) - EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World? [52.99661576320663]
Multimodal large language models (MLLMs) have driven breakthroughs in egocentric vision applications. EOC-Bench is an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. We conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs based on EOC-Bench.
arXiv Detail & Related papers (2025-06-05T17:44:12Z) - MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks [0.0]
We propose MEQA, a framework for the meta-evaluation of question and answer (QA) benchmarks. We demonstrate this approach on cybersecurity benchmarks, using human and LLM evaluators. We motivate our choice of test domain by AI models' dual nature as powerful defensive tools and security threats.
arXiv Detail & Related papers (2025-04-18T19:01:53Z) - A Comprehensive Survey of Action Quality Assessment: Method and Benchmark [25.694556140797832]
Action Quality Assessment (AQA) quantitatively evaluates the quality of human actions, providing automated assessments that reduce biases in human judgment. Recent advances in AQA have introduced innovative methodologies, but similar methods often intertwine across different domains. The lack of a unified benchmark and limited computational comparisons hinder consistent evaluation and fair assessment of AQA approaches.
arXiv Detail & Related papers (2024-12-15T10:47:26Z) - Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [52.76508734756661]
Auto-PRE is an automatic evaluation framework inspired by the peer review process. Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluators based on three core traits. Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-10-16T06:06:06Z) - TestAgent: A Framework for Domain-Adaptive Evaluation of LLMs via Dynamic Benchmark Construction and Exploratory Interaction [29.72874725703848]
Large language models (LLMs) are increasingly deployed to various vertical domains. Current evaluation methods rely on static and resource-intensive datasets that are not aligned with real-world requirements. We introduce two key concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible "strategy-criterion" format. We propose TestAgent, an agent-based evaluation framework that implements these concepts using retrieval-augmented generation and reinforcement learning.
arXiv Detail & Related papers (2024-10-15T11:20:42Z) - Towards Flexible Evaluation for Generative Visual Question Answering [17.271448204525612]
This paper proposes the use of semantics-based evaluators for assessing unconstrained open-ended responses on Visual Question Answering (VQA) datasets.
In addition, this paper proposes the Semantically Flexible VQA Evaluator (SFVE), meticulously designed around the unique features of VQA evaluation.
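To make "semantics-based evaluator" concrete: the toy scorer below rates an open-ended answer by embedding similarity to the reference instead of exact match. It is a stand-in under this summary's assumptions, not the paper's SFVE; the sentence-transformers backbone and model name are placeholder choices.

```python
# Toy semantics-based scoring for open-ended VQA answers: a paraphrase of
# the reference scores high even where exact-match metrics would fail.
# This is a stand-in illustration, not the paper's SFVE model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder backbone

def semantic_score(candidate: str, reference: str) -> float:
    """Cosine similarity in [-1, 1] between candidate and reference."""
    emb = model.encode([candidate, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

if __name__ == "__main__":
    print(semantic_score("Two dogs are playing in the park.",
                         "A pair of dogs play at the park."))
```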
arXiv Detail & Related papers (2024-08-01T05:56:34Z) - Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
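"Instance-specific evaluation criteria" means each test item carries its own rubric rather than sharing one generic scale. The toy judge-prompt builder below illustrates the idea under this summary's assumptions; the template wording and field names are hypothetical, not BiGGen Bench's actual format.

```python
# Toy illustration of instance-specific evaluation criteria: each test
# instance carries its own rubric, spliced into an LLM-judge prompt.
# Template wording and field names are hypothetical.
JUDGE_TEMPLATE = """You are grading a model response.
Task instruction: {instruction}
Model response: {response}
Rubric for this instance: {rubric}
Score the response from 1 to 5 against the rubric only."""

def build_judge_prompt(instance: dict, response: str) -> str:
    return JUDGE_TEMPLATE.format(
        instruction=instance["instruction"],
        response=response,
        rubric=instance["rubric"],  # criteria written for this one item
    )

if __name__ == "__main__":
    item = {
        "instruction": "Explain overfitting to a high-school student.",
        "rubric": "Uses a concrete everyday analogy; avoids jargon.",
    }
    print(build_judge_prompt(item, "Overfitting is like memorizing answers..."))
```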
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs).
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z) - Post Turing: Mapping the landscape of LLM Evaluation [22.517544562890663]
This paper traces the historical trajectory of Large Language Model (LLM) evaluation, from the foundational questions posed by Alan Turing to the modern era of AI research.
We emphasize the pressing need for a unified evaluation system, given the broader societal implications of these models.
This work serves as a call for the AI community to collaboratively address the challenges of LLM evaluation, ensuring their reliability, fairness, and societal benefit.
arXiv Detail & Related papers (2023-11-03T17:24:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.