HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs
- URL: http://arxiv.org/abs/2512.21849v1
- Date: Fri, 26 Dec 2025 03:54:56 GMT
- Title: HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs
- Authors: Jiaxin Liu, Peiyi Tu, Wenyu Chen, Yihong Zhuang, Xinxia Ling, Anji Zhou, Chenxi Wang, Zhuo Han, Zhengkai Yang, Junbo Zhao, Zenan Huang, Yuanyuan Wang,
- Abstract summary: HeartBench is a framework designed to evaluate the integrated emotional, cultural, and ethical dimensions of Chinese Large Language Models (LLMs). Even leading models achieve only 60% of the expert-defined ideal score. Analysis using a difficulty-stratified "Hard Set" reveals a significant performance decay in scenarios involving subtle emotional subtexts and complex ethical trade-offs.
- Score: 20.794341575633503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Large Language Models (LLMs) have achieved remarkable success in cognitive and reasoning benchmarks, they exhibit a persistent deficit in anthropomorphic intelligence: the capacity to navigate complex social, emotional, and ethical nuances. This gap is particularly acute in the Chinese linguistic and cultural context, where a lack of specialized evaluation frameworks and high-quality socio-emotional data impedes progress. To address these limitations, we present HeartBench, a framework designed to evaluate the integrated emotional, cultural, and ethical dimensions of Chinese LLMs. Grounded in authentic psychological counseling scenarios and developed in collaboration with clinical experts, the benchmark is structured around a theory-driven taxonomy comprising five primary dimensions and 15 secondary capabilities. We implement a case-specific, rubric-based methodology that translates abstract human-like traits into granular, measurable criteria through a "reasoning-before-scoring" evaluation protocol. Our assessment of 13 state-of-the-art LLMs indicates a substantial performance ceiling: even leading models achieve only 60% of the expert-defined ideal score. Furthermore, analysis using a difficulty-stratified "Hard Set" reveals a significant performance decay in scenarios involving subtle emotional subtexts and complex ethical trade-offs. HeartBench establishes a standardized metric for anthropomorphic AI evaluation and provides a methodological blueprint for constructing high-quality, human-aligned training data.
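The abstract ships no reference implementation, but the rubric-based "reasoning-before-scoring" protocol is concrete enough to sketch: each counseling case carries its own rubric of granular criteria, a judge model must reason about the response before emitting a per-criterion score, and the total is normalized against the expert-defined ideal. The data layout, prompt wording, and `call_judge_model` hook below are illustrative assumptions, not HeartBench's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str  # granular, case-specific requirement
    max_points: int   # expert-assigned weight

def build_judge_prompt(scenario: str, response: str, c: Criterion) -> str:
    # "Reasoning-before-scoring": the judge must explain before it rates.
    return (
        f"Scenario:\n{scenario}\n\n"
        f"Model response:\n{response}\n\n"
        f"Criterion: {c.description} (0-{c.max_points} points)\n"
        "First analyse how well the response meets this criterion, "
        "then end with a line of the form 'SCORE: <integer>'."
    )

def score_response(scenario, response, rubric, call_judge_model):
    """Return the response's score as a fraction of the expert-defined ideal."""
    earned = 0
    for c in rubric:
        judgment = call_judge_model(build_judge_prompt(scenario, response, c))
        score = int(judgment.rsplit("SCORE:", 1)[-1].strip())  # parse last line
        earned += min(max(score, 0), c.max_points)  # clamp to the valid range
    ideal = sum(c.max_points for c in rubric)
    return earned / ideal  # leading models reportedly reach only ~0.60
```

Here `call_judge_model` stands in for any text-in/text-out LLM call; supplying a different rubric per scenario is what makes the evaluation case-specific.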
Related papers
- The Linguistic Architecture of Reflective Thought: Evaluation of a Large Language Model as a Tool to Isolate the Formal Structure of Mentalization [0.0]
Mentalization integrates cognitive, affective, and intersubjective components. Large Language Models (LLMs) display an increasing ability to generate reflective texts.
arXiv Detail & Related papers (2025-11-20T23:51:34Z)
- Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models [118.44328586173556]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. Human-MME is a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. The benchmark extends single-target understanding to multi-person and multi-image mutual understanding.
arXiv Detail & Related papers (2025-09-30T12:20:57Z)
- 11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis [54.24689751375923]
This work introduces a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs. Through experiments across 14 MLLMs and human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. These findings highlight both the emerging capabilities and the persistent limitations of current MLLMs' spatial reasoning.
arXiv Detail & Related papers (2025-08-27T17:22:34Z)
- Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap [44.608160256874726]
This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability.
arXiv Detail & Related papers (2025-08-26T03:43:05Z)
- A Computational Framework to Identify Self-Aspects in Text [9.187473897664105]
The Self is a multifaceted construct, and it is reflected in language. Many of the aspects of the Self align with psychological and other well-researched phenomena. This proposal introduces a plan to develop a computational framework to identify Self-aspects in text.
arXiv Detail & Related papers (2025-07-17T13:31:04Z)
- Comparing Human Expertise and Large Language Models Embeddings in Content Validity Assessment of Personality Tests [0.0]
We explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. The results reveal distinct strengths and limitations of human and AI approaches.
arXiv Detail & Related papers (2025-03-15T10:54:35Z)
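The abstract does not say how the LLM side of item-construct alignment is computed; a plausible minimal version embeds each questionnaire item and each construct definition, then ranks constructs by cosine similarity. The sketch below assumes the sentence-transformers library; the model name, item, and construct definitions are invented for illustration.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

constructs = {
    "Extraversion": "Tendency to be sociable, assertive, and energetic.",
    "Neuroticism": "Tendency to experience negative emotions such as anxiety.",
}
item = "I feel comfortable around people."  # a made-up questionnaire item

item_vec = model.encode(item, convert_to_tensor=True)
for name, definition in constructs.items():
    construct_vec = model.encode(definition, convert_to_tensor=True)
    sim = util.cos_sim(item_vec, construct_vec).item()
    print(f"{name}: cosine similarity = {sim:.3f}")

# The item is assigned to the construct whose definition it is closest to;
# agreement with expert panel ratings then measures content validity.
```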
- MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [5.02953506943752]
MM-IQ is a comprehensive evaluation framework that comprises a large-scale training set with 4,776 visual reasoning problems and 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms. Our benchmark reveals striking limitations: even state-of-the-art architectures perform only marginally better than random chance. Inspired by the recent surge of large reasoning models, we also release a baseline multimodal reasoning model trained via reinforcement learning with verifiable reward functions.
arXiv Detail & Related papers (2025-02-02T07:12:03Z)
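"Verifiable reward functions" ordinarily means rewards computed by programmatic answer checking rather than by a learned reward model, which fits MM-IQ's discrete-answer problems. Below is a minimal sketch under that assumption; the "Answer: <letter>" output convention is invented for illustration, not MM-IQ's actual format.

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward for RL fine-tuning: 1.0 iff the final answer is correct.

    Assumes the model is instructed to end with 'Answer: <option letter>',
    which a checker can verify deterministically (no reward model needed).
    """
    match = re.search(r"Answer:\s*([A-H])\b", model_output, flags=re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable output earns nothing
    return 1.0 if match.group(1).upper() == ground_truth.upper() else 0.0

# Example usage inside an RL loop (e.g., policy-gradient updates):
print(verifiable_reward("The pattern rotates 90 degrees. Answer: C", "C"))  # 1.0
print(verifiable_reward("I think it is D. Answer: D", "C"))                 # 0.0
```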
- LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
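LlaMADRS targets the MADRS (Montgomery-Åsberg Depression Rating Scale), whose ten items are each rated 0-6. The carefully designed cues live in the paper, not the abstract, so the following is a hedged reconstruction of what a zero-shot scoring prompt for a single item might look like, not the authors' actual prompt.

```python
import re

# Hypothetical MADRS item; the anchor text is abbreviated for illustration.
MADRS_ITEM = {
    "name": "Apparent Sadness",
    "anchors": "0 = no sadness ... 6 = extreme, continuous sadness",
}

def build_madrs_prompt(transcript: str, item: dict) -> str:
    return (
        "You are assisting with a clinical depression assessment.\n"
        f"Rate the MADRS item '{item['name']}' on a 0-6 scale "
        f"({item['anchors']}), based only on the interview below.\n\n"
        f"Interview transcript:\n{transcript}\n\n"
        "Briefly cite evidence from the transcript, then answer with "
        "'Rating: <0-6>'."
    )

def parse_rating(output: str) -> int | None:
    m = re.search(r"Rating:\s*([0-6])", output)
    return int(m.group(1)) if m else None  # None if the model did not comply
```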
- Evaluating Large Language Models with Psychometrics [59.821829073478376]
This paper offers a comprehensive benchmark for quantifying psychological constructs of Large Language Models (LLMs). Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited to system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
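MR-Ben casts the model as a grader: given a question and a candidate step-by-step solution, it must judge whether the solution is sound and, if not, locate the error. A minimal scorer for that setup might look like the sketch below; the field names and the partial-credit rule are assumptions, since the abstract does not spell out the scoring protocol.

```python
from dataclasses import dataclass

@dataclass
class MetaReasoningItem:
    question: str
    steps: list[str]               # candidate chain-of-thought, step by step
    is_correct: bool               # gold label for the whole solution
    first_error_step: int | None   # 1-indexed; None if the solution is sound

def score_judgment(item: MetaReasoningItem,
                   predicted_correct: bool,
                   predicted_error_step: int | None) -> float:
    """Score a model's meta-reasoning judgment on one item.

    Full credit requires the right verdict and, for flawed solutions,
    pinpointing the first erroneous step.
    """
    if predicted_correct != item.is_correct:
        return 0.0  # wrong verdict: no credit
    if item.is_correct:
        return 1.0  # correctly recognized a sound solution
    # Flawed solution, verdict was right; now check error localization.
    return 1.0 if predicted_error_step == item.first_error_step else 0.5
```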
- ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert scales.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)