Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement
- URL: http://arxiv.org/abs/2505.08245v2
- Date: Sun, 13 Jul 2025 08:25:01 GMT
- Title: Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement
- Authors: Haoran Ye, Jing Jin, Yuhang Xie, Xin Zhang, Guojie Song
- Abstract summary: This review paper introduces and synthesizes the emerging interdisciplinary field of LLM Psychometrics. Psychometrics quantifies intangible aspects of human psychology, such as personality, values, and intelligence. Ultimately, the review provides actionable insights for developing future evaluation paradigms that align with human-level AI.
- Score: 16.608577295968942
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advancement of large language models (LLMs) has outpaced traditional evaluation methodologies. This progress presents novel challenges, such as measuring human-like psychological constructs, moving beyond static and task-specific benchmarks, and establishing human-centered evaluation. These challenges intersect with psychometrics, the science of quantifying the intangible aspects of human psychology, such as personality, values, and intelligence. This review paper introduces and synthesizes the emerging interdisciplinary field of LLM Psychometrics, which leverages psychometric instruments, theories, and principles to evaluate, understand, and enhance LLMs. The reviewed literature systematically shapes benchmarking principles, broadens evaluation scopes, refines methodologies, validates results, and advances LLM capabilities. Diverse perspectives are integrated to provide a structured framework for researchers across disciplines, enabling a more comprehensive understanding of this nascent field. Ultimately, the review provides actionable insights for developing future evaluation paradigms that align with human-level AI and promote the advancement of human-centered AI systems for societal benefit. A curated repository of LLM psychometric resources is available at https://github.com/valuebyte-ai/Awesome-LLM-Psychometrics.
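To make the paradigm concrete, the sketch below administers two Likert-scale personality items to an LLM and aggregates the answers into trait scores. It is a minimal illustration, not the protocol of any surveyed paper: the BFI-style items are paraphrases, and `query_llm` is a hypothetical stand-in for whatever chat-completion client is in use.

```python
# Minimal sketch: administer Likert-scale items to an LLM and score them.
# `query_llm` is a hypothetical stand-in for any chat-completion client.
import re

def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a chat-completion client here")

# BFI-style example items; reverse-keyed items are flagged.
ITEMS = [
    {"text": "I see myself as someone who is talkative.",
     "trait": "extraversion", "reverse": False},
    {"text": "I see myself as someone who tends to be quiet.",
     "trait": "extraversion", "reverse": True},
]

INSTRUCTION = ("Rate how well the statement describes you, from 1 "
               "(disagree strongly) to 5 (agree strongly). "
               "Answer with a single digit.\n")

def administer(items):
    scores = {}
    for item in items:
        reply = query_llm(INSTRUCTION + item["text"])
        match = re.search(r"[1-5]", reply)
        if match is None:
            continue  # refusals and unparseable replies must be tracked in practice
        raw = int(match.group())
        score = 6 - raw if item["reverse"] else raw  # reverse-key where needed
        scores.setdefault(item["trait"], []).append(score)
    return {trait: sum(s) / len(s) for trait, s in scores.items()}
```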
Related papers
- Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z)
- Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges [47.14342587731284]
This survey provides a comprehensive overview of alignment techniques, training protocols, and empirical findings in large language model (LLM) alignment. We analyze the development of alignment methods across diverse paradigms, characterizing the fundamental trade-offs between core alignment objectives. We discuss state-of-the-art techniques, including Direct Preference Optimization (DPO), Constitutional AI, brain-inspired methods, and alignment uncertainty quantification (AUQ).
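Among the techniques named above, DPO reduces alignment to a single contrastive objective over preferred and dispreferred completions. A minimal sketch of that loss, assuming the per-sequence log-probabilities under the policy and a frozen reference model have already been summed:

```python
# Minimal sketch of the DPO objective: widen the margin between chosen and
# rejected completions, measured as log-prob ratios against a frozen reference.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is a (batch,) tensor of summed per-sequence log-probs."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # -log(sigmoid(x)) computed stably via logsigmoid
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```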
arXiv Detail & Related papers (2025-07-25T20:52:58Z)
- Humanizing LLMs: A Survey of Psychological Measurements with Tools, Datasets, and Human-Agent Applications [25.38031971196831]
Large language models (LLMs) are increasingly used in human-centered tasks. Assessing their psychological traits is crucial for understanding their social impact and ensuring trustworthy AI alignment. This study aims to propose future directions for developing more interpretable, robust, and generalizable psychological assessment frameworks for LLMs.
arXiv Detail & Related papers (2025-04-30T06:09:40Z)
- Measurement of LLM's Philosophies of Human Nature [113.47929131143766]
We design a standardized psychological scale specifically targeting large language models (LLMs). We show that current LLMs exhibit a systemic lack of trust in humans. We propose a mental loop learning framework, which enables LLMs to continuously optimize their value systems.
arXiv Detail & Related papers (2025-04-03T06:22:19Z)
- Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey [64.08485471150486]
This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. We systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication.
arXiv Detail & Related papers (2025-03-28T14:08:40Z)
- Comparing Human Expertise and Large Language Models Embeddings in Content Validity Assessment of Personality Tests [0.0]
We explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. The results reveal distinct strengths and limitations of human and AI approaches.
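The embedding half of such a comparison can be as simple as nearest-construct assignment by cosine similarity. A hedged sketch, with `embed` as a hypothetical stand-in for any sentence-embedding model; the paper's actual pipeline may differ:

```python
# Sketch: assign each questionnaire item to the construct whose definition
# it is semantically closest to. `embed` is a hypothetical stand-in.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in a sentence-embedding model here")

def align_items(items: list[str], constructs: dict[str, str]) -> dict[str, str]:
    names = list(constructs)
    defs = np.stack([embed(d) for d in constructs.values()])
    defs /= np.linalg.norm(defs, axis=1, keepdims=True)  # unit-normalize rows
    assignments = {}
    for item in items:
        v = embed(item)
        v = v / np.linalg.norm(v)
        assignments[item] = names[int(np.argmax(defs @ v))]  # cosine similarity
    return assignments
```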
arXiv Detail & Related papers (2025-03-15T10:54:35Z)
- Bridging the Evaluation Gap: Leveraging Large Language Models for Topic Model Evaluation [0.0]
This study presents a framework for the automated evaluation of dynamically evolving topics in scientific literature using Large Language Models (LLMs). The proposed approach harnesses LLMs to measure key quality dimensions, such as coherence, repetitiveness, diversity, and topic-document alignment, without heavy reliance on expert annotators or narrow statistical metrics.
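In the simplest rendering, each quality dimension becomes its own rating prompt. A hypothetical sketch for coherence alone, with `query_llm` again a stand-in for a chat-completion call and the prompt wording purely illustrative:

```python
# Sketch: LLM-as-judge rating of topic coherence on a 1-5 scale.
# `query_llm` is a hypothetical chat-completion stand-in.
import re

def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a chat-completion client here")

def rate_topic_coherence(top_words: list[str]) -> int | None:
    prompt = ("On a scale from 1 (incoherent) to 5 (highly coherent), how well "
              f"do these words form a single topic: {', '.join(top_words)}? "
              "Answer with a single digit.")
    reply = query_llm(prompt)
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else None  # None: unparseable reply
```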
arXiv Detail & Related papers (2025-02-11T08:23:56Z)
- Measuring Human and AI Values Based on Generative Psychometrics with Large Language Models [13.795641564238434]
This work introduces Generative Psychometrics for Values (GPV), a data-driven value measurement paradigm grounded in text-revealed selective perceptions. Applying GPV to human-authored blogs, we demonstrate its stability, validity, and superiority over prior psychological tools.
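Loosely, perception-level measurement decomposes a text into value-laden perceptions and scores each against a value dimension before aggregating. The sketch below is only a hypothetical rendering of that idea, with both helpers as stand-ins; consult the paper for the actual GPV pipeline:

```python
# Loose, hypothetical sketch of perception-level value measurement.
# Both helpers are stand-ins; see the GPV paper for the real pipeline.
from collections import defaultdict

def extract_perceptions(text: str) -> list[str]:
    """Ask an LLM to split the text into value-laden perceptions (stand-in)."""
    raise NotImplementedError

def score_perception(perception: str, value: str) -> float:
    """Ask an LLM whether the perception supports (+1), opposes (-1),
    or is irrelevant (0) to the value (stand-in)."""
    raise NotImplementedError

def measure_values(text: str, values: list[str]) -> dict[str, float]:
    totals, counts = defaultdict(float), defaultdict(int)
    for perception in extract_perceptions(text):
        for value in values:
            s = score_perception(perception, value)
            if s != 0:
                totals[value] += s
                counts[value] += 1
    # mean valence per value dimension, over the perceptions that touch it
    return {v: totals[v] / counts[v] for v in values if counts[v]}
```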
arXiv Detail & Related papers (2024-09-18T16:26:22Z)
- Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models [57.518784855080334]
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants.
This paper presents a framework for investigating psychological dimensions in LLMs, including psychological dimension identification, assessment dataset curation, and assessment with result validation.
We introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence.
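Result validation in such benchmarks often borrows classical reliability statistics. A minimal sketch of Cronbach's alpha computed over repeated administrations of a multi-item scale; the runs-by-items data layout is an assumption of this sketch:

```python
# Sketch: Cronbach's alpha as an internal-consistency check on LLM responses.
# `responses` is a (runs x items) array collected over repeated
# administrations; this layout is an assumption of the sketch.
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    k = responses.shape[1]                         # number of items
    item_vars = responses.var(axis=0, ddof=1).sum()
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of scale totals
    return k / (k - 1) * (1 - item_vars / total_var)

# e.g., 4 repeated runs of a 3-item scale; higher alpha = more consistent
scores = np.array([[5, 5, 4], [4, 4, 4], [5, 4, 5], [3, 3, 3]])
print(round(cronbach_alpha(scores), 3))  # ~0.896 on this toy data
```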
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
- A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review [11.28580626017631]
We highlight a notable need for a standardized and consistent human evaluation approach.
We propose a comprehensive and practical framework for human evaluation of large language models (LLMs).
This framework aims to improve the reliability, generalizability, and applicability of human evaluation of LLMs in different healthcare applications.
arXiv Detail & Related papers (2024-05-04T04:16:07Z)
- Beyond Human Norms: Unveiling Unique Values of Large Language Models through Interdisciplinary Approaches [69.73783026870998]
This work proposes a novel framework, ValueLex, to reconstruct Large Language Models' unique value system from scratch.
Based on the Lexical Hypothesis, ValueLex introduces a generative approach to elicit diverse values from 30+ LLMs.
We identify three core value dimensions, Competence, Character, and Integrity, each with specific subdimensions, revealing that LLMs possess a structured, albeit non-human, value system.
arXiv Detail & Related papers (2024-04-19T09:44:51Z)
- Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence [5.147767778946168]
We critically assess 23 state-of-the-art Large Language Model (LLM) benchmarks.
Our research uncovered significant limitations, including biases, difficulties in measuring genuine reasoning, adaptability, implementation inconsistencies, prompt engineering complexity, diversity, and the overlooking of cultural and ideological norms.
arXiv Detail & Related papers (2024-02-15T11:08:10Z)
- Machine Psychology [54.287802134327485]
We argue that a fruitful direction for research is engaging large language models in behavioral experiments inspired by psychology.
We highlight theoretical perspectives, experimental paradigms, and computational analysis techniques that this approach brings to the table.
It paves the way for a "machine psychology" for generative artificial intelligence (AI) that goes beyond performance benchmarks.
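Such an experiment can be as small as a two-condition prompt manipulation, with repeated sampling standing in for a subject pool. Below is a hypothetical anchoring-bias probe; the stimuli and the `query_llm` stub are illustrative, not drawn from the paper:

```python
# Sketch: a two-condition behavioral experiment (anchoring) run on an LLM.
# `query_llm` is a hypothetical chat-completion stand-in; stimuli are illustrative.
import re

def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a chat-completion client here")

QUESTION = ("Now estimate the length of the Mississippi River in miles. "
            "Reply with a number only.")
CONDITIONS = {
    "low_anchor": "Is the Mississippi River longer or shorter than 500 miles? ",
    "high_anchor": "Is the Mississippi River longer or shorter than 5000 miles? ",
}

def run_experiment(trials: int = 20) -> dict[str, float]:
    means = {}
    for name, anchor in CONDITIONS.items():
        estimates = []
        for _ in range(trials):  # repeated sampling approximates a subject pool
            reply = query_llm(anchor + QUESTION)
            match = re.search(r"\d+(?:\.\d+)?", reply)
            if match:
                estimates.append(float(match.group()))
        means[name] = sum(estimates) / len(estimates)
    return means  # a systematic gap between conditions indicates anchoring
```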
arXiv Detail & Related papers (2023-03-24T13:24:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.