Evaluating LLM Alignment on Personality Inference from Real-World Interview Data
- URL: http://arxiv.org/abs/2509.13244v1
- Date: Tue, 16 Sep 2025 16:54:35 GMT
- Title: Evaluating LLM Alignment on Personality Inference from Real-World Interview Data
- Authors: Jianfeng Zhu, Julina Maharjan, Xinyu Li, Karin G. Coifman, Ruoming Jin
- Abstract summary: Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding. Their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored. We introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding, such as emotional support agents, counselors, and decision-making assistants. However, their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored, particularly in ecologically valid conversational settings. While prior work has simulated LLM "personas" using discrete Big Five labels on social media data, the alignment of LLMs with continuous, ground-truth personality assessments derived from natural interactions is largely unexamined. To address this gap, we introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores. Using this dataset, we systematically evaluate LLM performance across three paradigms: (1) zero-shot and chain-of-thought prompting with GPT-4.1 Mini, (2) LoRA-based fine-tuning applied to both RoBERTa and Meta-LLaMA architectures, and (3) regression using static embeddings from pretrained BERT and OpenAI's text-embedding-3-small. Our results reveal that all Pearson correlations between model predictions and ground-truth personality traits remain below 0.26, highlighting the limited alignment of current LLMs with validated psychological constructs. Chain-of-thought prompting offers minimal gains over zero-shot, suggesting that personality inference relies more on latent semantic representation than explicit reasoning. These findings underscore the challenges of aligning LLMs with complex human attributes and motivate future work on trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.
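As a concrete illustration of paradigm (3) from the abstract, the sketch below regresses a continuous trait score on static text embeddings and reports the Pearson correlation that serves as the paper's alignment metric. This is a minimal stand-in, not the authors' code: the embeddings and trait scores are random placeholders for the (non-public) interview dataset, the dimensions are invented, and ridge regression is fit in closed form rather than with any particular library's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 64                         # hypothetical: 200 transcripts, 64-dim embeddings
X = rng.normal(size=(n, d))            # stand-in for static BERT / text-embedding features
y = rng.normal(size=n)                 # stand-in for one continuous Big Five trait score

# 75/25 train/test split
X_tr, X_te = X[:150], X[150:]
y_tr, y_te = y[:150], y[150:]

# closed-form ridge regression: w = (X^T X + aI)^{-1} X^T y
a = 1.0
w = np.linalg.solve(X_tr.T @ X_tr + a * np.eye(d), X_tr.T @ y_tr)

# Pearson r between held-out predictions and ground-truth trait scores,
# the alignment metric the abstract reports as staying below 0.26
pred = X_te @ w
r = np.corrcoef(pred, y_te)[0, 1]
print(f"Pearson r = {r:.3f}")
```

With random features, r will hover near zero; the point of the sketch is only the evaluation shape: fit on training transcripts, predict held-out trait scores, correlate against the validated ground truth per trait.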
Related papers
- Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning [52.07170679746533]
Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics (prompt-to-line consistency, line-to-line consistency, and Q&A consistency) that capture different types of persona drift, and validate each against human annotations.
arXiv Detail & Related papers (2025-10-31T19:40:41Z)
- TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation [55.55404595177229]
Large Language Models (LLMs) are exhibiting emergent human-like abilities. TwinVoice is a benchmark for assessing persona simulation across diverse real-world contexts.
arXiv Detail & Related papers (2025-10-29T14:00:42Z)
- Benchmarking Large Language Models for Personalized Guidance in AI-Enhanced Learning [4.990353320509215]
Large Language Models (LLMs) are increasingly envisioned as intelligent assistants for personalized learning. This study conducts an empirical comparison of three state-of-the-art LLMs on a tutoring task that simulates a realistic learning setting.
arXiv Detail & Related papers (2025-09-02T14:21:59Z)
- CAPE: Context-Aware Personality Evaluation Framework for Large Language Models [8.618075786777219]
We propose the first Context-Aware Personality Evaluation framework for Large Language Models (LLMs). Our experiments reveal that conversational history enhances response consistency via in-context learning but also induces personality shifts. Our framework can be applied to Role Playing Agents (RPAs) to better align with human judgments.
arXiv Detail & Related papers (2025-08-28T03:17:47Z)
- Can LLMs Infer Personality from Real World Conversations? [5.705775078773656]
Large Language Models (LLMs) offer a promising approach for scalable personality assessment from open-ended language. Three state-of-the-art LLMs were tested using zero-shot prompting for BFI-10 item prediction and both zero-shot and chain-of-thought prompting for Big Five trait inference. All models showed high test-retest reliability, but construct validity was limited.
arXiv Detail & Related papers (2025-07-18T20:22:47Z)
- Evaluating AI Alignment in Eleven LLMs through Output-Based Analysis and Human Benchmarking [0.0]
Large language models (LLMs) are increasingly used in psychological research and practice, yet traditional benchmarks reveal little about the values they express in real interaction. We introduce PAPERS, an output-based evaluation of the values LLMs express.
arXiv Detail & Related papers (2025-06-14T20:14:02Z)
- If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs [55.8331366739144]
We introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in large language models (LLMs). Our fact-checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches.
arXiv Detail & Related papers (2025-03-30T16:50:57Z)
- Evaluating Large Language Models with Psychometrics [59.821829073478376]
This paper offers a comprehensive benchmark for quantifying psychological constructs of Large Language Models (LLMs). Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.