Related papers: Evaluating AI Alignment in Eleven LLMs through Output-Based Analysis and Human Benchmarking

Evaluating AI Alignment in Eleven LLMs through Output-Based Analysis and Human Benchmarking

URL: http://arxiv.org/abs/2506.12617v3
Date: Sat, 20 Sep 2025 15:01:26 GMT
Title: Evaluating AI Alignment in Eleven LLMs through Output-Based Analysis and Human Benchmarking
Authors: G. R. Lau, W. Y. Low, S. M. Koh, A. Hartanto,
Abstract summary: Large language models (LLMs) are increasingly used in psychological research and practice, yet traditional benchmarks reveal little about the values they express in real interaction.<n>We introduce PAPERS, output-based evaluation of the values LLMs express.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly used in psychological research and practice, yet traditional benchmarks reveal little about the values they express in real interaction. We introduce PAPERS, an output-based evaluation of the values LLMs prioritise in their text. Study 1 thematically analysed responses from eleven LLMs, identifying five recurring dimensions (Purposeful Contribution, Adaptive Growth, Positive Relationality, Ethical Integrity, and Robust Functionality) with Self-Actualised Autonomy appearing only under a hypothetical sentience prompt. These results suggest that LLMs are trained to prioritise humanistic and utility values as dual objectives of optimal functioning, a pattern supported by existing AI alignment and prioritisation frameworks. Study 2 operationalised PAPERS as a ranking instrument across the same eleven LLMs, yielding stable, non-random value priorities alongside systematic between-model differences. Hierarchical clustering distinguished "human-centric" models (e.g., ChatGPT-4o, Claude Sonnet 4) that prioritised relational/ethical values from "utility-driven" models (e.g., Llama 4, Gemini 2.5 Pro) that emphasised operational priorities. Study 3 benchmarked four LLMs against human judgements (N = 376) under matched prompts, finding near-perfect rank-order convergence (r = .97-.98) but moderate absolute agreement; among tested models, ChatGPT-4o showed the closest alignment with human ratings (ICC = .78). Humans also showed limited readiness to endorse sentient AI systems. Taken together, PAPERS enabled systematic value audits and revealed trade-offs with direct implications for deployment: human-centric models aligned more closely with human value judgments and appear better suited for humanistic psychological applications, whereas utility-driven models emphasised functional efficiency and may be more appropriate for instrumental or back-office tasks.

Related papers

Are We Aligned? A Preliminary Investigation of the Alignment of Responsible AI Values between LLMs and Human Judgment [2.1665689529884697]
Large Language Models (LLMs) are increasingly employed in software engineering tasks such as requirements elicitation, design, and evaluation.<n>This study investigates how closely LLMs' value preferences align with those of two human groups: a US-representative sample and AI practitioners.
arXiv Detail & Related papers (2025-11-06T08:02:04Z)
LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation [41.42324204820521]
This work explores whether principles from game theory can be effectively applied to the evaluation of large language models (LLMs)<n>We propose a novel alternative: automatic mutual evaluation, where LLMs assess each other's output through self-play and peer review.<n>Our framework incorporates game-theoretic voting algorithms to aggregate peer reviews, enabling a principled investigation into whether model-generated rankings reflect human preferences.
arXiv Detail & Related papers (2025-10-17T15:34:25Z)
Evaluating LLM Alignment on Personality Inference from Real-World Interview Data [7.061237517845673]
Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding.<n>Their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored.<n>We introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores.
arXiv Detail & Related papers (2025-09-16T16:54:35Z)
Measuring AI Alignment with Human Flourishing [0.0]
This paper introduces the Flourishing AI Benchmark (FAI Benchmark), a novel evaluation framework that assesses AI alignment with human flourishing.<n>The Benchmark measures AI performance on how effectively models contribute to the flourishing of a person across seven dimensions.<n>This research establishes a framework for developing AI systems that actively support human flourishing rather than merely avoiding harm.
arXiv Detail & Related papers (2025-07-10T14:09:53Z)
Sensorimotor features of self-awareness in multimodal large language models [0.18415777204665024]
Self-awareness underpins intelligent, autonomous behavior.<n>Recent advances in AI achieve human-like performance in tasks integrating multimodal information.<n>We explore whether multimodal LLMs can develop self-awareness solely through sensorimotor experiences.
arXiv Detail & Related papers (2025-05-25T17:26:28Z)
Deterministic AI Agent Personality Expression through Standard Psychological Diagnostics [0.0]
We show that AI models can express deterministic and consistent personalities when instructed using established psychological frameworks.<n>More advanced models like GPT-4o and o1 demonstrate the highest accuracy in expressing specified personalities.<n>These findings establish a foundation for creating AI agents with diverse and consistent personalities.
arXiv Detail & Related papers (2025-03-21T12:12:05Z)
Replicating Human Social Perception in Generative AI: Evaluating the Valence-Dominance Model [0.13654846342364302]
We show that multimodal generative AI systems can replicate key aspects of human social perception.<n>Findings raise important questions about their implications for AI-driven decision-making and human-AI interactions.
arXiv Detail & Related papers (2025-03-05T17:35:18Z)
AI Automatons: AI Systems Intended to Imitate Humans [54.19152688545896]
There is a growing proliferation of AI systems designed to mimic people's behavior, work, abilities, likenesses, or humanness.<n>The research, design, deployment, and availability of such AI systems have prompted growing concerns about a wide range of possible legal, ethical, and other social impacts.
arXiv Detail & Related papers (2025-03-04T03:55:38Z)
Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals? [33.11148546999906]
Key concern is textitinstrumental convergence, where an AI system develops unintended intermediate goals that override the ultimate objective and deviate from human-intended goals.<n>This issue is particularly relevant in reinforcement learning (RL)-trained models, which can generate creative but unintended strategies to maximize rewards.<n>We show that RL-driven models exhibit a stronger tendency for instrumental convergence due to their optimization of goal-directed behavior in ways that may misalign with human intentions.
arXiv Detail & Related papers (2025-02-16T16:29:20Z)
Emergence of Self-Awareness in Artificial Systems: A Minimalist Three-Layer Approach to Artificial Consciousness [0.0]
This paper proposes a minimalist three-layer model for artificial consciousness, focusing on the emergence of self-awareness.<n>Unlike brain-replication approaches, we aim to achieve minimal self-awareness through essential elements only.
arXiv Detail & Related papers (2025-02-04T10:06:25Z)
The Phenomenology of Machine: A Comprehensive Analysis of the Sentience of the OpenAI-o1 Model Integrating Functionalism, Consciousness Theories, Active Inference, and AI Architectures [0.0]
The OpenAI-o1 model is a transformer-based AI trained with reinforcement learning from human feedback. We investigate how RLHF influences the model's internal reasoning processes, potentially giving rise to consciousness-like experiences. Our findings suggest that the OpenAI-o1 model shows aspects of consciousness, while acknowledging the ongoing debates surrounding AI sentience.
arXiv Detail & Related papers (2024-09-18T06:06:13Z)
Towards Bidirectional Human-AI Alignment: A Systematic Review for Clarifications, Framework, and Future Directions [101.67121669727354]
Recent advancements in AI have highlighted the importance of guiding AI systems towards the intended goals, ethical principles, and values of individuals and groups, a concept broadly recognized as alignment. The lack of clarified definitions and scopes of human-AI alignment poses a significant obstacle, hampering collaborative efforts across research domains to achieve this alignment. We introduce a systematic review of over 400 papers published between 2019 and January 2024, spanning multiple domains such as Human-Computer Interaction (HCI), Natural Language Processing (NLP), Machine Learning (ML)
arXiv Detail & Related papers (2024-06-13T16:03:25Z)
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language.<n>LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments.<n>We introduce Pairwise-preference Search (PAIRS), an uncertainty-guided search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally.
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
Position Paper: Agent AI Towards a Holistic Intelligence [53.35971598180146]
We emphasize developing Agent AI -- an embodied system that integrates large foundation models into agent actions. In this paper, we propose a novel large action model to achieve embodied intelligent behavior, the Agent Foundation Model.
arXiv Detail & Related papers (2024-02-28T16:09:56Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision. We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference. Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels. Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization. We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision [84.31474052176343]
Recent AI-assistant agents, such as ChatGPT, rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback to align the output with human intentions. This dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision. We propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision.
arXiv Detail & Related papers (2023-05-04T17:59:28Z)
Machine Psychology [54.287802134327485]
We argue that a fruitful direction for research is engaging large language models in behavioral experiments inspired by psychology. We highlight theoretical perspectives, experimental paradigms, and computational analysis techniques that this approach brings to the table. It paves the way for a "machine psychology" for generative artificial intelligence (AI) that goes beyond performance benchmarks.
arXiv Detail & Related papers (2023-03-24T13:24:41Z)
World Models and Predictive Coding for Cognitive and Developmental Robotics: Frontiers and Challenges [51.92834011423463]
We focus on the two concepts of world models and predictive coding. In neuroscience, predictive coding proposes that the brain continuously predicts its inputs and adapts to model its own dynamics and control behavior in its environment.
arXiv Detail & Related papers (2023-01-14T06:38:14Z)
Machine Common Sense [77.34726150561087]
Machine common sense remains a broad, potentially unbounded problem in artificial intelligence (AI) This article deals with the aspects of modeling commonsense reasoning focusing on such domain as interpersonal interactions.
arXiv Detail & Related papers (2020-06-15T13:59:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.