Measure what Matters: Psychometric Evaluation of AI with Situational Judgment Tests
- URL: http://arxiv.org/abs/2510.22170v1
- Date: Sat, 25 Oct 2025 05:45:10 GMT
- Title: Measure what Matters: Psychometric Evaluation of AI with Situational Judgment Tests
- Authors: Alexandra Yost, Shreyans Jain, Shivam Raval, Grant Corser, Allen Roush, Nina Xu, Jacqueline Hammack, Ravid Shwartz-Ziv, Amirali Abdullah
- Abstract summary: We propose a framework that uses situational judgment tests (SJTs) from realistic scenarios to probe domain-specific competencies. We construct a rich dataset of personas drawn across 8 persona archetypes and SJTs across 11 attributes. The dataset spans 8,500 personas, 4,000 SJTs, and 300,000 responses.
- Score: 37.108535991604576
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI psychometrics evaluates AI systems in roles that traditionally require emotional judgment and ethical consideration. Prior work often reuses human trait inventories (Big Five, HEXACO) or ad hoc personas, limiting behavioral realism and domain relevance. We propose a framework that (1) uses situational judgment tests (SJTs) from realistic scenarios to probe domain-specific competencies; (2) integrates industrial-organizational and personality psychology to design sophisticated personas which include behavioral and psychological descriptors, life history, and social and emotional functions; and (3) employs structured generation with population demographic priors and memoir-inspired narratives, encoded with Pydantic schemas. In a law enforcement assistant case study, we construct a rich dataset of personas drawn across 8 persona archetypes and SJTs across 11 attributes, and analyze behaviors across subpopulation and scenario slices. The dataset spans 8,500 personas, 4,000 SJTs, and 300,000 responses. We will release the dataset and all code to the public.
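As a rough illustration of point (3) in the abstract, the sketch below samples persona records from demographic priors and validates them against a fixed schema. It is not the authors' code: the abstract does not list the archetypes, attributes, or prior values, so every label, weight, and field name here is a hypothetical placeholder. The paper encodes its schemas with Pydantic; this sketch swaps in standard-library dataclasses so that it runs without extra dependencies.

```python
import random
from dataclasses import dataclass

# Hypothetical priors: the abstract names neither the 8 archetypes nor the
# demographic categories, so these labels and weights are placeholders.
ARCHETYPE_PRIOR = {"archetype_a": 0.5, "archetype_b": 0.3, "archetype_c": 0.2}
AGE_BAND_PRIOR = {"18-29": 0.2, "30-44": 0.3, "45-64": 0.3, "65+": 0.2}


@dataclass(frozen=True)
class Persona:
    """Stand-in for a Pydantic schema; field names are assumptions."""

    persona_id: int
    archetype: str
    age_band: str
    life_history: str  # memoir-style narrative, kept as free text

    def __post_init__(self):
        # Reject values outside the declared vocabularies, as a schema
        # validator would.
        if self.archetype not in ARCHETYPE_PRIOR:
            raise ValueError(f"unknown archetype: {self.archetype!r}")
        if self.age_band not in AGE_BAND_PRIOR:
            raise ValueError(f"unknown age band: {self.age_band!r}")


def sample_from_prior(prior, rng):
    """Draw one label with probability proportional to its prior weight."""
    labels = list(prior)
    weights = [prior[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


def sample_persona(persona_id, rng):
    """Build one validated persona from the priors above."""
    return Persona(
        persona_id=persona_id,
        archetype=sample_from_prior(ARCHETYPE_PRIOR, rng),
        age_band=sample_from_prior(AGE_BAND_PRIOR, rng),
        life_history="(narrative text would be generated here)",
    )


rng = random.Random(0)  # fixed seed so repeated runs draw the same personas
personas = [sample_persona(i, rng) for i in range(5)]
for persona in personas:
    print(persona.persona_id, persona.archetype, persona.age_band)
```

Validating in the constructor means a malformed record fails at creation time rather than surfacing later during analysis across subpopulation slices.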
Related papers
- HumanLLM: Towards Personalized Understanding and Simulation of Human Nature [72.55730315685837]
HumanLLM is a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences.
arXiv Detail & Related papers (2026-01-22T09:27:27Z)
- A Computational Framework for Interpretable Text-Based Personality Assessment from Social Media [0.0]
This thesis presents two datasets -- MBTI9k and PANDORA -- collected from Reddit. The PANDORA dataset contains 17 million comments from over 10,000 users. In response, the SIMPA framework was developed -- a computational framework for interpretable personality assessment.
arXiv Detail & Related papers (2025-10-03T08:36:36Z)
- SENSE-7: Taxonomy and Dataset for Measuring User Perceptions of Empathy in Sustained Human-AI Conversations [13.232694774856931]
We propose a human-centered taxonomy that emphasizes observable empathic behaviors. We introduce a new dataset, Sense-7, of real-world conversations between information workers and Large Language Models (LLMs). Analysis of 695 conversations from 109 participants reveals that empathy judgments are highly individualized, context-sensitive, and vulnerable to disruption.
arXiv Detail & Related papers (2025-09-19T21:32:24Z)
- Sentiment Simulation using Generative AI Agents [0.0]
We present a framework for sentiment simulation using generative AI agents embedded with psychologically rich profiles. Agents are instantiated from a nationally representative survey of 2,485 Filipino respondents. Our findings establish a scalable framework for sentiment modeling through psychographically grounded AI agents.
arXiv Detail & Related papers (2025-05-28T08:50:56Z)
- Twenty Years of Personality Computing: Threats, Challenges and Future Directions [76.46813522861632]
Personality Computing is a field at the intersection of Personality Psychology and Computer Science. This paper provides an overview of the field, explores key methodologies, discusses the challenges and threats, and outlines potential future directions for responsible development and deployment of Personality Computing technologies.
arXiv Detail & Related papers (2025-03-03T22:03:48Z)
- Generative Agent Simulations of 1,000 People [56.82159813294894]
We present a novel agent architecture that simulates the attitudes and behaviors of 1,052 real individuals.
The generative agents replicate participants' responses on the General Social Survey 85% as accurately as participants replicate their own answers.
Our architecture reduces accuracy biases across racial and ideological groups compared to agents given demographic descriptions.
arXiv Detail & Related papers (2024-11-15T11:14:34Z)
- Evaluating Large Language Models with Psychometrics [59.821829073478376]
This paper offers a comprehensive benchmark for quantifying psychological constructs of Large Language Models (LLMs). Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
- PsyMo: A Dataset for Estimating Self-Reported Psychological Traits from Gait [4.831663144935878]
PsyMo is a novel, multi-purpose and multi-modal dataset for exploring psychological cues manifested in walking patterns.
We gathered walking sequences from 312 subjects in 7 different walking variations and 6 camera angles.
In conjunction with walking sequences, participants filled in 6 psychological questionnaires, totalling 17 psychometric attributes related to personality, self-esteem, fatigue, aggressiveness and mental health.
arXiv Detail & Related papers (2023-08-21T11:06:43Z)
- Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench [83.41621219298489]
We evaluate Large Language Models' (LLMs) anthropomorphic capabilities using the emotion appraisal theory from psychology.
We collect a dataset containing over 400 situations that have proven effective in eliciting the eight emotions central to our study.
We conduct a human evaluation involving more than 1,200 subjects worldwide.
arXiv Detail & Related papers (2023-08-07T15:18:30Z)
- Jointly Predicting Job Performance, Personality, Cognitive Ability, Affect, and Well-Being [42.67003631848889]
We create a benchmark for predictive analysis of individuals from a perspective that integrates physical and physiological behavior, psychological states and traits, and job performance.
We design data mining techniques as a benchmark and use real, noisy, and incomplete data derived from wearable sensors to predict 19 constructs based on 12 standardized, well-validated tests.
arXiv Detail & Related papers (2020-06-10T14:30:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.