Whose Personae? Synthetic Persona Experiments in LLM Research and Pathways to Transparency
- URL: http://arxiv.org/abs/2512.00461v1
- Date: Sat, 29 Nov 2025 12:27:34 GMT
- Authors: Jan Batzner, Volker Stocker, Bingjun Tang, Anusha Natarajan, Qinhao Chen, Stefan Schmid, Gjergji Kasneci
- Abstract summary: Review of 63 studies published between 2023 and 2025 in leading NLP and AI venues. We show that task and population of interest are often underspecified in persona-based experiments. We introduce a persona transparency checklist that emphasizes representative sampling, explicit grounding in empirical data, and enhanced ecological validity.
- Score: 12.62715115816099
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic personae experiments have become a prominent method in Large Language Model alignment research, yet the representativeness and ecological validity of these personae vary considerably between studies. Through a review of 63 peer-reviewed studies published between 2023 and 2025 in leading NLP and AI venues, we reveal a critical gap: task and population of interest are often underspecified in persona-based experiments, despite personalization being fundamentally dependent on these criteria. Our analysis shows substantial differences in user representation, with most studies focusing on limited sociodemographic attributes and only 35% discussing the representativeness of their LLM personae. Based on our findings, we introduce a persona transparency checklist that emphasizes representative sampling, explicit grounding in empirical data, and enhanced ecological validity. Our work provides both a comprehensive assessment of current practices and practical guidelines to improve the rigor and ecological validity of persona-based evaluations in language model alignment research.
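The transparency checklist described in the abstract can be pictured as a minimal, structured persona record. The sketch below is purely illustrative (the `PersonaSpec` class and its field names are assumptions, not the paper's actual checklist items); it shows how an experiment might declare the population of interest, task, sampling method, and empirical grounding so that underspecified personae are easy to flag.

```python
from dataclasses import dataclass, field

@dataclass
class PersonaSpec:
    """Hypothetical record of the transparency fields a checklist might require."""
    population_of_interest: str           # who the persona is meant to represent
    task: str                             # downstream task the persona is evaluated on
    attributes: dict = field(default_factory=dict)  # sociodemographic attributes used
    sampling_method: str = "unspecified"  # e.g. "stratified from census marginals"
    empirical_source: str = "unspecified" # survey or dataset grounding the attributes

    def is_transparent(self) -> bool:
        """A persona report counts as 'transparent' here only if nothing is left unspecified."""
        return "unspecified" not in (self.sampling_method, self.empirical_source)

spec = PersonaSpec(
    population_of_interest="US adults",
    task="opinion survey simulation",
    attributes={"age": "30-44", "education": "college"},
    sampling_method="stratified from ACS marginals",
    empirical_source="American Community Survey 2022",
)
print(spec.is_transparent())  # True
```

In this toy version, a persona report passes only when its sampling method and empirical source are explicitly filled in, mirroring the abstract's emphasis on representative sampling and grounding in empirical data.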
Related papers
- Towards Personalized Deep Research: Benchmarks and Evaluations [56.581105664044436]
We introduce Personalized Deep Research Bench, the first benchmark for evaluating personalization in Deep Research Agents (DRAs). It pairs 50 diverse research tasks with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research.
arXiv Detail & Related papers (2025-09-29T17:39:17Z) - Applying Psychometrics to Large Language Model Simulated Populations: Recreating the HEXACO Personality Inventory Experiment with Generative Agents [0.0]
Generative agents demonstrate human-like characteristics through sophisticated natural language interactions. Their ability to assume roles and personalities based on predefined character biographies has positioned them as cost-effective substitutes for human participants in social science research. This paper explores the validity of such persona-based agents in representing human populations.
arXiv Detail & Related papers (2025-08-01T16:16:16Z) - Using AI to replicate human experimental results: a motion study [0.11838866556981258]
This paper explores the potential of large language models (LLMs) as reliable analytical tools in linguistic research. It focuses on the emergence of affective meanings in temporal expressions involving manner-of-motion verbs.
arXiv Detail & Related papers (2025-07-14T14:47:01Z) - What Makes a Good Natural Language Prompt? [72.3282960118995]
We conduct a meta-analysis surveying more than 150 prompting-related papers from leading NLP and AI conferences from 2022 to 2025. We propose a property- and human-centric framework for evaluating prompt quality, encompassing 21 properties categorized into six dimensions. We then empirically explore multi-property prompt enhancements in reasoning tasks, observing that single-property enhancements often have the greatest impact.
arXiv Detail & Related papers (2025-06-07T23:19:27Z) - Human Preferences in Large Language Model Latent Space: A Technical Analysis on the Reliability of Synthetic Data in Voting Outcome Prediction [5.774786149181393]
We analyze how demographic attributes and prompt variations influence latent opinion mappings in large language models (LLMs). We find that LLM-generated data fails to replicate the variance observed in real-world human responses. In the political space, persona-to-party mappings exhibit limited differentiation, resulting in synthetic data that lacks the nuanced distribution of opinions found in survey data.
arXiv Detail & Related papers (2025-02-22T16:25:33Z) - PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
We propose PersLLM, a framework for better data construction and model tuning. For insufficient data usage, we incorporate strategies such as Chain-of-Thought prompting and anti-induction. For rigid behavior patterns, we design the tuning process and introduce automated DPO to enhance the specificity and dynamism of the models' personalities.
arXiv Detail & Related papers (2024-07-17T08:13:22Z) - Virtual Personas for Language Models via an Anthology of Backstories [5.2112564466740245]
"Anthology" is a method for conditioning large language models to particular virtual personas by harnessing open-ended life narratives.
We show that our methodology enhances the consistency and reliability of experimental outcomes while ensuring better representation of diverse sub-populations.
arXiv Detail & Related papers (2024-07-09T06:11:18Z) - Evaluating Large Language Models with Psychometrics [59.821829073478376]
This paper offers a comprehensive benchmark for quantifying psychological constructs of Large Language Models (LLMs). Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors.
arXiv Detail & Related papers (2024-06-25T16:09:08Z) - ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z) - Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting [64.80538055623842]
Sociodemographic prompting is a technique that steers the output of prompt-based models toward answers that humans with specific sociodemographic profiles would give.
We show that sociodemographic information affects model predictions and can be beneficial for improving zero-shot learning in subjective NLP tasks.
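As a rough illustration of how sociodemographic prompting is typically operationalized, the sketch below builds a persona-conditioned prompt from a profile dictionary. The function name, prompt wording, and profile fields are hypothetical, not taken from the paper.

```python
def sociodemographic_prompt(profile: dict, question: str) -> str:
    """Prefix a task question with a sociodemographic persona description.
    Profile keys and the template wording are illustrative placeholders."""
    desc = ", ".join(f"{k.replace('_', ' ')}: {v}" for k, v in profile.items())
    return (
        f"Imagine you are a person with the following profile ({desc}). "
        f"Answer as that person would.\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = sociodemographic_prompt(
    {"age": 35, "gender": "female", "region": "rural Midwest"},
    "Do you agree that remote work should remain common? (agree/disagree)",
)
print(prompt)
```

Any dictionary of attributes can be substituted; the key design point is that only the persona prefix, not the question, varies across experimental conditions.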
arXiv Detail & Related papers (2023-09-13T15:42:06Z) - Framework-Based Qualitative Analysis of Free Responses of Large Language Models: Algorithmic Fidelity [1.7947441434255664]
Large-scale generative Language Models (LLMs) can simulate free responses to interview questions like those traditionally analyzed using qualitative research methods.
Here we consider whether artificial "silicon participants" generated by LLMs may be productively studied using qualitative methods.
arXiv Detail & Related papers (2023-09-06T15:00:44Z) - Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
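Psychometric inventories such as the Big Five are scored by averaging Likert responses after flipping reverse-keyed items, and the same scoring applies whether the respondent is a human or an LLM persona. A minimal sketch (the 1-5 scale and reverse-keying convention are standard; the function names are my own):

```python
LIKERT_MAX = 5  # standard 1-5 agreement scale

def score_item(response: int, reverse_keyed: bool) -> int:
    """Flip reverse-keyed items so that higher always means 'more of the trait'."""
    if not 1 <= response <= LIKERT_MAX:
        raise ValueError("Likert response must be between 1 and 5")
    return (LIKERT_MAX + 1 - response) if reverse_keyed else response

def trait_score(responses) -> float:
    """Mean trait score over (response, reverse_keyed) pairs for one inventory scale."""
    scored = [score_item(r, rev) for r, rev in responses]
    return sum(scored) / len(scored)

# Three items, one reverse-keyed: (5 + 4 + 4) / 3
print(trait_score([(5, False), (2, True), (4, False)]))  # 4.333...
```

Consistency checks like the one in the summary above amount to comparing such trait scores across repeated administrations of the same inventory.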
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.