Related papers: Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

URL: http://arxiv.org/abs/2502.07077v1
Date: Mon, 10 Feb 2025 22:09:57 GMT
Title: Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models
Authors: Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R. McKee, Verena Rieser, Murray Shanahan, Laura Weidinger,
Abstract summary: The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers.<n>Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings.<n>First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours.<n>Second, we present a scalable, automated approach by employing simulations of user interactions.<n>Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions.
Score: 26.333097337393685
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.

Related papers

Higher-Order Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions [4.234771450043289]
Large language models (LLMs) are increasingly capable of simulating human behavior. We propose a novel methodology for constructing virtual personas with synthetic user backstories" generated as extended, multi-turn interview transcripts. Our generated backstories are longer, rich in detail, and consistent in authentically describing a singular individual, compared to previous methods.
arXiv Detail & Related papers (2025-04-16T00:10:34Z)
SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users [70.02370111025617]
We introduce SocioVerse, an agent-driven world model for social simulation. Our framework features four powerful alignment components and a user pool of 10 million real individuals. Results demonstrate that SocioVerse can reflect large-scale population dynamics while ensuring diversity, credibility, and representativeness.
arXiv Detail & Related papers (2025-04-14T12:12:52Z)
If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs [55.8331366739144]
We introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in large language models (LLMs) Our fact checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches.
arXiv Detail & Related papers (2025-03-30T16:50:57Z)
How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation [30.713599131902566]
We introduce BehaviorChain, the first benchmark for evaluating digital twins' ability to simulate continuous human behavior. BehaviorChain comprises diverse, high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors across 1,001 unique personas. Comprehensive evaluation results demonstrated that even state-of-the-art models struggle with accurately simulating continuous human behavior.
arXiv Detail & Related papers (2025-02-20T15:29:32Z)
From Individual to Society: A Survey on Social Simulation Driven by Large Language Model-based Agents [47.935533238820334]
Traditional sociological research often relies on human participation, which, though effective, is expensive, challenging to scale, and with ethical concerns. Recent advancements in large language models (LLMs) highlight their potential to simulate human behavior, enabling the replication of individual responses and facilitating studies on many interdisciplinary studies. We categorize the simulations into three types: (1) Individual Simulation, which mimics specific individuals or demographic groups; (2) Scenario Simulation, where multiple agents collaborate to achieve goals within specific contexts; and (3) Simulation Society, which models interactions within agent societies to reflect the complexity and variety of real-world dynamics.
arXiv Detail & Related papers (2024-12-04T18:56:37Z)
PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
We propose PersLLM, integrating psychology-grounded principles of personality: social practice, consistency, and dynamic development. We incorporate personality traits directly into the model parameters, enhancing the model's resistance to induction, promoting consistency, and supporting the dynamic evolution of personality.
arXiv Detail & Related papers (2024-07-17T08:13:22Z)
Virtual Personas for Language Models via an Anthology of Backstories [5.2112564466740245]
"Anthology" is a method for conditioning large language models to particular virtual personas by harnessing open-ended life narratives. We show that our methodology enhances the consistency and reliability of experimental outcomes while ensuring better representation of diverse sub-populations.
arXiv Detail & Related papers (2024-07-09T06:11:18Z)
Human Simulacra: Benchmarking the Personification of Large Language Models [38.21708264569801]
Large language models (LLMs) are recognized as systems that closely mimic aspects of human intelligence. This paper introduces a framework for constructing virtual characters' life stories from the ground up. Experimental results demonstrate that our constructed simulacra can produce personified responses that align with their target characters.
arXiv Detail & Related papers (2024-02-28T09:11:14Z)
User Behavior Simulation with Large Language Model based Agents [116.74368915420065]
We propose an LLM-based agent framework and design a sandbox environment to simulate real user behaviors. Based on extensive experiments, we find that the simulated behaviors of our method are very close to the ones of real humans.
arXiv Detail & Related papers (2023-06-05T02:58:35Z)
Predicting the long-term collective behaviour of fish pairs with deep learning [52.83927369492564]
This study introduces a deep learning model to assess social interactions in the fish species Hemigrammus rhodostomus. We compare the results of our deep learning approach to experiments and to the results of a state-of-the-art analytical model. We demonstrate that machine learning models social interactions can directly compete with their analytical counterparts in subtle experimental observables.
arXiv Detail & Related papers (2023-02-14T05:25:03Z)
Human Trajectory Forecasting in Crowds: A Deep Learning Perspective [89.4600982169]
We present an in-depth analysis of existing deep learning-based methods for modelling social interactions. We propose two knowledge-based data-driven methods to effectively capture these social interactions. We develop a large scale interaction-centric benchmark TrajNet++, a significant yet missing component in the field of human trajectory forecasting.
arXiv Detail & Related papers (2020-07-07T17:19:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.