Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models
- URL: http://arxiv.org/abs/2502.07077v1
- Date: Mon, 10 Feb 2025 22:09:57 GMT
- Title: Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models
- Authors: Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R. McKee, Verena Rieser, Murray Shanahan, Laura Weidinger,
- Abstract summary: The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers.
Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings.
First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours.
Second, we present a scalable, automated approach by employing simulations of user interactions.
Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions.
- Score: 26.333097337393685
- License:
- Abstract: The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.
Related papers
- How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation [30.713599131902566]
We introduce BehaviorChain, the first benchmark for evaluating digital twins' ability to simulate continuous human behavior.
BehaviorChain comprises diverse, high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors across 1,001 unique personas.
Comprehensive evaluation results demonstrated that even state-of-the-art models struggle with accurately simulating continuous human behavior.
arXiv Detail & Related papers (2025-02-20T15:29:32Z) - Thinking beyond the anthropomorphic paradigm benefits LLM research [1.7392902719515677]
We analyze hundreds of thousands of computer science research articles from the past decade.
We present empirical evidence of the prevalence and growth of anthropomorphic terminology in research on large language models (LLMs)
We argue these conceptualizations may be limiting, and that challenging them opens up new pathways for understanding and improving LLMs beyond human analogies.
arXiv Detail & Related papers (2025-02-13T11:32:09Z) - From Individual to Society: A Survey on Social Simulation Driven by Large Language Model-based Agents [47.935533238820334]
Traditional sociological research often relies on human participation, which, though effective, is expensive, challenging to scale, and with ethical concerns.
Recent advancements in large language models (LLMs) highlight their potential to simulate human behavior, enabling the replication of individual responses and facilitating studies on many interdisciplinary studies.
We categorize the simulations into three types: (1) Individual Simulation, which mimics specific individuals or demographic groups; (2) Scenario Simulation, where multiple agents collaborate to achieve goals within specific contexts; and (3) Simulation Society, which models interactions within agent societies to reflect the complexity and variety of real-world dynamics.
arXiv Detail & Related papers (2024-12-04T18:56:37Z) - PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
We propose PersLLM, integrating psychology-grounded principles of personality: social practice, consistency, and dynamic development.
We incorporate personality traits directly into the model parameters, enhancing the model's resistance to induction, promoting consistency, and supporting the dynamic evolution of personality.
arXiv Detail & Related papers (2024-07-17T08:13:22Z) - Virtual Personas for Language Models via an Anthology of Backstories [5.2112564466740245]
"Anthology" is a method for conditioning large language models to particular virtual personas by harnessing open-ended life narratives.
We show that our methodology enhances the consistency and reliability of experimental outcomes while ensuring better representation of diverse sub-populations.
arXiv Detail & Related papers (2024-07-09T06:11:18Z) - Human Simulacra: Benchmarking the Personification of Large Language Models [38.21708264569801]
Large language models (LLMs) are recognized as systems that closely mimic aspects of human intelligence.
This paper introduces a framework for constructing virtual characters' life stories from the ground up.
Experimental results demonstrate that our constructed simulacra can produce personified responses that align with their target characters.
arXiv Detail & Related papers (2024-02-28T09:11:14Z) - User Behavior Simulation with Large Language Model based Agents [116.74368915420065]
We propose an LLM-based agent framework and design a sandbox environment to simulate real user behaviors.
Based on extensive experiments, we find that the simulated behaviors of our method are very close to the ones of real humans.
arXiv Detail & Related papers (2023-06-05T02:58:35Z) - Machine Psychology [54.287802134327485]
We argue that a fruitful direction for research is engaging large language models in behavioral experiments inspired by psychology.
We highlight theoretical perspectives, experimental paradigms, and computational analysis techniques that this approach brings to the table.
It paves the way for a "machine psychology" for generative artificial intelligence (AI) that goes beyond performance benchmarks.
arXiv Detail & Related papers (2023-03-24T13:24:41Z) - Evaluating Human-Language Model Interaction [79.33022878034627]
We develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems.
We design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation.
We find that better non-interactive performance does not always translate to better human-LM interaction.
arXiv Detail & Related papers (2022-12-19T18:59:45Z) - Human Trajectory Forecasting in Crowds: A Deep Learning Perspective [89.4600982169]
We present an in-depth analysis of existing deep learning-based methods for modelling social interactions.
We propose two knowledge-based data-driven methods to effectively capture these social interactions.
We develop a large scale interaction-centric benchmark TrajNet++, a significant yet missing component in the field of human trajectory forecasting.
arXiv Detail & Related papers (2020-07-07T17:19:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.