TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation
- URL: http://arxiv.org/abs/2510.25536v2
- Date: Thu, 30 Oct 2025 11:19:24 GMT
- Title: TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation
- Authors: Bangde Du, Minghao Guo, Songming He, Ziyi Ye, Xi Zhu, Weihang Su, Shuqi Zhu, Yujia Zhou, Yongfeng Zhang, Qingyao Ai, Yiqun Liu
- Abstract summary: Large Language Models (LLMs) are exhibiting emergent human-like abilities. TwinVoice is a benchmark for assessing persona simulation across diverse real-world contexts.
- Score: 55.55404595177229
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the underlying capability requirements. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities: opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short in capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.
Related papers
- Eval4Sim: An Evaluation Framework for Persona Simulation [9.555170143860694]
Eval4Sim is an evaluation framework that measures how closely simulated conversations align with human conversational patterns. Unlike absolute or optimization-oriented metrics, Eval4Sim uses a human conversational corpus as a reference baseline and penalizes deviations in both directions.
arXiv Detail & Related papers (2026-03-03T11:30:50Z) - HumanLLM: Towards Personalized Understanding and Simulation of Human Nature [72.55730315685837]
HumanLLM is a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences.
arXiv Detail & Related papers (2026-01-22T09:27:27Z) - HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns [59.17423586203706]
We present HUMANLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from 12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment.
arXiv Detail & Related papers (2026-01-15T08:56:53Z) - Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning [52.07170679746533]
Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics, prompt-to-line consistency, line-to-line consistency, and Q&A consistency, that capture different types of persona drift, and validate each against human annotations.
arXiv Detail & Related papers (2025-10-31T19:40:41Z) - Scaling Law in LLM Simulated Personality: More Detailed and Realistic Persona Profile Is All You Need [17.298070053011802]
This research focuses on using large language models (LLMs) to simulate social experiments, exploring their ability to emulate human personality in virtual persona role-playing. The research develops an end-to-end evaluation framework, including individual-level analysis of stability and identifiability.
arXiv Detail & Related papers (2025-10-10T05:52:07Z) - PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions [14.497181581363288]
PersonaFuse is a novel framework that enables Large Language Models to adapt and express different personalities. Tests show PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence. PersonaFuse also delivers consistent improvements in downstream human-centered applications.
arXiv Detail & Related papers (2025-09-09T03:39:28Z) - SocialEval: Evaluating Social Intelligence of Large Language Models [70.90981021629021]
Social Intelligence (SI) equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals. This presents an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation. We propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts.
arXiv Detail & Related papers (2025-06-01T08:36:51Z) - IMPersona: Evaluating Individual Level LM Impersonation [28.040025302581366]
We introduce IMPersona, a framework for evaluating LMs at impersonating specific individuals' writing style and personal knowledge. We demonstrate that even modestly sized open-source models, such as Llama-3.1-8B-Instruct, can achieve impersonation abilities at concerning levels.
arXiv Detail & Related papers (2025-04-06T02:57:58Z) - How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation [30.713599131902566]
We introduce BehaviorChain, the first benchmark for evaluating digital twins' ability to simulate continuous human behavior. BehaviorChain comprises diverse, high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors across 1,001 unique personas. Comprehensive evaluation results demonstrated that even state-of-the-art models struggle with accurately simulating continuous human behavior.
arXiv Detail & Related papers (2025-02-20T15:29:32Z) - PersLLM: A Personified Training Approach for Large Language Models [66.16513246245401]
We propose PersLLM, a framework for better data construction and model tuning. For insufficient data usage, we incorporate strategies such as Chain-of-Thought prompting and anti-induction. For rigid behavior patterns, we design the tuning process and introduce automated DPO to enhance the specificity and dynamism of the models' personalities.
arXiv Detail & Related papers (2024-07-17T08:13:22Z) - LLMs Simulate Big Five Personality Traits: Further Evidence [51.13560635563004]
We analyze the personality traits simulated by Llama2, GPT4, and Mixtral.
This contributes to the broader understanding of the capabilities of LLMs to simulate personality traits.
arXiv Detail & Related papers (2024-01-31T13:45:25Z)