RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing
- URL: http://arxiv.org/abs/2507.20352v1
- Date: Sun, 27 Jul 2025 16:49:47 GMT
- Title: RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing
- Authors: Hao Xiang, Tianyi Tang, Yang Su, Bowen Yu, An Yang, Fei Huang, Yichang Zhang, Yaojie Lu, Hongyu Lin, Xianpei Han, Jingren Zhou, Junyang Lin, Le Sun
- Abstract summary: RMTBench is a comprehensive **user-centric** bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. By shifting focus from character background to user intention fulfillment, RMTBench bridges the gap between academic evaluation and practical deployment requirements.
- Score: 111.06936588273868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Large Language Models (LLMs) have shown outstanding potential for role-playing applications. Evaluating these capabilities is becoming crucial yet remains challenging. Existing benchmarks mostly adopt a **character-centric** approach, simplifying user-character interactions into isolated Q&A tasks that fail to reflect real-world applications. To address this limitation, we introduce RMTBench, a comprehensive **user-centric** bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. RMTBench includes custom characters with detailed backgrounds and abstract characters defined by simple traits, enabling evaluation across various user scenarios. Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. Furthermore, we construct an authentic multi-turn dialogue simulation mechanism. With carefully selected evaluation dimensions and LLM-based scoring, this mechanism captures the complex intentions of conversations between the user and the character. By shifting focus from character background to user intention fulfillment, RMTBench bridges the gap between academic evaluation and practical deployment requirements, offering a more effective framework for assessing role-playing capabilities in LLMs. All code and datasets will be released soon.
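The abstract names LLM-based scoring over carefully selected evaluation dimensions but does not spell out the mechanism. Below is a minimal sketch of what dimension-wise LLM-judge scoring of a multi-turn role-play dialogue can look like; the dimension names, prompt wording, and the `chat` helper are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of dimension-wise LLM-judge scoring for a multi-turn
# role-play dialogue. Dimensions and prompts are illustrative, not RMTBench's.
from dataclasses import dataclass

DIMENSIONS = [
    "user-intention fulfillment",
    "character consistency",
    "conversational coherence",
]

@dataclass
class Turn:
    user: str
    character: str

def score_dialogue(chat, character_profile: str, user_motivation: str,
                   turns: list[Turn]) -> dict[str, int]:
    """Ask a judge LLM (wrapped by `chat: str -> str`) to rate the dialogue
    on each dimension from 1 to 5."""
    transcript = "\n".join(
        f"User: {t.user}\nCharacter: {t.character}" for t in turns
    )
    scores = {}
    for dim in DIMENSIONS:
        prompt = (
            f"Character profile: {character_profile}\n"
            f"User motivation: {user_motivation}\n"
            f"Dialogue:\n{transcript}\n\n"
            f"Rate the character's replies on '{dim}' from 1 (poor) to 5 "
            "(excellent). Answer with a single digit."
        )
        reply = chat(prompt)
        scores[dim] = int(reply.strip()[0])  # sketch-level parsing
    return scores
```

A real harness would parse judge outputs more robustly and aggregate across dialogues and characters; the point here is only the per-dimension judge loop.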
Related papers
- Test-Time-Matching: Decouple Personality, Memory, and Linguistic Style in LLM-based Role-Playing Language Agent [18.67432557362308]
Test-Time-Matching (TTM) is a training-free role-playing framework based on test-time scaling and context engineering. Our framework involves a structured, three-stage generation pipeline that utilizes these features for controlled role-playing. It achieves high-fidelity role-playing performance while enabling seamless combinations across diverse linguistic styles, and even variations in personality and memory.
arXiv Detail & Related papers (2025-07-22T17:47:44Z) - MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation [49.12071445991853]
Large Language Models (**LLMs**) have been widely adopted in real-world dialogue applications. MARS-Bench is constructed from play-by-play text commentary so as to feature realistic dialogues. Experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives.
arXiv Detail & Related papers (2025-05-27T10:28:04Z) - A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations [112.81207927088117]
PersonaConvBench is a benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements.
arXiv Detail & Related papers (2025-05-20T09:13:22Z) - Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles [37.43150003866563]
We introduce the User Simulator with Implicit Profiles (USP), a framework that infers implicit user profiles from human-machine interactions to simulate personalized and realistic dialogues. USP outperforms strong baselines in terms of authenticity and diversity while maintaining comparable consistency.
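The summary describes USP only at a high level. As a rough, hypothetical sketch, profile-conditioned user simulation splits into an inference step and a generation step; the function names and prompt text below are assumptions, not the authors' code.

```python
# Illustrative two-step user simulation: infer an implicit profile from past
# human-machine interactions, then condition the simulated user's next turn
# on it. `chat: str -> str` is any LLM completion function.
def infer_profile(chat, interaction_log: str) -> str:
    """Distill an implicit user profile (tone, interests, goals) from a log."""
    return chat(
        "From the following human-machine interactions, summarize the user's "
        f"implicit profile (tone, interests, goals):\n{interaction_log}"
    )

def simulate_user_turn(chat, profile: str, dialogue_so_far: str) -> str:
    """Generate the next user utterance, staying consistent with the profile."""
    return chat(
        f"You are simulating a user with this profile:\n{profile}\n"
        f"Dialogue so far:\n{dialogue_so_far}\n"
        "Write the user's next message, staying consistent with the profile."
    )
```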
arXiv Detail & Related papers (2025-02-26T09:26:54Z) - CoSER: Coordinating LLM-Based Persona Simulation of Established Roles [62.886267684392635]
The CoSER dataset covers 17,966 characters from 771 renowned books. We develop CoSER 8B and CoSER 70B, advanced open role-playing LLMs built on the LLaMA-3.1 models.
arXiv Detail & Related papers (2025-02-13T08:55:24Z) - CharacterBench: Benchmarking Character Customization of Large Language Models [80.29164862682063]
We propose CharacterBench, the largest bilingual generative benchmark, with 22,859 human-annotated samples covering 3,956 characters. We define 11 dimensions across 6 aspects, classified as sparse or dense depending on whether the character features a given dimension evaluates manifest in every response. We also develop the CharacterJudge model for cost-effective and stable evaluations.
arXiv Detail & Related papers (2024-12-16T15:55:34Z) - FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback [33.532239489610056]
FB-Bench is a benchmark designed to evaluate Large Language Models' responsiveness to human feedback under real-world usage scenarios in Chinese. We extensively evaluate a broad array of popular LLMs, revealing significant variations in their performance across different interaction scenarios. Our findings underscore both the strengths and limitations of current models, providing valuable insights and directions for future research.
arXiv Detail & Related papers (2024-10-12T07:40:01Z) - Role-playing Prompt Framework: Generation and Evaluation [3.2845546753303867]
Large language models (LLMs) exhibit impressive proficiency in natural language generation, understanding user instructions, and emulating human-like language use. This paper introduces a prompt-based framework designed to leverage GPT's capabilities for the generation of role-playing dialogue datasets.
arXiv Detail & Related papers (2024-06-02T06:09:56Z) - DuetSim: Building User Simulator with Dual Large Language Models for Task-Oriented Dialogues [7.765092134290888]
This paper introduces DuetSim, a novel framework designed to address the intricate demands of task-oriented dialogues by leveraging large language models.
DuetSim stands apart from conventional approaches by employing two LLMs in tandem: one dedicated to response generation and the other focused on verification.
We validate the efficacy of our method through extensive experiments conducted on the MultiWOZ dataset, highlighting improvements in response quality and correctness.
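DuetSim's generate-then-verify tandem can be sketched as a simple retry loop. This is a hedged illustration assuming generic `generator` and `verifier` completion functions; the paper's actual prompts, retry policy, and MultiWOZ integration will differ.

```python
# Minimal generator/verifier tandem in the spirit of DuetSim's two-LLM design.
def duet_response(generator, verifier, dialogue_state: str,
                  max_retries: int = 2) -> str:
    """One LLM drafts a simulated-user response; a second LLM checks it."""
    draft = generator(
        f"Given the task-oriented dialogue state:\n{dialogue_state}\n"
        "Produce the simulated user's next utterance."
    )
    for _ in range(max_retries):
        verdict = verifier(
            f"Dialogue state:\n{dialogue_state}\n"
            f"Candidate utterance: {draft}\n"
            "Is this utterance consistent with the user's goal? Answer YES or "
            "NO, and if NO, explain the problem."
        )
        if verdict.strip().upper().startswith("YES"):
            return draft
        # Regenerate with the verifier's feedback folded into the prompt.
        draft = generator(
            f"Dialogue state:\n{dialogue_state}\n"
            f"Previous draft was rejected because: {verdict}\n"
            "Produce a corrected utterance."
        )
    return draft
```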
arXiv Detail & Related papers (2024-05-16T06:24:31Z) - Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs [59.74002011562726]
We propose a novel linguistic cue-based chain-of-thought approach (*Cue-CoT*) to provide more personalized and engaging responses.
We build a benchmark with in-depth dialogue questions, consisting of 6 datasets in both Chinese and English.
Empirical results demonstrate that our proposed *Cue-CoT* method outperforms standard prompting methods in terms of both *helpfulness* and *acceptability* on all datasets.
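The cue-based chain-of-thought idea can be approximated with two chained prompts: first elicit the user's linguistic cues, then condition the answer on them. The sketch below assumes a generic `chat` completion function and illustrative prompt text, not the paper's actual templates.

```python
# Two-stage prompting in the spirit of Cue-CoT: first extract linguistic cues
# about the user's status, persona, and emotion; then condition the reply.
def cue_cot_reply(chat, dialogue_history: str, user_question: str) -> str:
    cues = chat(
        f"Dialogue history:\n{dialogue_history}\n"
        "List the linguistic cues about the user's emotional state, persona, "
        "and underlying needs that a good reply should take into account."
    )
    return chat(
        f"Dialogue history:\n{dialogue_history}\n"
        f"Inferred cues about the user:\n{cues}\n"
        f"User question: {user_question}\n"
        "Write a personalized, engaging response that addresses these cues."
    )
```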
arXiv Detail & Related papers (2023-05-19T16:27:43Z)