PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
- URL: http://arxiv.org/abs/2409.06820v3
- Date: Sun, 09 Feb 2025 20:54:10 GMT
- Title: PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
- Authors: Ilya Gusev
- Abstract summary: We introduce a benchmark for evaluating the role-playing capabilities of language models.
We leverage different language models to simulate users in dynamic, multi-turn conversations and assess the resulting dialogues.
We evaluated more than 40 models in both English and Russian, with each model participating in 64 conversations with 8 characters and 8 situations.
- Abstract: We introduce a benchmark for evaluating the role-playing capabilities of language models. Our approach leverages different language models to simulate users in dynamic, multi-turn conversations and assess the resulting dialogues. Our methodology involves three main components: a player model that adopts a specific character role, an interrogator model that simulates user behavior in a specific situation, and a judge model ensemble that evaluates conversation quality with 3 metrics: character consistency, entertainment value, and language fluency. We evaluated more than 40 models in both English and Russian, with each model participating in 64 conversations with 8 characters and 8 situations. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of different model capabilities in interactive scenarios.
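For concreteness, the following is a minimal Python sketch of the three-component loop the abstract describes: an interrogator model emulates a user in a given situation, a player model answers in character, and an ensemble of judge models scores the transcript on the three metrics. The chat() backend, the prompts, the role-flipping helper, and the 1-10 rating scale are all illustrative assumptions, not the paper's actual harness.

```python
# Sketch of a PingPong-style evaluation loop. chat() stands in for any
# chat-completion backend; all prompts and names here are assumptions.
from statistics import mean

METRICS = ["character_consistency", "entertainment", "fluency"]

def chat(model: str, messages: list[dict]) -> str:
    """Placeholder: call your chat-completion API of choice here."""
    raise NotImplementedError

def flip(history: list[dict]) -> list[dict]:
    """From the interrogator's viewpoint, the player's lines are the 'user'."""
    swap = {"user": "assistant", "assistant": "user"}
    return [{"role": swap[m["role"]], "content": m["content"]} for m in history]

def run_conversation(player: str, interrogator: str, character: str,
                     situation: str, turns: int = 4) -> list[dict]:
    """Interrogator emulates a user in `situation`; player stays in `character`."""
    history: list[dict] = []
    for _ in range(turns):
        user_msg = chat(interrogator, [
            {"role": "system",
             "content": f"You are a user in this situation: {situation}"},
            *flip(history),
        ])
        history.append({"role": "user", "content": user_msg})
        reply = chat(player, [
            {"role": "system", "content": f"Stay in character: {character}"},
            *history,
        ])
        history.append({"role": "assistant", "content": reply})
    return history

def judge_ensemble(judges: list[str], dialogue: list[dict]) -> dict[str, float]:
    """Each judge rates each metric; ensemble scores are plain averages."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in dialogue)
    scores: dict[str, list[float]] = {metric: [] for metric in METRICS}
    for judge in judges:
        for metric in METRICS:
            raw = chat(judge, [{"role": "user", "content":
                f"Rate this dialogue from 1 to 10 on {metric}:\n{transcript}"}])
            scores[metric].append(float(raw.strip()))
    return {metric: mean(vals) for metric, vals in scores.items()}
```

Per the abstract, the full benchmark would run such a loop 64 times per model (8 characters x 8 situations) and aggregate the judge scores per metric.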
Related papers
- DevBench: A multimodal developmental benchmark for language learning [0.34129029452670606]
We introduce DevBench, a benchmark for evaluating vision-language models on a set of language tasks with corresponding human behavioral data.
We show that DevBench provides a benchmark for comparing models to human language development.
These comparisons highlight ways in which model and human language learning processes diverge.
arXiv Detail & Related papers (2024-06-14T17:49:41Z)
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation [20.171574438536673]
We introduce a new protocol to measure the degree to which language models can accurately emulate human behavior in conversational recommendation.
We demonstrate these tasks effectively reveal deviations of language models from human behavior, and offer insights on how to reduce the deviations with model selection and prompting strategies.
arXiv Detail & Related papers (2024-03-13T18:16:21Z)
- Pseudointelligence: A Unifying Framework for Language Model Evaluation [14.95543156914676]
We propose a complexity-theoretic framework of model evaluation cast as a dynamic interaction between a model and a learned evaluator.
We demonstrate that this framework can be used to reason about two case studies in language model evaluation, as well as analyze existing evaluation methods.
arXiv Detail & Related papers (2023-10-18T17:48:05Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Towards the Scalable Evaluation of Cooperativeness in Language Models [1.7875811547963403]
We aim to understand and shape the multi-agent behaviors of PLMs in a pro-social manner.
We generate scenarios with particular structures using both crowdworkers and a language model.
We find that instruct-tuned models tend to act in a way that could be perceived as cooperative when scaled up.
arXiv Detail & Related papers (2023-03-16T15:34:23Z)
- Language Model Cascades [72.18809575261498]
Repeated interactions at test-time with a single model, or the composition of multiple models together, further expands capabilities.
Cases with control flow and dynamic structure require techniques from probabilistic programming.
We formalize several existing techniques from this perspective, including scratchpads / chain of thought, verifiers, STaR, selection-inference, and tool use (a toy verifier-gated cascade is sketched after this list).
arXiv Detail & Related papers (2022-07-21T07:35:18Z)
- Småprat: DialoGPT for Natural Language Generation of Swedish Dialogue by Transfer Learning [1.6111818380407035]
State-of-the-art models for the generation of natural language dialogue have demonstrated impressive performance in simulating human-like, single-turn conversations in English.
This work empirically investigates the potential for transferring such models to the Swedish language.
arXiv Detail & Related papers (2021-10-12T18:46:43Z)
- Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z)
- Prototype-to-Style: Dialogue Generation with Style-Aware Editing on Retrieval Memory [65.98002918470543]
We introduce a new prototype-to-style framework to tackle the challenge of stylistic dialogue generation.
The framework uses an Information Retrieval (IR) system to extract a response prototype from the retrieved response.
A stylistic response generator then takes the prototype and the desired language style as model input to obtain a high-quality and stylistic response.
arXiv Detail & Related papers (2020-04-05T14:36:15Z)
- XPersona: Evaluating Multilingual Personalized Chatbot [76.00426517401894]
We propose a multi-lingual extension of Persona-Chat, namely XPersona.
Our dataset includes persona conversations in six different languages other than English for building and evaluating multilingual personalized agents.
arXiv Detail & Related papers (2020-03-17T07:52:08Z)
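Picking up the forward reference from the Language Model Cascades entry above: below is a toy sketch of one technique that paper names, chain-of-thought sampling gated by a verifier model. generate(), the prompts, and the retry budget are illustrative assumptions, not the paper's probabilistic-programming formalism.

```python
# Toy "scratchpad + verifier" cascade. generate() stands in for any
# language-model sampling call; prompts and budget are assumptions.
def generate(prompt: str) -> str:
    """Placeholder: sample a completion from a language model."""
    raise NotImplementedError

def solve_with_verifier(question: str, attempts: int = 5) -> str | None:
    """Sample chain-of-thought answers; return the first one a verifier accepts."""
    for _ in range(attempts):
        reasoning = generate(f"Q: {question}\nLet's think step by step:")
        answer = generate(
            f"Q: {question}\nReasoning: {reasoning}\nFinal answer:")
        verdict = generate(
            f"Question: {question}\nProposed answer: {answer}\n"
            "Is this answer correct? Answer yes or no:")
        if verdict.strip().lower().startswith("yes"):
            return answer
    return None  # no verified answer within the sampling budget
```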