CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent
Evaluation
- URL: http://arxiv.org/abs/2401.01275v2
- Date: Tue, 9 Jan 2024 18:54:05 GMT
- Title: CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent
Evaluation
- Authors: Quan Tu, Shilong Fan, Zihang Tian, Rui Yan
- Abstract summary: We introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment.
The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 23,020 examples and featuring 77 characters.
CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions.
- Score: 20.31635307504041
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, the advent of large language models (LLMs) has revolutionized
generative agents. Among them, Role-Playing Conversational Agents (RPCAs)
attract considerable attention due to their ability to emotionally engage
users. However, the absence of a comprehensive benchmark impedes progress in
this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark
for comprehensive RPCA assessment, complemented by a tailored high-quality
dataset. The dataset comprises 1,785 multi-turn role-playing dialogues,
encompassing 23,020 examples and featuring 77 characters derived from Chinese
novels and scripts. It was carefully constructed, beginning with initial
dialogue extraction via GPT-4, followed by rigorous human-led quality control,
and enhanced with in-depth character profiles sourced from Baidu Baike.
CharacterEval employs a multifaceted evaluation approach, encompassing thirteen
targeted metrics on four dimensions. Comprehensive experiments on CharacterEval
demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in
Chinese role-playing conversation. Source code, data source and reward model
will be publicly accessible at https://github.com/morecry/CharacterEval.
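The dataset sizes quoted in the abstract can be turned into rough per-unit averages. The sketch below is illustrative arithmetic only: it uses the three counts stated above and assumes nothing about how dialogues or examples are actually distributed across characters, which the abstract does not specify. The helper name `average` is hypothetical.

```python
# Dataset sizes as quoted in the abstract.
NUM_DIALOGUES = 1_785
NUM_EXAMPLES = 23_020
NUM_CHARACTERS = 77


def average(total: int, count: int) -> float:
    """Return total / count, rejecting a non-positive count."""
    if count <= 0:
        raise ValueError("count must be positive")
    return total / count


# Simple means; the real per-dialogue and per-character distributions
# are not given in the abstract and may be far from uniform.
examples_per_dialogue = average(NUM_EXAMPLES, NUM_DIALOGUES)
dialogues_per_character = average(NUM_DIALOGUES, NUM_CHARACTERS)
examples_per_character = average(NUM_EXAMPLES, NUM_CHARACTERS)

print(f"{examples_per_dialogue:.1f} examples per dialogue")
print(f"{dialogues_per_character:.1f} dialogues per character")
print(f"{examples_per_character:.1f} examples per character")
```

On these figures, a dialogue yields about 12.9 examples on average, and each character appears in about 23.2 dialogues.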
Related papers
- CoSER: Coordinating LLM-Based Persona Simulation of Established Roles [62.886267684392635]
CoSER dataset covers 17,966 characters from 771 renowned books.
We develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models.
arXiv Detail & Related papers (2025-02-13T08:55:24Z)
- CharacterBench: Benchmarking Character Customization of Large Language Models [80.29164862682063]
We propose CharacterBench, the largest bilingual generative benchmark, with 22,859 human-annotated samples covering 3,956 characters.
We define 11 dimensions of 6 aspects, classified as sparse and dense dimensions based on whether character features evaluated by specific dimensions manifest in each response.
We also develop CharacterJudge model for cost-effective and stable evaluations.
arXiv Detail & Related papers (2024-12-16T15:55:34Z)
- Characteristic AI Agents via Large Language Models [40.10858767752735]
This research focuses on investigating the performance of Large Language Models in constructing characteristic AI agents.
A dataset called "Character100" is built for this benchmark, comprising the most-visited people on Wikipedia for language models to role-play.
The experimental results underscore the potential directions for further improvement in the capabilities of LLMs in constructing characteristic AI agents.
arXiv Detail & Related papers (2024-03-19T02:25:29Z)
- AlignBench: Benchmarking Chinese Alignment of Large Language Models [99.24597941555277]
We introduce AlignBench, a comprehensive benchmark for evaluating Chinese Large Language Models' alignment.
We design a human-in-the-loop data curation pipeline, containing eight main categories, 683 real-scenario rooted queries and corresponding human verified references.
For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge (Zheng et al., 2023) approach with Chain-of-Thought to generate explanations and final ratings.
arXiv Detail & Related papers (2023-11-30T17:41:30Z)
- Better Zero-Shot Reasoning with Role-Play Prompting [10.90357246745529]
Role-play prompting consistently surpasses the standard zero-shot approach across most datasets.
This highlights its potential to augment the reasoning capabilities of large language models.
arXiv Detail & Related papers (2023-08-15T11:08:30Z)
- CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models.
The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control.
Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
arXiv Detail & Related papers (2023-06-28T14:14:44Z)
- LLM-powered Data Augmentation for Enhanced Cross-lingual Performance [24.20730298894794]
This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in commonsense reasoning datasets.
To achieve this, we utilise several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets: XCOPA, XWinograd, and XStoryCloze.
We evaluate the effectiveness of fine-tuning smaller multilingual models, mBERT and XLMR, using the synthesised data.
arXiv Detail & Related papers (2023-05-23T17:33:27Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Large Language Models Meet Harry Potter: A Bilingual Dataset for Aligning Dialogue Agents with Characters [70.84938803753062]
We introduce the Harry Potter Dialogue dataset, designed to advance the study of dialogue agents and character alignment.
The dataset encompasses all dialogue sessions (in both English and Chinese) from the Harry Potter series.
It is annotated with vital background information, including dialogue scenes, speakers, character relationships, and attributes.
arXiv Detail & Related papers (2022-11-13T10:16:39Z)