CharacterBench: Benchmarking Character Customization of Large Language Models
- URL: http://arxiv.org/abs/2412.11912v1
- Date: Mon, 16 Dec 2024 15:55:34 GMT
- Title: CharacterBench: Benchmarking Character Customization of Large Language Models
- Authors: Jinfeng Zhou, Yongkang Huang, Bosi Wen, Guanqun Bi, Yuxuan Chen, Pei Ke, Zhuang Chen, Xiyao Xiao, Libiao Peng, Kuntian Tang, Rongsheng Zhang, Le Zhang, Tangjie Lv, Zhipeng Hu, Hongning Wang, Minlie Huang
- Abstract summary: We propose CharacterBench, the largest bilingual generative benchmark, with 22,859 human-annotated samples covering 3,956 characters.
We define 11 dimensions of 6 aspects, classified as sparse and dense dimensions based on whether character features evaluated by specific dimensions manifest in each response.
We also develop the CharacterJudge model for cost-effective and stable evaluations.
- Abstract: Character-based dialogue (aka role-playing) enables users to freely customize characters for interaction, which often relies on LLMs, raising the need to evaluate LLMs' character customization capability. However, existing benchmarks fail to ensure a robust evaluation as they often only involve a single character category or evaluate limited dimensions. Moreover, the sparsity of character features in responses makes feature-focused generative evaluation both ineffective and inefficient. To address these issues, we propose CharacterBench, the largest bilingual generative benchmark, with 22,859 human-annotated samples covering 3,956 characters from 25 detailed character categories. We define 11 dimensions across 6 aspects, classified as sparse and dense dimensions based on whether the character features evaluated by a specific dimension manifest in every response. We enable effective and efficient evaluation by crafting tailored queries for each dimension to induce characters' responses related to that dimension. Further, we develop the CharacterJudge model for cost-effective and stable evaluations. Experiments show its superiority over SOTA automatic judges (e.g., GPT-4) and our benchmark's potential to optimize LLMs' character customization. Our repository is at https://github.com/thu-coai/CharacterBench.
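As a rough illustration of the dimension-specific, LLM-as-judge evaluation the abstract describes, the sketch below scores one character response along a single dimension. The sample fields, prompt wording, and the `call_judge` stub are illustrative assumptions, not the paper's actual data schema or the CharacterJudge interface.

```python
# Hypothetical sketch of dimension-specific judging in a CharacterBench-style setup.
# Field names, prompt text, and call_judge are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Sample:
    character_profile: str  # persona description of the customized character
    dimension: str          # e.g., "knowledge consistency" (one of the 11 dimensions)
    query: str              # dimension-tailored query that induces the relevant feature
    response: str           # response produced by the LLM under evaluation

def build_judge_prompt(sample: Sample) -> str:
    """Compose a pointwise judging prompt for a single evaluation dimension."""
    return (
        f"Character profile:\n{sample.character_profile}\n\n"
        f"User query:\n{sample.query}\n\n"
        f"Model response:\n{sample.response}\n\n"
        f"Rate how well the response satisfies the dimension "
        f"'{sample.dimension}' on a 1-5 scale. Reply with the number only."
    )

def call_judge(prompt: str) -> str:
    # Placeholder for a call to a judge model (e.g., GPT-4 or a trained
    # CharacterJudge-style model); replace with a real model or API call.
    raise NotImplementedError

def score_sample(sample: Sample) -> int:
    reply = call_judge(build_judge_prompt(sample))
    return int(reply.strip()[0])  # naive parse of a "1"-"5" reply
```

In practice, per-dimension scores would be aggregated over all samples of that dimension to compare models; the parsing and aggregation details here are purely for illustration.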
Related papers
- CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
CoSER dataset covers 17,966 characters from 771 renowned books.
We develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models.
arXiv Detail & Related papers (2025-02-13T08:55:24Z)
- CharacterBox: Evaluating the Role-Playing Capabilities of LLMs in Text-Based Virtual Worlds
Role-playing is a crucial capability of Large Language Models (LLMs).
Current evaluation methods fall short of adequately capturing the nuanced character traits and behaviors essential for authentic role-playing.
We propose CharacterBox, a simulation sandbox designed to generate situational fine-grained character behavior trajectories.
arXiv Detail & Related papers (2024-12-07T12:09:35Z)
- CHATTER: A Character Attribution Dataset for Narrative Understanding
We validate a subset of Chatter, called ChatterEval, using human annotations to serve as an evaluation benchmark for the character attribution task in movie scripts.
ChatterEval assesses narrative understanding and the long-context modeling capacity of language models.
arXiv Detail & Related papers (2024-11-07T22:37:30Z)
- PersoBench: Benchmarking Personalized Response Generation in Large Language Models
We present a new benchmark, PersoBench, to evaluate the personalization ability of large language models (LLMs) in persona-aware dialogue generation.
Our analysis, conducted on three well-known persona-aware datasets, evaluates multiple dimensions of response quality, including fluency, diversity, coherence, and personalization.
arXiv Detail & Related papers (2024-10-04T07:29:41Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- Evaluating Character Understanding of Large Language Models via Character Profiling from Fictional Works
Large language models (LLMs) have demonstrated impressive performance and spurred numerous AI applications, including role-playing agents (RPAs).
The prerequisite for these RPAs lies in the capability of LLMs to understand characters from fictional works.
Previous efforts have evaluated this capability via basic classification tasks or characteristic imitation.
arXiv Detail & Related papers (2024-04-19T09:10:29Z)
- CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation
We introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment.
The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 23,020 examples and featuring 77 characters.
CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions.
arXiv Detail & Related papers (2024-01-02T16:20:40Z)
- CharacterGLM: Customizing Chinese Conversational AI Characters with Large Language Models
We present CharacterGLM, a series of models built upon ChatGLM, with model sizes ranging from 6B to 66B parameters.
Our CharacterGLM is designed for generating Character-based Dialogues (CharacterDial), which aims to equip a conversational AI system with character customization for satisfying people's inherent social desires and emotional needs.
arXiv Detail & Related papers (2023-11-28T14:49:23Z)
- Large Language Models Meet Harry Potter: A Bilingual Dataset for Aligning Dialogue Agents with Characters
We introduce the Harry Potter Dialogue dataset, designed to advance the study of dialogue agents and character alignment.
The dataset encompasses all dialogue sessions (in both English and Chinese) from the Harry Potter series.
It is annotated with vital background information, including dialogue scenes, speakers, character relationships, and attributes.
arXiv Detail & Related papers (2022-11-13T10:16:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.