RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models
- URL: http://arxiv.org/abs/2312.16132v2
- Date: Fri, 16 Feb 2024 10:02:44 GMT
- Title: RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models
- Authors: Tianhao Shen, Sun Li, Quan Tu, Deyi Xiong
- Abstract summary: This paper introduces RoleEval, a benchmark designed to assess the memorization, utilization, and reasoning capabilities of role knowledge.
RoleEval comprises RoleEval-Global and RoleEval-Chinese, with 6,000 Chinese-English parallel multiple-choice questions.
- Score: 44.105939096171454
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The rapid evolution of large language models necessitates effective
benchmarks for evaluating their role knowledge, which is essential for
establishing connections with the real world and providing more immersive
interactions. This paper introduces RoleEval, a bilingual benchmark designed to
assess the memorization, utilization, and reasoning capabilities of role
knowledge. RoleEval comprises RoleEval-Global (including internationally
recognized characters) and RoleEval-Chinese (including characters popular in
China), with 6,000 Chinese-English parallel multiple-choice questions focusing
on 300 influential people and fictional characters drawn from a variety of
domains including celebrities, anime, comics, movies, TV series, games, and
fiction. These questions cover basic knowledge and multi-hop reasoning
abilities, aiming to systematically probe various aspects such as personal
information, relationships, abilities, and experiences of the characters. To
maintain high standards, we perform a hybrid quality check process combining
both automatic and human verification, ensuring that the questions are diverse,
challenging, and discriminative.
Our extensive evaluations with RoleEval across various open-source and
proprietary large language models, under both the zero- and few-shot settings,
reveal insightful findings. Notably, while GPT-4 outperforms other models on
RoleEval-Global, Chinese large language models excel on RoleEval-Chinese,
highlighting significant knowledge distribution differences. We expect that
RoleEval will highlight the significance of assessing role knowledge for large
language models across various languages and cultural settings.
Related papers
- Enhancing LLM-Based Human-Robot Interaction with Nuances for Diversity Awareness [0.0]
This paper presents a system for diversity-aware autonomous conversation leveraging the capabilities of large language models (LLMs).
The system adapts to diverse populations and individuals, considering factors like background, personality, age, gender, and culture.
To assess the system's performance, we conducted both controlled and real-world experiments, measuring a wide range of performance indicators.
arXiv Detail & Related papers (2024-06-25T13:15:36Z)
- MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for the adaptability of knowledge editing methods across five languages.
MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice formats.
We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z)
- RoleInteract: Evaluating the Social Interaction of Role-Playing Agents [85.6641890712617]
We introduce the first benchmark designed to evaluate the sociality of role-playing conversational agents at both individual and group levels of social interactions.
The benchmark is constructed from a variety of sources and covers a wide range of 500 characters and over 6,000 question prompts.
We find that an agent's strong performance at the individual level does not imply proficiency at the group level.
arXiv Detail & Related papers (2024-03-20T15:38:36Z)
- Decomposed Prompting: Unveiling Multilingual Linguistic Structure Knowledge in English-Centric Large Language Models [12.700783525558721]
English-centric Large Language Models (LLMs) like GPT-3 and LLaMA display a remarkable ability to perform multilingual tasks.
This paper introduces the decomposed prompting approach to probe the linguistic structure understanding of these LLMs in sequence labeling tasks.
arXiv Detail & Related papers (2024-02-28T15:15:39Z)
- Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment [62.898963074989766]
We introduce Ditto, a self-alignment method for role-play.
This method creates a role-play training set comprising 4,000 characters, surpassing the scale of currently available datasets by tenfold.
We present the first comprehensive cross-supervision alignment experiment in the role-play domain.
arXiv Detail & Related papers (2024-01-23T03:56:22Z)
- DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models [76.79929883963275]
DIALIGHT is a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems.
It features a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level.
Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses.
arXiv Detail & Related papers (2024-01-04T11:27:48Z)
- Exploring the Maze of Multilingual Modeling [2.0849578298972835]
We present a comprehensive evaluation of three popular multilingual language models: mBERT, XLM-R, and GPT-3.
Our findings reveal that while the amount of language-specific pretraining data plays a crucial role in model performance, we also identify other factors such as general resource availability, language family, and script type, as important features.
arXiv Detail & Related papers (2023-10-09T04:48:14Z)
- M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models [35.17226595231825]
M3KE is a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark.
It is developed to measure knowledge acquired by Chinese large language models.
We have collected 20,477 questions from 71 tasks.
arXiv Detail & Related papers (2023-05-17T14:56:31Z)
- Towards Best Practices for Training Multilingual Dense Retrieval Models [54.91016739123398]
We focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design.
Our study is organized as a "best practices" guide for training multilingual dense retrieval models.
arXiv Detail & Related papers (2022-04-05T17:12:53Z)