RoleEval: A Bilingual Role Evaluation Benchmark for Large Language
Models
- URL: http://arxiv.org/abs/2312.16132v2
- Date: Fri, 16 Feb 2024 10:02:44 GMT
- Title: RoleEval: A Bilingual Role Evaluation Benchmark for Large Language
Models
- Authors: Tianhao Shen, Sun Li, Quan Tu, Deyi Xiong
- Abstract summary: This paper introduces RoleEval, a benchmark designed to assess the memorization, utilization, and reasoning capabilities of role knowledge.
RoleEval comprises RoleEval-Global and RoleEval-Chinese, with 6,000 Chinese-English parallel multiple-choice questions.
- Score: 44.105939096171454
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The rapid evolution of large language models necessitates effective
benchmarks for evaluating their role knowledge, which is essential for
establishing connections with the real world and providing more immersive
interactions. This paper introduces RoleEval, a bilingual benchmark designed to
assess the memorization, utilization, and reasoning capabilities of role
knowledge. RoleEval comprises RoleEval-Global (including internationally
recognized characters) and RoleEval-Chinese (including characters popular in
China), with 6,000 Chinese-English parallel multiple-choice questions focusing
on 300 influential people and fictional characters drawn from a variety of
domains including celebrities, anime, comics, movies, TV series, games, and
fiction. These questions cover basic knowledge and multi-hop reasoning
abilities, aiming to systematically probe various aspects such as personal
information, relationships, abilities, and experiences of the characters. To
maintain high standards, we perform a hybrid quality check process combining
both automatic and human verification, ensuring that the questions are diverse,
challenging, and discriminative.
Our extensive evaluations with RoleEval across various open-source and
proprietary large language models, under both the zero- and few-shot settings,
reveal insightful findings. Notably, while GPT-4 outperforms other models on
RoleEval-Global, Chinese large language models excel on RoleEval-Chinese,
highlighting significant differences in knowledge distribution. We expect RoleEval
to underscore the importance of assessing role knowledge for large language models
across various languages and cultural settings.
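Since RoleEval items are multiple-choice questions scored under zero- and few-shot settings, the sketch below illustrates how such an evaluation loop might compute accuracy. It is a minimal sketch only: the item fields ("question", "choices", "answer") and the query_model function are hypothetical placeholders, not the benchmark's released data format or evaluation code.

```python
# Minimal sketch of a zero-shot multiple-choice evaluation loop for a
# RoleEval-style benchmark. The item schema and query_model() are
# hypothetical placeholders, not the authors' released code.

def query_model(prompt: str) -> str:
    """Placeholder for a call to the language model under evaluation."""
    raise NotImplementedError

def format_prompt(item: dict) -> str:
    # Present the question with lettered options and ask for a single letter.
    options = "\n".join(
        f"{label}. {text}" for label, text in zip("ABCD", item["choices"])
    )
    return (
        f"Question: {item['question']}\n{options}\n"
        "Answer with the letter of the correct option."
    )

def accuracy(items: list[dict]) -> float:
    # Count predictions whose first letter matches the gold answer label.
    correct = 0
    for item in items:
        prediction = query_model(format_prompt(item)).strip().upper()[:1]
        correct += prediction == item["answer"]
    return correct / len(items)
```

A few-shot variant would simply prepend a handful of solved example questions to the prompt before the target question; the scoring logic stays the same.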
Related papers
- Strategic Prompting for Conversational Tasks: A Comparative Analysis of Large Language Models Across Diverse Conversational Tasks [21.079199282600907]
We evaluate the capabilities and limitations of five prevalent Large Language Models: Llama, OPT, Falcon, Alpaca, and MPT.
The study encompasses various conversational tasks, including reservation, empathetic response generation, mental health and legal counseling, persuasion, and negotiation.
arXiv Detail & Related papers (2024-11-26T08:21:24Z)
- Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models [0.0]
The quality of tokenization can significantly impact a model's ability to handle diverse languages effectively.
We introduce Qtok, a tool designed to assess tokenizer quality with a specific emphasis on their performance in multilingual contexts.
Qtok applies these metrics to evaluate 13 distinct tokenizers from 58 publicly available models, analyzing their output across different linguistic contexts.
arXiv Detail & Related papers (2024-10-16T19:34:34Z)
- MINDECHO: Role-Playing Language Agents for Key Opinion Leaders [50.43050502970816]
This paper introduces MINDECHO, a framework for developing and evaluating role-playing language agents for Key Opinion Leaders (KOLs).
MINDECHO collects KOL data from Internet video transcripts in various professional fields, and synthesizes their conversations leveraging GPT-4.
Our evaluation covers both general dimensions (i.e., knowledge and tone) and fan-centric dimensions for KOLs.
arXiv Detail & Related papers (2024-07-07T09:08:33Z)
- Enhancing LLM-Based Human-Robot Interaction with Nuances for Diversity Awareness [0.0]
This paper presents a system for diversity-aware autonomous conversation that leverages the capabilities of large language models (LLMs).
The system adapts to diverse populations and individuals, considering factors like background, personality, age, gender, and culture.
To assess the system's performance, we conducted both controlled and real-world experiments, measuring a wide range of performance indicators.
arXiv Detail & Related papers (2024-06-25T13:15:36Z)
- MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for the adaptability of knowledge editing methods across five languages.
MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice.
We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z)
- Decomposed Prompting: Unveiling Multilingual Linguistic Structure Knowledge in English-Centric Large Language Models [12.700783525558721]
English-centric Large Language Models (LLMs) like GPT-3 and LLaMA display a remarkable ability to perform multilingual tasks.
This paper introduces the decomposed prompting approach to probe the linguistic structure understanding of these LLMs in sequence labeling tasks.
arXiv Detail & Related papers (2024-02-28T15:15:39Z)
- Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment [62.898963074989766]
We introduce Ditto, a self-alignment method for role-play.
This method creates a role-play training set comprising 4,000 characters, surpassing the scale of currently available datasets by tenfold.
We present the first comprehensive cross-supervision alignment experiment in the role-play domain.
arXiv Detail & Related papers (2024-01-23T03:56:22Z)
- DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models [76.79929883963275]
DIALIGHT is a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems.
It features a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level.
Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses.
arXiv Detail & Related papers (2024-01-04T11:27:48Z)
- Exploring the Maze of Multilingual Modeling [2.0849578298972835]
We present a comprehensive evaluation of three popular multilingual language models: mBERT, XLM-R, and GPT-3.
Our findings reveal that while the amount of language-specific pretraining data plays a crucial role in model performance, other factors such as general resource availability, language family, and script type are also important.
arXiv Detail & Related papers (2023-10-09T04:48:14Z)
- Towards Best Practices for Training Multilingual Dense Retrieval Models [54.91016739123398]
We focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design.
Our study is organized as a "best practices" guide for training multilingual dense retrieval models.
arXiv Detail & Related papers (2022-04-05T17:12:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.