RoleEval: A Bilingual Role Evaluation Benchmark for Large Language
Models
- URL: http://arxiv.org/abs/2312.16132v2
- Date: Fri, 16 Feb 2024 10:02:44 GMT
- Title: RoleEval: A Bilingual Role Evaluation Benchmark for Large Language
Models
- Authors: Tianhao Shen, Sun Li, Quan Tu, Deyi Xiong
- Abstract summary: This paper introduces RoleEval, a benchmark designed to assess the memorization, utilization, and reasoning capabilities of role knowledge.
RoleEval comprises RoleEval-Global and RoleEval-Chinese, with 6,000 Chinese-English parallel multiple-choice questions.
- Score: 44.105939096171454
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The rapid evolution of large language models necessitates effective
benchmarks for evaluating their role knowledge, which is essential for
establishing connections with the real world and providing more immersive
interactions. This paper introduces RoleEval, a bilingual benchmark designed to
assess the memorization, utilization, and reasoning capabilities of role
knowledge. RoleEval comprises RoleEval-Global (including internationally
recognized characters) and RoleEval-Chinese (including characters popular in
China), with 6,000 Chinese-English parallel multiple-choice questions focusing
on 300 influential people and fictional characters drawn from a variety of
domains including celebrities, anime, comics, movies, TV series, games, and
fiction. These questions cover basic knowledge and multi-hop reasoning
abilities, aiming to systematically probe various aspects such as personal
information, relationships, abilities, and experiences of the characters. To
maintain high standards, we perform a hybrid quality check process combining
both automatic and human verification, ensuring that the questions are diverse,
challenging, and discriminative.
Our extensive evaluations with RoleEval across various open-source and
proprietary large language models, under both the zero- and few-shot settings,
reveal insightful findings. Notably, while GPT-4 outperforms other models on
RoleEval-Global, Chinese large language models excel on RoleEval-Chinese,
highlighting significant knowledge distribution differences. We expect
RoleEval to underscore the importance of assessing role knowledge for large
language models across different languages and cultural settings.
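To make the evaluation setup concrete, the following is a minimal sketch of zero-shot scoring on four-option multiple-choice questions of the kind RoleEval contains. The data fields, prompt template, and the model-query function are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of zero-shot multiple-choice accuracy scoring, assuming
# each item stores a question, four options, and a gold answer letter.
# The prompt format and `query_model` callable are hypothetical.

def format_prompt(item: dict) -> str:
    """Render one question as a zero-shot multiple-choice prompt."""
    options = "\n".join(
        f"{letter}. {text}" for letter, text in zip("ABCD", item["options"])
    )
    return (
        f"Question: {item['question']}\n"
        f"{options}\n"
        "Answer with a single letter (A, B, C, or D):"
    )

def accuracy(items: list[dict], query_model) -> float:
    """Score a model by exact match on the predicted option letter."""
    correct = 0
    for item in items:
        reply = query_model(format_prompt(item))  # model call is abstracted away
        predicted = next((c for c in reply.upper() if c in "ABCD"), None)
        correct += int(predicted == item["answer"])  # gold answer stored as "A".."D"
    return correct / len(items)

if __name__ == "__main__":
    sample = [{
        "question": "Which school does Harry Potter attend?",
        "options": ["Beauxbatons", "Hogwarts", "Durmstrang", "Ilvermorny"],
        "answer": "B",
    }]
    print(accuracy(sample, lambda prompt: "B"))  # stub model always answers "B" -> 1.0
```

A few-shot variant would simply prepend a handful of solved question-answer pairs to the prompt before the test question.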
Related papers
- Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs [50.0874045899661]
We introduce CharacterBot, a model designed to replicate both the linguistic patterns and distinctive thought processes of a character.
Using Lu Xun as a case study, we propose four training tasks derived from his 17 essay collections.
These include a pre-training task focused on mastering external linguistic structures and knowledge, as well as three fine-tuning tasks.
We evaluate CharacterBot on three tasks for linguistic accuracy and opinion comprehension, demonstrating that it significantly outperforms the baselines on our adapted metrics.
arXiv Detail & Related papers (2025-02-18T16:11:54Z) - Multilingual European Language Models: Benchmarking Approaches and Challenges [2.413212225810367]
Generative large language models (LLMs) can solve a range of tasks through chat interaction.
This paper analyses the benefits and limitations of current evaluation datasets, focusing on multilingual European benchmarks.
We discuss potential solutions for improving translation quality and mitigating cultural biases, including human-in-the-loop verification and iterative translation ranking.
arXiv Detail & Related papers (2025-02-18T14:32:17Z) - Strategic Prompting for Conversational Tasks: A Comparative Analysis of Large Language Models Across Diverse Conversational Tasks [23.34710429552906]
We evaluate the capabilities and limitations of five prevalent Large Language Models: Llama, OPT, Falcon, Alpaca, and MPT.
The study encompasses various conversational tasks, including reservation, empathetic response generation, mental health and legal counseling, persuasion, and negotiation.
arXiv Detail & Related papers (2024-11-26T08:21:24Z) - MINDECHO: Role-Playing Language Agents for Key Opinion Leaders [50.43050502970816]
This paper introduces MINDECHO, a framework for developing and evaluating role-playing agents for Key Opinion Leaders (KOLs).
MINDECHO collects KOL data from Internet video transcripts in various professional fields and synthesizes their conversations with GPT-4.
Our evaluation covers both general dimensions (i.e., knowledge and tone) and fan-centric dimensions for KOLs.
arXiv Detail & Related papers (2024-07-07T09:08:33Z) - Enhancing LLM-Based Human-Robot Interaction with Nuances for Diversity Awareness [0.0]
This paper presents a system for diversity-aware autonomous conversation that leverages the capabilities of large language models (LLMs).
The system adapts to diverse populations and individuals, considering factors like background, personality, age, gender, and culture.
To assess the system's performance, we conducted both controlled and real-world experiments, measuring a wide range of performance indicators.
arXiv Detail & Related papers (2024-06-25T13:15:36Z) - MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for the adaptability of knowledge editing methods across five languages.
MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice.
We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z) - Decomposed Prompting: Unveiling Multilingual Linguistic Structure
Knowledge in English-Centric Large Language Models [12.700783525558721]
English-centric Large Language Models (LLMs) like GPT-3 and LLaMA display a remarkable ability to perform multilingual tasks.
This paper introduces the decomposed prompting approach to probe the linguistic structure understanding of these LLMs in sequence labeling tasks.
arXiv Detail & Related papers (2024-02-28T15:15:39Z) - DIALIGHT: Lightweight Multilingual Development and Evaluation of
Task-Oriented Dialogue Systems with Large Language Models [76.79929883963275]
DIALIGHT is a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems.
It features a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level.
Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses.
arXiv Detail & Related papers (2024-01-04T11:27:48Z) - Exploring the Maze of Multilingual Modeling [2.0849578298972835]
We present a comprehensive evaluation of three popular multilingual language models: mBERT, XLM-R, and GPT-3.
Our findings reveal that while the amount of language-specific pretraining data plays a crucial role in model performance, we also identify other factors such as general resource availability, language family, and script type, as important features.
arXiv Detail & Related papers (2023-10-09T04:48:14Z) - Towards Best Practices for Training Multilingual Dense Retrieval Models [54.91016739123398]
We focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design.
Our study is organized as a "best practices" guide for training multilingual dense retrieval models.
arXiv Detail & Related papers (2022-04-05T17:12:53Z)