Related papers: RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models

RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models

URL: http://arxiv.org/abs/2312.16132v2
Date: Fri, 16 Feb 2024 10:02:44 GMT
Title: RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models
Authors: Tianhao Shen, Sun Li, Quan Tu, Deyi Xiong
Abstract summary: This paper introduces RoleEval, a benchmark designed to assess the memorization, utilization, and reasoning capabilities of role knowledge. RoleEval comprises RoleEval-Global and RoleEval-Chinese, with 6,000 Chinese-English parallel multiple-choice questions.
Score: 44.105939096171454
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The rapid evolution of large language models necessitates effective benchmarks for evaluating their role knowledge, which is essential for establishing connections with the real world and providing more immersive interactions. This paper introduces RoleEval, a bilingual benchmark designed to assess the memorization, utilization, and reasoning capabilities of role knowledge. RoleEval comprises RoleEval-Global (including internationally recognized characters) and RoleEval-Chinese (including characters popular in China), with 6,000 Chinese-English parallel multiple-choice questions focusing on 300 influential people and fictional characters drawn from a variety of domains including celebrities, anime, comics, movies, TV series, games, and fictions. These questions cover basic knowledge and multi-hop reasoning abilities, aiming to systematically probe various aspects such as personal information, relationships, abilities, and experiences of the characters. To maintain high standards, we perform a hybrid quality check process combining both automatic and human verification, ensuring that the questions are diverse, challenging, and discriminative. Our extensive evaluations with RoleEval across various open-source and proprietary large language models, under both the zero- and few-shot settings, reveal insightful findings. Notably, while GPT-4 outperforms other models on RoleEval-Global, Chinese large language models excel on RoleEval-Chinese, highlighting significant knowledge distribution differences. We expect that RoleEval would highlight the significance of assessing role knowledge for large language models across various languages and cultural settings.

Related papers

Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs [50.0874045899661]
We introduce CharacterBot, a model designed to replicate both the linguistic patterns and distinctive thought processes of a character. Using Lu Xun as a case study, we propose four training tasks derived from his 17 essay collections. These include a pre-training task focused on mastering external linguistic structures and knowledge, as well as three fine-tuning tasks. We evaluate CharacterBot on three tasks for linguistic accuracy and opinion comprehension, demonstrating that it significantly outperforms the baselines on our adapted metrics.
arXiv Detail & Related papers (2025-02-18T16:11:54Z)
Multilingual European Language Models: Benchmarking Approaches and Challenges [2.413212225810367]
generative large language models (LLMs) can solve different tasks through chat interaction. This paper analyses the benefits and limitations of current evaluation datasets, focusing on multilingual European benchmarks. We discuss potential solutions to enhance translation quality and cultural biases, including human-in-the-loop verification and iterative translation ranking.
arXiv Detail & Related papers (2025-02-18T14:32:17Z)
Strategic Prompting for Conversational Tasks: A Comparative Analysis of Large Language Models Across Diverse Conversational Tasks [21.079199282600907]
We evaluate the capabilities and limitations of five prevalent Large Language Models: Llama, OPT, Falcon, Alpaca, and MPT. The study encompasses various conversational tasks, including reservation, empathetic response generation, mental health and legal counseling, persuasion, and negotiation.
arXiv Detail & Related papers (2024-11-26T08:21:24Z)
Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models [0.0]
The quality of tokenization can significantly impact a model's ability to handle diverse languages effectively. We introduce Qtok, a tool designed to assess tokenizer quality with a specific emphasis on their performance in multilingual contexts. Qtok applies these metrics to evaluate 13 distinct tokenizers from 58 publicly available models, analyzing their output across different linguistic contexts.
arXiv Detail & Related papers (2024-10-16T19:34:34Z)
MINDECHO: Role-Playing Language Agents for Key Opinion Leaders [50.43050502970816]
This paper introduces MINDECHO, a framework for the development and evaluation of Key Opinion Leaders (KOLs) MINDECHO collects KOL data from Internet video transcripts in various professional fields, and synthesizes their conversations leveraging GPT-4. Our evaluation covers both general dimensions (ie, knowledge and tones) and fan-centric dimensions for KOLs.
arXiv Detail & Related papers (2024-07-07T09:08:33Z)
Enhancing LLM-Based Human-Robot Interaction with Nuances for Diversity Awareness [0.0]
This paper presents a system for diversity-aware autonomous conversation leveraging the capabilities of large language models (LLMs) The system adapts to diverse populations and individuals, considering factors like background, personality, age, gender, and culture. To assess the system's performance, we conducted both controlled and real-world experiments, measuring a wide range of performance indicators.
arXiv Detail & Related papers (2024-06-25T13:15:36Z)
MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for the adaptability of knowledge editing methods across five languages. MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice. We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z)
Decomposed Prompting: Unveiling Multilingual Linguistic Structure Knowledge in English-Centric Large Language Models [12.700783525558721]
English-centric Large Language Models (LLMs) like GPT-3 and LLaMA display a remarkable ability to perform multilingual tasks. This paper introduces the decomposed prompting approach to probe the linguistic structure understanding of these LLMs in sequence labeling tasks.
arXiv Detail & Related papers (2024-02-28T15:15:39Z)
Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment [62.898963074989766]
We introduce Ditto, a self-alignment method for role-play. This method creates a role-play training set comprising 4,000 characters, surpassing the scale of currently available datasets by tenfold. We present the first comprehensive cross-supervision alignment experiment in the role-play domain.
arXiv Detail & Related papers (2024-01-23T03:56:22Z)
DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models [76.79929883963275]
DIALIGHT is a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems. It features a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level. Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses.
arXiv Detail & Related papers (2024-01-04T11:27:48Z)
Exploring the Maze of Multilingual Modeling [2.0849578298972835]
We present a comprehensive evaluation of three popular multilingual language models: mBERT, XLM-R, and GPT-3. Our findings reveal that while the amount of language-specific pretraining data plays a crucial role in model performance, we also identify other factors such as general resource availability, language family, and script type, as important features.
arXiv Detail & Related papers (2023-10-09T04:48:14Z)
Towards Best Practices for Training Multilingual Dense Retrieval Models [54.91016739123398]
We focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design. Our study is organized as a "best practices" guide for training multilingual dense retrieval models.
arXiv Detail & Related papers (2022-04-05T17:12:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.