FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models
- URL: http://arxiv.org/abs/2504.14690v1
- Date: Sun, 20 Apr 2025 17:43:47 GMT
- Title: FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models
- Authors: Mehrnoush Shamsfard, Zahra Saaberi, Mostafa Karimi manesh, Seyed Mohammad Hossein Hashemi, Zahra Vatankhah, Motahareh Ramezani, Niki Pourazin, Tara Zare, Maryam Azimi, Sarina Chitsaz, Sama Khoraminejad, Morteza Mahdavi Mortazavi, Mohammad Mahdi Chizari, Sahar Maleki, Seyed Soroush Majd, Mostafa Masumi, Sayed Ali Musavi Khoeini, Amir Mohseni, Sogol Alipour,
- Abstract summary: This paper introduces the FarsEval-PKBETS benchmark, a subset of the FarsEval project for evaluating large language models in Persian. The benchmark consists of 4000 questions and answers in various formats, including multiple choice, short answer and descriptive responses. It covers a wide range of domains and tasks, including medicine, law, religion, Persian language, encyclopedic knowledge, human preferences, social knowledge, ethics and bias, text generation, and respecting others' rights.
- Score: 0.5221124918965586
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Research on evaluating and analyzing large language models (LLMs) has been extensive for resource-rich languages such as English, yet their performance in languages such as Persian has received considerably less attention. This paper introduces the FarsEval-PKBETS benchmark, a subset of the FarsEval project for evaluating large language models in Persian. This benchmark consists of 4000 questions and answers in various formats, including multiple choice, short answer and descriptive responses. It covers a wide range of domains and tasks, including medicine, law, religion, Persian language, encyclopedic knowledge, human preferences, social knowledge, ethics and bias, text generation, and respecting others' rights. This benchmark incorporates linguistic, cultural, and local considerations relevant to the Persian language and Iran. To ensure the questions are challenging for current LLMs, three models -- Llama3-70B, PersianMind, and Dorna -- were evaluated using this benchmark. Their average accuracy was below 50%, meaning they provided fully correct answers to fewer than half of the questions. These results indicate that current language models are still far from being able to solve this benchmark.
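As a rough illustration of how accuracy on a mixed-format benchmark like this might be computed, the sketch below scores multiple-choice and short-answer items by exact match and skips descriptive responses (which would need human or model-based grading). The file name and JSON field names are assumptions for illustration only, not the released FarsEval-PKBETS schema or the authors' evaluation code.

```python
# Minimal scoring sketch for a mixed-format QA benchmark.
# Assumed (hypothetical) JSONL schema per line:
#   {"type": "multiple_choice" | "short_answer" | "descriptive",
#    "answer": "...", "prediction": "..."}
import json


def exact_match(pred: str, gold: str) -> bool:
    """Strict string match after basic normalization."""
    return pred.strip().lower() == gold.strip().lower()


def score_file(path: str) -> float:
    correct, total = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            if item["type"] == "descriptive":
                continue  # free-form answers are out of scope for exact match
            correct += exact_match(item["prediction"], item["answer"])
            total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # "farseval_pkbets.jsonl" is a placeholder path, not an official release file.
    print(f"accuracy = {score_file('farseval_pkbets.jsonl'):.3f}")
```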
Related papers
- MILU: A Multi-task Indic Language Understanding Benchmark [7.652738829153342]
We introduce MILU, a comprehensive evaluation benchmark designed to assess Large Language Models in Indic languages. With an India-centric design, MILU incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics. Open multilingual models outperform language-specific fine-tuned models, which perform only slightly better than random baselines.
arXiv Detail & Related papers (2024-11-04T19:17:17Z) - One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We present the first study aimed at objectively assessing the fairness and robustness of Large Language Models (LLMs) in handling dialects in canonical reasoning tasks. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. Our findings reveal that almost all of these widely used models show significant brittleness and unfairness to queries in AAVE.
arXiv Detail & Related papers (2024-10-14T18:44:23Z) - The Qiyas Benchmark: Measuring ChatGPT Mathematical and Language Understanding in Arabic [0.0]
We introduce two novel benchmarks designed to evaluate models' mathematical reasoning and language understanding abilities in Arabic.
These benchmarks are derived from a General Aptitude Test (GAT) called Qiyas exam, a standardized test widely used for university admissions in Saudi Arabia.
arXiv Detail & Related papers (2024-06-28T16:34:31Z) - CaLMQA: Exploring culturally specific long-form question answering across 23 languages [58.18984409715615]
CaLMQA is a collection of 1.5K culturally specific questions spanning 23 languages and 51 questions culturally translated from English into 22 other languages.
We collect naturally-occurring questions from community web forums and hire native speakers to write questions to cover under-studied languages such as Fijian and Kirundi.
Our dataset contains diverse, complex questions that reflect cultural topics (e.g. traditions, laws, news) and the language usage of native speakers.
arXiv Detail & Related papers (2024-06-25T17:45:26Z) - Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language? [3.4812080203308984]
Khayyam Challenge (also known as PersianMMLU) is a collection of 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations.
The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language.
arXiv Detail & Related papers (2024-04-09T22:38:13Z) - Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT [4.574416868427695]
This paper explores the efficacy of large language models (LLMs) for Persian.
We present the first comprehensive benchmarking study of LLMs across diverse Persian language tasks.
arXiv Detail & Related papers (2024-04-03T02:12:29Z) - Instruction-Following Evaluation for Large Language Models [52.90926820437014]
We introduce Instruction-Following Eval (IFEval) for large language models.
IFEval is a straightforward and easy-to-reproduce evaluation benchmark.
We show evaluation results of two widely available LLMs on the market.
arXiv Detail & Related papers (2023-11-14T05:13:55Z) - Evaluating and Modeling Attribution for Cross-Lingual Question Answering [80.4807682093432]
This work is the first to study attribution for cross-lingual question answering.
We collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system.
We find that a substantial portion of the answers is not attributable to any retrieved passages.
arXiv Detail & Related papers (2023-05-23T17:57:46Z) - SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data created in a resource-rich language, e.g., English, to other, less resource-rich languages.
arXiv Detail & Related papers (2021-02-20T03:52:08Z) - ParsiNLU: A Suite of Language Understanding Challenges for Persian [23.26176232463948]
This work focuses on the Persian language, one of the most widely spoken languages in the world.
There are few NLU datasets available for this rich language.
ParsiNLU is the first benchmark in Persian language that includes a range of high-level tasks.
arXiv Detail & Related papers (2020-12-11T06:31:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.