Related papers: Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

URL: http://arxiv.org/abs/2411.07140v2
Date: Wed, 13 Nov 2024 16:27:43 GMT
Title: Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
Authors: Yancheng He, Shilong Li, Jiaheng Liu, Yingshui Tan, Weixun Wang, Hui Huang, Xingyuan Bu, Hangyu Guo, Chengwei Hu, Boren Zheng, Zhuoran Lin, Xuepeng Liu, Dekai Sun, Shirong Lin, Zhicheng Zheng, Xiaoyong Zhu, Wenbo Su, Bo Zheng
Abstract summary: Chinese SimpleQA is the first comprehensive Chinese benchmark to evaluate the factuality ability of language models to answer short questions. We focus on the Chinese language over 6 major topics with 99 diverse subtopics.
Score: 24.47838086336772
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: New LLM evaluation benchmarks are important to align with the rapid development of Large Language Models (LLMs). In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark to evaluate the factuality ability of language models to answer short questions, and Chinese SimpleQA mainly has five properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate). Specifically, first, we focus on the Chinese language over 6 major topics with 99 diverse subtopics. Second, we conduct a comprehensive quality control process to achieve high-quality questions and answers, where the reference answers are static and cannot be changed over time. Third, following SimpleQA, the questions and answers are very short, and the grading process is easy-to-evaluate based on OpenAI API. Based on Chinese SimpleQA, we perform a comprehensive evaluation on the factuality abilities of existing LLMs. Finally, we hope that Chinese SimpleQA could guide the developers to better understand the Chinese factuality abilities of their models and facilitate the growth of foundation models.

Related papers

Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese [52.98034458924209]
This study investigates whether Large Language Models exhibit differential performance when prompted in two variants of written Chinese. We design two benchmark tasks that reflect real-world scenarios: regional term choice and regional name choice. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language.
arXiv Detail & Related papers (2025-05-28T17:56:49Z)
ChineseSimpleVQA -- "See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models [38.921977141721605]
We introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA. Key features of this benchmark include a focus on the Chinese language, diverse knowledge types, a multi-hop question construction, high-quality data, static consistency, and easy-to-evaluate through short answers.
arXiv Detail & Related papers (2025-02-17T12:02:23Z)
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish [54.51310112013655]
We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models.
arXiv Detail & Related papers (2024-07-17T08:28:55Z)
Measuring Taiwanese Mandarin Language Understanding [24.581360653015423]
We present TMLU, a holistic evaluation suite tailored for assessing the advanced knowledge and reasoning capability in large language models (LLMs). TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels.
arXiv Detail & Related papers (2024-03-29T13:56:21Z)
Let LLMs Take on the Latest Challenges! A Chinese Dynamic Question Answering Benchmark [69.3415799675046]
We introduce CDQA, a Chinese Dynamic QA benchmark containing question-answer pairs related to the latest news on the Chinese Internet. We obtain high-quality data through a pipeline that combines humans and models. We have also evaluated and analyzed mainstream and advanced Chinese LLMs on CDQA.
arXiv Detail & Related papers (2024-02-29T15:22:13Z)
CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances. To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models [76.88692952308084]
M3Exam is a benchmark for evaluating large language models (LLMs) in a multilingual, multimodal, and multilevel context. M3Exam contains 12,317 questions in 9 diverse languages with three educational levels. We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text.
arXiv Detail & Related papers (2023-06-08T13:21:29Z)
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models [58.42279750824907]
We present C-Eval, the first comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context. C-Eval comprises multiple-choice questions across four difficulty levels: middle school, high school, college, and professional. We conduct a comprehensive evaluation of the most advanced LLMs on C-Eval, including both English- and Chinese-oriented models.
arXiv Detail & Related papers (2023-05-15T03:20:19Z)
Intrinsic Knowledge Evaluation on Chinese Language Models [5.293979881130493]
This paper proposes four tasks on syntactic, semantic, commonsense, and factual knowledge, aggregating to a total of 39,308 questions. Our probes and knowledge data prove to be a reliable benchmark for evaluating pre-trained Chinese LMs.
arXiv Detail & Related papers (2020-11-29T04:34:39Z)
CLUE: A Chinese Language Understanding Evaluation Benchmark [41.86950255312653]
We introduce the first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark. CLUE brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension. We report scores using an exhaustive set of current state-of-the-art pre-trained Chinese models.
arXiv Detail & Related papers (2020-04-13T15:02:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.