C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for
Foundation Models
- URL: http://arxiv.org/abs/2305.08322v3
- Date: Mon, 6 Nov 2023 13:24:16 GMT
- Title: C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for
Foundation Models
- Authors: Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang,
Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu,
Maosong Sun, Junxian He
- Abstract summary: We present C-Eval, the first comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context.
C-Eval comprises multiple-choice questions across four difficulty levels: middle school, high school, college, and professional.
We conduct a comprehensive evaluation of the most advanced LLMs on C-Eval, including both English- and Chinese-oriented models.
- Score: 58.42279750824907
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: New NLP benchmarks are urgently needed to align with the rapid development of
large language models (LLMs). We present C-Eval, the first comprehensive
Chinese evaluation suite designed to assess advanced knowledge and reasoning
abilities of foundation models in a Chinese context. C-Eval comprises
multiple-choice questions across four difficulty levels: middle school, high
school, college, and professional. The questions span 52 diverse disciplines,
ranging from humanities to science and engineering. C-Eval is accompanied by
C-Eval Hard, a subset of very challenging subjects in C-Eval that requires
advanced reasoning abilities to solve. We conduct a comprehensive evaluation of
the most advanced LLMs on C-Eval, including both English- and Chinese-oriented
models. Results indicate that only GPT-4 could achieve an average accuracy of
over 60%, suggesting that there is still significant room for improvement for
current LLMs. We anticipate C-Eval will help analyze important strengths and
shortcomings of foundation models, and foster their development and growth for
Chinese users.
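Since C-Eval items are standard multiple-choice questions, evaluation reduces to option-letter accuracy over the answer key. A minimal sketch of that scoring step (the gold answers and predictions below are toy placeholders, not actual C-Eval data):

```python
# Minimal sketch of multiple-choice accuracy scoring as used by
# C-Eval-style benchmarks. The real suite ships its own questions and
# answer keys; the lists below are toy placeholders.

def score(predictions, answer_key):
    """Return accuracy of predicted option letters against the gold key."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer key must align")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

# Toy example: four questions with options A-D.
gold = ["A", "C", "B", "D"]
preds = ["A", "C", "D", "D"]
print(score(preds, gold))  # 0.75
```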
Related papers
- MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models [0.5822010906632046]
MultiPragEval is a robust test suite designed for the multilingual pragmatic evaluation of LLMs across English, German, Korean, and Chinese.
Our findings demonstrate that Claude3-Opus significantly outperforms other models in all tested languages.
arXiv Detail & Related papers (2024-06-11T21:46:03Z)
- FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models [64.11333762954283]
This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs.
We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses.
Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities.
arXiv Detail & Related papers (2024-04-29T01:49:07Z)
- COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning [57.600941792026006]
We introduce COIG-CQIA, a high-quality Chinese instruction tuning dataset.
Our aim is to build a diverse, wide-ranging instruction-tuning dataset to better align model behavior with human interactions.
We train models of various scales on different subsets of CQIA, followed by in-depth evaluation and analysis.
arXiv Detail & Related papers (2024-03-26T19:24:18Z)
- LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models [46.77647640464652]
Chinese Large Language Models (LLMs) have recently demonstrated impressive capabilities across various NLP benchmarks and real-world applications.
We propose LHMKE, a Large-scale, Holistic, and Multi-subject Knowledge Evaluation benchmark.
It encompasses 10,465 questions across 75 tasks covering 30 subjects, ranging from primary school to professional certification exams.
arXiv Detail & Related papers (2024-03-19T10:11:14Z)
- E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models [44.74622336775077]
We introduce the E-EVAL, the first comprehensive evaluation benchmark specifically designed for the Chinese K-12 education field.
The E-EVAL consists of 4,351 multiple-choice questions at the primary, middle, and high school levels across a wide range of subjects, including Chinese, English, Politics, History, Ethics, Physics, Chemistry, Mathematics, and Geography.
Findings show that Chinese-dominant models compare favorably with English-dominant models, with many even scoring above GPT-4. However, almost all models perform poorly in complex subjects such as mathematics.
arXiv Detail & Related papers (2024-01-29T07:34:37Z)
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
- CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
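The CircularEval protocol mentioned in the FoundaBench entry above is commonly described as asking each multiple-choice question once per cyclic rotation of its options, crediting the question only if the model is correct under every rotation; this cancels any bias toward a fixed option position. A minimal sketch under that reading, where `toy_model` is a hypothetical stand-in for a real LLM call, not FoundaBench's actual harness:

```python
# Sketch of a CircularEval-style protocol: ask each question once per
# cyclic rotation of its options and credit it only if every rotation is
# answered correctly. `toy_model` is a hypothetical stand-in for an LLM.

LETTERS = "ABCD"

def circular_eval(question, options, gold_index, model):
    """Return True iff the model picks the gold option under every rotation."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        # After rotating, the gold option moves to a new letter slot.
        gold_letter = LETTERS[(gold_index - shift) % n]
        if model(question, rotated) != gold_letter:
            return False
    return True

def toy_model(question, options):
    # Stand-in "model" that always picks the longest option regardless of
    # its position, so it is immune to position bias by construction.
    best = max(range(len(options)), key=lambda i: len(options[i]))
    return LETTERS[best]

opts = ["cat", "elephant", "dog", "ox"]
print(circular_eval("Largest land animal?", opts, gold_index=1, model=toy_model))  # True
```

A position-biased model (e.g. one that always answers "A") fails this check even when it happens to be right in the original option order, which is exactly the bias the protocol is meant to expose.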
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.