E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for
Large Language Models
- URL: http://arxiv.org/abs/2401.15927v1
- Date: Mon, 29 Jan 2024 07:34:37 GMT
- Title: E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for
Large Language Models
- Authors: Jinchang Hou, Chang Ao, Haihong Wu, Xiangtao Kong, Zhigang Zheng,
Daijia Tang, Chengming Li, Xiping Hu, Ruifeng Xu, Shiwen Ni, Min Yang
- Abstract summary: We introduce the E-EVAL, the first comprehensive evaluation benchmark specifically designed for the Chinese K-12 education field.
The E-EVAL consists of 4,351 multiple-choice questions at the primary, middle, and high school levels across a wide range of subjects, including Chinese, English, Politics, History, Ethics, Physics, Chemistry, Mathematics, and Geography.
Findings show that Chinese-dominant models perform strongly compared to English-dominant models, with many even scoring above GPT-4. However, almost all models perform poorly in complex subjects such as mathematics.
- Score: 44.74622336775077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the accelerating development of Large Language Models (LLMs), many LLMs
are beginning to be used in the Chinese K-12 education domain. The integration
of LLMs and education is growing ever closer; however, there is currently
no benchmark for evaluating LLMs that focuses on the Chinese K-12 education
domain. There is therefore an urgent need for a comprehensive natural language
processing benchmark to accurately assess the capabilities of various LLMs in
the Chinese K-12 education domain. To address this, we introduce the E-EVAL,
the first comprehensive evaluation benchmark specifically designed for the
Chinese K-12 education field. The E-EVAL consists of 4,351 multiple-choice
questions at the primary, middle, and high school levels across a wide range of
subjects, including Chinese, English, Politics, History, Ethics, Physics,
Chemistry, Mathematics, and Geography. We conducted a comprehensive evaluation
of E-EVAL on advanced LLMs, including both English-dominant and
Chinese-dominant models. Findings show that Chinese-dominant models perform
strongly compared to English-dominant models, with many even scoring above
GPT-4. However, almost all models perform poorly in complex subjects such as
mathematics. We also found that most Chinese-dominant LLMs did not achieve
higher scores at the primary school level than at the middle school level.
We observe that a model's mastery of higher-order knowledge does not
necessarily imply mastery of lower-order knowledge. Additionally,
the experimental results indicate that the Chain-of-Thought (CoT) technique is
effective only for challenging science subjects, while few-shot prompting
is more beneficial for liberal arts subjects. With E-EVAL, we aim to analyze
the strengths and limitations of LLMs in educational applications, and to
contribute to the progress and development of Chinese K-12 education and LLMs.
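
Because E-EVAL consists entirely of multiple-choice questions, evaluation reduces to prompting a model and comparing its chosen option letter against the gold label. The Python sketch below illustrates how the few-shot and chain-of-thought (CoT) prompting setups discussed in the abstract might be wired together; it is a minimal sketch, and the MCQ schema, the ask_model callable, and the answer-extraction regex are illustrative assumptions rather than the paper's official evaluation harness.

```python
# Minimal sketch of a multiple-choice evaluation loop in the style of E-EVAL.
# The MCQ schema, ask_model() interface, and answer-extraction regex are
# assumptions for illustration; the official harness may differ.
import re
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class MCQ:
    subject: str        # e.g. "Mathematics"
    question: str
    choices: List[str]  # option texts, mapped to labels A, B, C, D
    answer: str         # gold label, e.g. "B"

def format_question(q: MCQ) -> str:
    opts = "\n".join(f"{label}. {text}" for label, text in zip("ABCD", q.choices))
    return f"{q.question}\n{opts}"

def build_prompt(q: MCQ, shots: Sequence[MCQ] = (), cot: bool = False) -> str:
    parts = [f"The following are multiple-choice questions about {q.subject}."]
    for shot in shots:  # few-shot: prepend worked examples with gold answers
        parts.append(format_question(shot) + f"\nAnswer: {shot.answer}")
    parts.append(format_question(q))
    # CoT asks for reasoning before the final letter; otherwise answer directly.
    parts.append("Let's think step by step, then give the answer as a single letter."
                 if cot else "Answer with a single letter.")
    return "\n\n".join(parts)

def extract_choice(reply: str) -> str:
    # Take the last standalone A-D in the reply, since CoT replies state
    # the final answer after the reasoning.
    matches = re.findall(r"\b([ABCD])\b", reply)
    return matches[-1] if matches else ""

def accuracy(questions: Sequence[MCQ],
             ask_model: Callable[[str], str],
             shots: Sequence[MCQ] = (),
             cot: bool = False) -> float:
    correct = sum(
        extract_choice(ask_model(build_prompt(q, shots, cot))) == q.answer
        for q in questions
    )
    return correct / len(questions)
```

Plugging a real model into ask_model and sweeping the shots and cot arguments per subject would mirror the paper's comparison of CoT prompting on science subjects against few-shot prompting on liberal arts subjects.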
Related papers
- MILU: A Multi-task Indic Language Understanding Benchmark [7.652738829153342]
Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing Large Language Models in Indic languages.
We introduce MILU, a comprehensive evaluation benchmark designed to address this gap.
With an India-centric design, MILU incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics.
arXiv Detail & Related papers (2024-11-04T19:17:17Z)
- Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models [9.761584874383873]
We present Edu-Values, the first Chinese education values evaluation benchmark designed to measure large language models' alignment ability.
We meticulously design and compile 1,418 questions, including multiple-choice, multi-modal question answering, subjective analysis, adversarial prompts, and questions on traditional Chinese culture.
Due to differences in educational culture, Chinese LLMs significantly outperform English LLMs, with Qwen 2 ranking first with a score of 81.37.
arXiv Detail & Related papers (2024-09-19T13:02:54Z)
- LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models [46.77647640464652]
Chinese Large Language Models (LLMs) have recently demonstrated impressive capabilities across various NLP benchmarks and real-world applications.
We propose LHMKE, a Large-scale, Holistic, and Multi-subject Knowledge Evaluation benchmark.
It encompasses 10,465 questions across 75 tasks covering 30 subjects, ranging from primary school to professional certification exams.
arXiv Detail & Related papers (2024-03-19T10:11:14Z)
- Analyzing and Adapting Large Language Models for Few-Shot Multilingual NLU: Are We There Yet? [82.02076369811402]
Supervised fine-tuning (SFT), supervised instruction tuning (SIT) and in-context learning (ICL) are three alternative, de facto standard approaches to few-shot learning.
We present an extensive and systematic comparison of the three approaches, testing them on six high- and low-resource languages, three different NLU tasks, and a myriad of language and domain setups.
Our observations show that supervised instruction tuning has the best trade-off between performance and resource requirements.
arXiv Detail & Related papers (2024-03-04T10:48:13Z)
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
- CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
- C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models [58.42279750824907]
We present C-Eval, the first comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context.
C-Eval comprises multiple-choice questions across four difficulty levels: middle school, high school, college, and professional.
We conduct a comprehensive evaluation of the most advanced LLMs on C-Eval, including both English- and Chinese-oriented models.
arXiv Detail & Related papers (2023-05-15T03:20:19Z)