EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education
- URL: http://arxiv.org/abs/2512.00290v1
- Date: Sat, 29 Nov 2025 03:09:50 GMT
- Title: EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education
- Authors: Guoqing Ma, Jia Zhu, Hanghui Guo, Weijie Shi, Yue Cui, Jiawei Shen, Zilong Li, Yidan Liang,
- Abstract summary: We introduce EduEval, a comprehensive hierarchical benchmark for evaluating large language models (LLMs) in Chinese K-12 education. EduEval comprises 24 distinct task types with over 11,000 questions spanning primary to high school levels. We evaluate 14 leading LLMs under both zero-shot and few-shot settings, revealing that while models perform well on factual tasks, they struggle with classroom dialogue classification and exhibit inconsistent results in creative content generation.
- Score: 11.130206904690745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) demonstrate significant potential for educational applications. However, their unscrutinized deployment poses risks to educational standards, underscoring the need for rigorous evaluation. We introduce EduEval, a comprehensive hierarchical benchmark for evaluating LLMs in Chinese K-12 education. This benchmark makes three key contributions: (1) Cognitive Framework: We propose the EduAbility Taxonomy, which unifies Bloom's Taxonomy and Webb's Depth of Knowledge to organize tasks across six cognitive dimensions: Memorization, Understanding, Application, Reasoning, Creativity, and Ethics. (2) Authenticity: Our benchmark integrates real exam questions, classroom conversations, student essays, and expert-designed prompts to reflect genuine educational challenges. (3) Scale: EduEval comprises 24 distinct task types with over 11,000 questions spanning primary to high school levels. We evaluate 14 leading LLMs under both zero-shot and few-shot settings, revealing that while models perform well on factual tasks, they struggle with classroom dialogue classification and exhibit inconsistent results in creative content generation. Interestingly, several open-source models outperform proprietary systems on complex educational reasoning. Few-shot prompting shows varying effectiveness across cognitive dimensions, suggesting that different educational objectives require tailored approaches. These findings provide targeted benchmarking metrics for developing LLMs specifically optimized for diverse Chinese educational tasks.
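The evaluation setup described above (14 LLMs, zero-shot and few-shot, scored per cognitive dimension) suggests a loop like the minimal sketch below. The dataset fields, the `ask_model` callable, and exact-match scoring are illustrative assumptions, not the authors' released harness.

```python
# Minimal sketch of a zero-shot vs. few-shot evaluation loop over a
# benchmark organized by cognitive dimension. Field names ("dimension",
# "question", "answer") and `ask_model` are assumptions for illustration.
from collections import defaultdict

def build_prompt(item, exemplars=()):
    """Zero-shot when `exemplars` is empty; few-shot otherwise."""
    parts = [f"Question: {ex['question']}\nAnswer: {ex['answer']}"
             for ex in exemplars]
    parts.append(f"Question: {item['question']}\nAnswer:")
    return "\n\n".join(parts)

def evaluate(items, ask_model, exemplars_by_dim=None):
    """Score exact-match accuracy per cognitive dimension."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        dim = item["dimension"]  # e.g. "Memorization" ... "Ethics"
        shots = (exemplars_by_dim or {}).get(dim, ())
        pred = ask_model(build_prompt(item, shots)).strip()
        correct[dim] += int(pred == item["answer"])
        total[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}
```

Passing `exemplars_by_dim` switches the same loop from zero-shot to few-shot, which makes per-dimension comparisons of the two settings straightforward.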
Related papers
- OpenLearnLM Benchmark: A Unified Framework for Evaluating Knowledge, Skill, and Attitude in Educational Large Language Models [1.1375020040227939]
OpenLearnLM Benchmark is a framework for evaluating large language models. Our benchmark comprises 124K+ items spanning multiple subjects, educational roles, and difficulty levels.
arXiv Detail & Related papers (2026-01-20T11:53:31Z)
- PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models [4.419156740280761]
Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. We present the framework "PustakAI" ("Pustak" means "book" in many Indian languages). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, as sketched below.
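As a rough illustration of those three prompting styles, the sketch below constructs each variant; the template wording is an assumption for demonstration, not PustakAI's actual prompts.

```python
# Illustrative construction of the three prompting styles named above.
# All template wording is hypothetical.

def meta_prompt(question):
    # A system-style instruction framing the model's role before the task.
    return ("You are a tutor answering strictly from the prescribed "
            f"curriculum.\n\nQuestion: {question}\nAnswer:")

def few_shot_prompt(question, exemplars):
    # Prepend worked question/answer pairs as in-context examples.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"

def cot_prompt(question):
    # Ask the model to reason step by step before committing to an answer.
    return f"Q: {question}\nLet's think step by step."
```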
arXiv Detail & Related papers (2025-11-13T06:12:12Z)
- EduDial: Constructing a Large-scale Multi-turn Teacher-Student Dialogue Corpus [59.693733170193944]
We present EduDial, a comprehensive multi-turn teacher-student dialogue dataset. EduDial covers 345 core knowledge points and consists of 34,250 dialogue sessions generated through interactions between teacher and student agents.
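A minimal sketch of the agent-driven generation idea described above: two LLM "agents" alternate teacher and student turns around one knowledge point. The `ask_model` callable, role prompts, and turn budget are illustrative assumptions, not EduDial's pipeline.

```python
# Hypothetical two-agent loop for generating a multi-turn dialogue.
def generate_dialogue(knowledge_point, ask_model, turns=4):
    """Alternate teacher and student turns around one knowledge point."""
    history = []
    for _ in range(turns):
        teacher = ask_model(
            f"You are a teacher explaining: {knowledge_point}\n"
            + "\n".join(history) + "\nTeacher:")
        history.append(f"Teacher: {teacher}")
        student = ask_model(
            f"You are a student learning: {knowledge_point}\n"
            + "\n".join(history) + "\nStudent:")
        history.append(f"Student: {student}")
    return history
```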
arXiv Detail & Related papers (2025-10-14T18:18:43Z)
- MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams [50.293164501645975]
Multimodal large language models (MLLMs) integrate language and visual cues for problem-solving. Current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge. We introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines.
arXiv Detail & Related papers (2025-08-09T06:21:10Z)
- ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios [23.549720214649476]
Large Language Models (LLMs) present transformative opportunities for education, generating numerous novel application scenarios. Current benchmarks predominantly measure general intelligence rather than pedagogical capabilities. We introduce ELMES, an open-source automated evaluation framework specifically designed for assessing LLMs in educational settings.
arXiv Detail & Related papers (2025-07-27T15:20:19Z)
- Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models [13.790068801864855]
Edu-Values is the first Chinese education values evaluation benchmark, covering seven core values: professional philosophy, teachers' professional ethics, education laws and regulations, cultural literacy, educational knowledge and skills, basic competencies, and subject knowledge.
arXiv Detail & Related papers (2024-09-19T13:02:54Z)
- LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills. We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)
- FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models [64.11333762954283]
This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs.
We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses.
Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities.
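CircularEval-style protocols typically rotate the answer options of a multiple-choice item and count the item correct only if the model identifies the right content under every rotation, which suppresses position bias. A hedged sketch of that idea follows; `ask_model` and the prompt format are assumptions, and FoundaBench's exact protocol may differ.

```python
# Sketch of a circular-evaluation check for a multiple-choice question:
# every circular shift of the options must be answered correctly.
import string

def circular_eval(question, options, answer_idx, ask_model):
    """True only if the model is correct under every circular shift."""
    n = len(options)
    letters = string.ascii_uppercase[:n]
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        prompt = (question + "\n"
                  + "\n".join(f"{l}. {o}" for l, o in zip(letters, rotated))
                  + "\nAnswer with a single letter:")
        picked = ask_model(prompt).strip()[:1].upper()
        gold = letters[(answer_idx - shift) % n]  # gold option's new slot
        if picked != gold:
            return False  # one failed rotation fails the whole item
    return True
```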
arXiv Detail & Related papers (2024-04-29T01:49:07Z)
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
- KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA).
We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks.
We use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z)