ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios
- URL: http://arxiv.org/abs/2507.22947v1
- Date: Sun, 27 Jul 2025 15:20:19 GMT
- Title: ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios
- Authors: Shou'ang Wei, Xinyun Wang, Shuzhen Bi, Jian Chen, Ruijia Li, Bo Jiang, Xin Lin, Min Zhang, Yu Song, BingDong Li, Aimin Zhou, Hao Hao,
- Abstract summary: Large Language Models (LLMs) present transformative opportunities for education, generating numerous novel application scenarios.<n>Current benchmarks predominantly measure general intelligence rather than pedagogical capabilities.<n>We introduce ELMES, an open-source automated evaluation framework specifically designed for assessing LLMs in educational settings.
- Score: 23.549720214649476
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emergence of Large Language Models (LLMs) presents transformative opportunities for education, generating numerous novel application scenarios. However, significant challenges remain: evaluation metrics vary substantially across different educational scenarios, while many emerging scenarios lack appropriate assessment metrics. Current benchmarks predominantly measure general intelligence rather than pedagogical capabilities. To address this gap, we introduce ELMES, an open-source automated evaluation framework specifically designed for assessing LLMs in educational settings. ELMES features a modular architecture that enables researchers to create dynamic, multi-agent dialogues through simple configuration files, facilitating flexible scenario design without requiring extensive programming expertise. The framework incorporates a hybrid evaluation engine that objectively quantifies traditionally subjective pedagogical metrics using an LLM-as-a-Judge methodology. We conduct systematic benchmarking of state-of-the-art LLMs across four critical educational scenarios: Knowledge Point Explanation, Guided Problem-Solving Teaching, Interdisciplinary Lesson Plan Generation, and Contextualized Question Generation, employing fine-grained metrics developed in collaboration with education specialists. Our results demonstrate distinct capability distributions among models, revealing context-specific strengths and limitations. ELMES provides educators and researchers with an accessible evaluation framework that significantly reduces adaptation barriers for diverse educational applications while advancing the practical implementation of LLMs in pedagogy. The framework is publicly available at \emph{https://github.com/sii-research/elmes.git}.
Related papers
- OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics [101.78963920333342]
We introduce OpenUnlearning, a standardized framework for benchmarking large language models (LLMs) unlearning methods and metrics.<n>OpenUnlearning integrates 9 unlearning algorithms and 16 diverse evaluations across 3 leading benchmarks.<n>We also benchmark diverse unlearning methods and provide a comparative analysis against an extensive evaluation suite.
arXiv Detail & Related papers (2025-06-14T20:16:37Z) - MLLM-CL: Continual Learning for Multimodal Large Language Models [62.90736445575181]
We introduce MLLM-CL, a novel benchmark encompassing domain and ability continual learning.<n>Our approach can integrate domain-specific knowledge and functional abilities with minimal forgetting, significantly outperforming existing methods.
arXiv Detail & Related papers (2025-06-05T17:58:13Z) - Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning [19.4760649326684]
Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines.<n>With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings.<n>Existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks.
arXiv Detail & Related papers (2025-05-16T11:01:01Z) - Enhanced Bloom's Educational Taxonomy for Fostering Information Literacy in the Era of Large Language Models [16.31527042425208]
This paper proposes an LLM-driven Bloom's Educational Taxonomy that aims to recognize and evaluate students' information literacy (IL) with Large Language Models (LLMs)<n>The framework delineates the IL corresponding to the cognitive abilities required to use LLM into two distinct stages: Exploration & Action and Creation & Metacognition.
arXiv Detail & Related papers (2025-03-25T08:23:49Z) - A Novel Psychometrics-Based Approach to Developing Professional Competency Benchmark for Large Language Models [0.0]
We propose a comprehensive approach to benchmark development based on rigorous psychometric principles.
We make the first attempt to illustrate this approach by creating a new benchmark in the field of pedagogy and education.
We construct a novel benchmark guided by the Bloom's taxonomy and rigorously designed by a consortium of education experts trained in test development.
arXiv Detail & Related papers (2024-10-29T19:32:43Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs)<n>MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.<n>It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z) - Scalable Language Model with Generalized Continual Learning [58.700439919096155]
The Joint Adaptive Re-ization (JARe) is integrated with Dynamic Task-related Knowledge Retrieval (DTKR) to enable adaptive adjustment of language models based on specific downstream tasks.
Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting.
arXiv Detail & Related papers (2024-04-11T04:22:15Z) - AutoTutor meets Large Language Models: A Language Model Tutor with Rich Pedagogy and Guardrails [43.19453208130667]
Large Language Models (LLMs) have found several use cases in education, ranging from automatic question generation to essay evaluation.
In this paper, we explore the potential of using Large Language Models (LLMs) to author Intelligent Tutoring Systems.
We create a sample end-to-end tutoring system named MWPTutor, which uses LLMs to fill in the state space of a pre-defined finite state transducer.
arXiv Detail & Related papers (2024-02-14T14:53:56Z) - Solution-oriented Agent-based Models Generation with Verifier-assisted
Iterative In-context Learning [10.67134969207797]
Agent-based models (ABMs) stand as an essential paradigm for proposing and validating hypothetical solutions or policies.
Large language models (LLMs) encapsulating cross-domain knowledge and programming proficiency could potentially alleviate the difficulty of this process.
We present SAGE, a general solution-oriented ABM generation framework designed for automatic modeling and generating solutions for targeted problems.
arXiv Detail & Related papers (2024-02-04T07:59:06Z) - Can Large Language Models be Trusted for Evaluation? Scalable
Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.