LLM-Driven Rubric-Based Assessment of Algebraic Competence in Multi-Stage Block Coding Tasks with Design and Field Evaluation
- URL: http://arxiv.org/abs/2510.06253v1
- Date: Sat, 04 Oct 2025 01:00:33 GMT
- Title: LLM-Driven Rubric-Based Assessment of Algebraic Competence in Multi-Stage Block Coding Tasks with Design and Field Evaluation
- Authors: Yong Oh Lee, Byeonghun Bang, Sejun Oh
- Abstract summary: This study proposes and evaluates a rubric-based assessment framework powered by a large language model (LLM). The problem set, designed by mathematics education experts, aligns each problem segment with five predefined rubric dimensions. The study integrated learner self-assessments and expert ratings to benchmark the system's outputs.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As online education platforms continue to expand, there is a growing need for assessment methods that not only measure answer accuracy but also capture the depth of students' cognitive processes in alignment with curriculum objectives. This study proposes and evaluates a rubric-based assessment framework powered by a large language model (LLM) for measuring algebraic competence in real-world-context block coding tasks. The problem set, designed by mathematics education experts, aligns each problem segment with five predefined rubric dimensions, enabling the LLM to assess both the correctness and the quality of students' problem-solving processes. The system was implemented on an online platform that records all intermediate responses and employs the LLM for rubric-aligned achievement evaluation. To examine the practical effectiveness of the proposed framework, we conducted a field study involving 42 middle school students engaged in multi-stage quadratic equation tasks with block coding. The study integrated learner self-assessments and expert ratings to benchmark the system's outputs. The LLM-based rubric evaluation showed strong agreement with expert judgments and consistently produced rubric-aligned, process-oriented feedback. These results demonstrate both the validity and scalability of incorporating LLM-driven rubric assessment into online mathematics and STEM education platforms.
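A minimal sketch may help make the rubric-aligned evaluation step concrete. The code below is illustrative only, not the authors' implementation: the five dimension names, the prompt wording, the 0-4 score scale, and the `call_llm` placeholder are all assumptions that would be replaced by the paper's expert-designed rubric and the platform's actual LLM client.

```python
import json

# Hypothetical rubric dimensions; the paper uses five expert-designed
# dimensions, but their exact names are not listed in the abstract.
RUBRIC_DIMENSIONS = [
    "problem_comprehension",
    "algebraic_representation",
    "procedural_accuracy",
    "block_coding_implementation",
    "interpretation_of_results",
]

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call; returns a canned response here."""
    return json.dumps({d: {"score": 3, "feedback": "..."} for d in RUBRIC_DIMENSIONS})

def score_segment(problem_segment: str, intermediate_responses: list[str]) -> dict:
    """Ask the LLM to rate one problem segment on every rubric dimension (0-4)."""
    prompt = (
        "You are grading a middle-school student's multi-stage block-coding "
        "solution to a quadratic-equation task.\n"
        f"Problem segment:\n{problem_segment}\n\n"
        "Recorded intermediate responses:\n" + "\n".join(intermediate_responses) + "\n\n"
        "For each rubric dimension, return JSON with an integer score (0-4) and "
        "one sentence of process-oriented feedback. Dimensions: "
        + ", ".join(RUBRIC_DIMENSIONS)
    )
    return json.loads(call_llm(prompt))

if __name__ == "__main__":
    result = score_segment(
        "Stage 2: build blocks that compute the discriminant b^2 - 4ac.",
        ["set d = b*b - 4*a*c", "if d < 0 then show 'no real roots'"],
    )
    for dimension, item in result.items():
        print(f"{dimension}: {item['score']}")
```

Per-dimension scores produced this way can then be compared against expert ratings and learner self-assessments, as the field study does, to estimate agreement.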
Related papers
- AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field [12.465017512854475]
Large language models (LLMs) are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. This paper establishes AECBench, a benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark defines 23 representative tasks within a five-level cognition-oriented evaluation framework.
arXiv Detail & Related papers (2025-09-23T08:09:58Z) - ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios [23.549720214649476]
Large Language Models (LLMs) present transformative opportunities for education, generating numerous novel application scenarios. Current benchmarks predominantly measure general intelligence rather than pedagogical capabilities. We introduce ELMES, an open-source automated evaluation framework specifically designed for assessing LLMs in educational settings.
arXiv Detail & Related papers (2025-07-27T15:20:19Z) - OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics [101.78963920333342]
We introduce OpenUnlearning, a standardized framework for benchmarking large language model (LLM) unlearning methods and metrics. OpenUnlearning integrates 9 unlearning algorithms and 16 diverse evaluations across 3 leading benchmarks. We also benchmark diverse unlearning methods and provide a comparative analysis against an extensive evaluation suite.
arXiv Detail & Related papers (2025-06-14T20:16:37Z) - Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning [19.4760649326684]
Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. Existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks.
arXiv Detail & Related papers (2025-05-16T11:01:01Z) - Evaluating Large Language Models for Real-World Engineering Tasks [75.97299249823972]
This paper introduces a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios. Using this dataset, we evaluate four state-of-the-art Large Language Models (LLMs). Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.
arXiv Detail & Related papers (2025-05-12T14:05:23Z) - MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z) - CIBench: Evaluating Your LLMs with a Code Interpreter Plugin [68.95137938214862]
We propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks.
The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions.
We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
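As a rough illustration of the interactive, stateful evaluation style CIBench describes, the sketch below executes consecutive "cells" of generated code in a shared namespace, mimicking a single IPython session, and checks the final state against an expected value. It is a simplified stand-in, not the CIBench harness; the task, variable names, and pass/fail criterion are assumptions.

```python
from typing import Any

def run_session(cells: list[str]) -> dict[str, Any]:
    """Execute consecutive code cells in one shared namespace, notebook-style."""
    namespace: dict[str, Any] = {}
    for cell in cells:
        exec(cell, namespace)  # state persists across cells, like an IPython session
    return namespace

def score_task(generated_cells: list[str], target_var: str, expected: Any) -> bool:
    """Pass if the session runs without error and produces the expected value."""
    try:
        final_state = run_session(generated_cells)
    except Exception:
        return False  # runtime errors count as task failure
    return final_state.get(target_var) == expected

if __name__ == "__main__":
    cells = [
        "import statistics",
        "data = [2, 4, 4, 4, 5, 5, 7, 9]",
        "mean_value = statistics.mean(data)",
    ]
    print(score_task(cells, "mean_value", 5))  # expected: True
```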
arXiv Detail & Related papers (2024-07-15T07:43:55Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. The question of how reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
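A bare-bones sketch of the decompose-then-aggregate idea: the overall judgment is split into separate criteria, each scored independently, then combined with explicit weights. The criteria, weights, and the stubbed per-criterion judge are hypothetical and are not taken from the DnA-Eval paper.

```python
# Hypothetical criteria and weights for decomposing an evaluation.
CRITERIA_WEIGHTS = {"factual_accuracy": 0.4, "relevance": 0.3, "clarity": 0.3}

def judge_criterion(question: str, answer: str, criterion: str) -> float:
    """Placeholder for one LLM call that scores a single criterion in [0, 1]."""
    return 0.8  # stubbed score; a real judge would prompt an LLM per criterion

def aggregate_score(question: str, answer: str) -> float:
    """Combine per-criterion scores into a final evaluation via a weighted sum."""
    return sum(
        weight * judge_criterion(question, answer, criterion)
        for criterion, weight in CRITERIA_WEIGHTS.items()
    )

if __name__ == "__main__":
    print(round(aggregate_score("What causes tides?", "Mainly the Moon's gravity."), 2))
```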
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) show excellent performance and have wide practical uses.
Existing evaluation tasks struggle to keep pace with the wide range of applications encountered in real-world scenarios.
We summarize four core competencies of LLMs: reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect the corresponding ability, and new tasks can easily be added to the system, as sketched below.
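A toy registry can illustrate the competency-oriented organization described above: each evaluation task is filed under one competency, and new tasks are registered without changing the surrounding framework. The competency names follow the summary; the task names are examples only.

```python
from collections import defaultdict

COMPETENCIES = {"reasoning", "knowledge", "reliability", "safety"}
registry: dict[str, list[str]] = defaultdict(list)

def register_task(competency: str, task_name: str) -> None:
    """File a task under one of the four core competencies."""
    if competency not in COMPETENCIES:
        raise ValueError(f"unknown competency: {competency}")
    registry[competency].append(task_name)

register_task("reasoning", "grade_school_math_word_problems")
register_task("knowledge", "multi_subject_exam_questions")
register_task("safety", "refusal_of_harmful_requests")  # illustrative task names
print(dict(registry))
```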
arXiv Detail & Related papers (2023-08-15T17:40:34Z) - A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)