Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy
- URL: http://arxiv.org/abs/2409.17904v1
- Date: Thu, 26 Sep 2024 14:51:40 GMT
- Title: Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy
- Authors: Owen Henkel, Hannah Horne-Robinson, Maria Dyshel, Nabil Ch, Baptiste Moreau-Pernet, Ralph Abood,
- Abstract summary: This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori.
We conduct two experiments to evaluate the use of large language models (LLM) for grading challenging student answers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori, a learning platform used by students in several African countries and conducts two experiments to evaluate the use of large language models (LLM) for grading particularly challenging student answers. The AMMORE dataset enables various potential analyses and provides an important resource for researching student math acquisition in understudied, real-world, educational contexts. In experiment 1 we use a variety of LLM-driven approaches, including zero-shot, few-shot, and chain-of-thought prompting, to grade the 1% of student answers that a rule-based classifier fails to grade accurately. We find that the best-performing approach -- chain-of-thought prompting -- accurately scored 92% of these edge cases, effectively boosting the overall accuracy of the grading from 98.7% to 99.9%. In experiment 2, we aim to better understand the consequential validity of the improved grading accuracy, by passing grades generated by the best-performing LLM-based approach to a Bayesian Knowledge Tracing (BKT) model, which estimated student mastery of specific lessons. We find that relatively modest improvements in model accuracy at the individual question level can lead to significant changes in the estimation of student mastery. Where the rules-based classifier currently used to grade student, answers misclassified the mastery status of 6.9% of students across their completed lessons, using the LLM chain-of-thought approach this misclassification rate was reduced to 2.6% of students. Taken together, these findings suggest that LLMs could be a valuable tool for grading open-response questions in K-12 mathematics education, potentially enabling encouraging wider adoption of open-ended questions in formative assessment.
Related papers
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - Investigating Automatic Scoring and Feedback using Large Language Models [46.1232919707345]
This paper explores the efficacy of PEFT-based quantized models, employing classification or regression head, to fine-tune language models for automatic grading and feedback generation.
The results show that prediction of grade scores via finetuned LLMs are highly accurate, achieving less than 3% error in grade percentage on average.
arXiv Detail & Related papers (2024-05-01T16:13:54Z) - BAL: Balancing Diversity and Novelty for Active Learning [53.289700543331925]
We introduce a novel framework, Balancing Active Learning (BAL), which constructs adaptive sub-pools to balance diverse and uncertain data.
Our approach outperforms all established active learning methods on widely recognized benchmarks by 1.20%.
arXiv Detail & Related papers (2023-12-26T08:14:46Z) - A Predictive Model using Machine Learning Algorithm in Identifying
Students Probability on Passing Semestral Course [0.0]
This study employs classification for data mining techniques, and decision tree for algorithm.
With the utilization of the newly discovered predictive model, the prediction of students probabilities to pass the current courses they take gives 0.7619 accuracy, 0.8333 precision, 0.8823 recall, and 0.8571 f1 score.
arXiv Detail & Related papers (2023-04-12T01:57:08Z) - ASPEST: Bridging the Gap Between Active Learning and Selective
Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z) - Exploring Bayesian Deep Learning for Urgent Instructor Intervention Need
in MOOC Forums [58.221459787471254]
Massive Open Online Courses (MOOCs) have become a popular choice for e-learning thanks to their great flexibility.
Due to large numbers of learners and their diverse backgrounds, it is taxing to offer real-time support.
With the large volume of posts and high workloads for MOOC instructors, it is unlikely that the instructors can identify all learners requiring intervention.
This paper explores for the first time Bayesian deep learning on learner-based text posts with two methods: Monte Carlo Dropout and Variational Inference.
arXiv Detail & Related papers (2021-04-26T15:12:13Z) - LANA: Towards Personalized Deep Knowledge Tracing Through
Distinguishable Interactive Sequences [21.67751919579854]
We propose Leveled Attentive KNowledge TrAcing (LANA) to predict students' responses to future questions.
It uses a novel student-related features extractor (SRFE) to distill students' unique inherent properties from their respective interactive sequences.
With pivot module reconstructed the decoder for individual students and leveled learning specialized encoders for groups, personalized DKT was achieved.
arXiv Detail & Related papers (2021-04-21T02:57:42Z) - Extracurricular Learning: Knowledge Transfer Beyond Empirical
Distribution [17.996541285382463]
We propose extracurricular learning to bridge the gap between a compressed student model and its teacher.
We conduct rigorous evaluations on regression and classification tasks and show that compared to the standard knowledge distillation, extracurricular learning reduces the gap by 46% to 68%.
This leads to major accuracy improvements compared to the empirical risk minimization-based training for various recent neural network architectures.
arXiv Detail & Related papers (2020-06-30T18:21:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.