Autograding Mathematical Induction Proofs with Natural Language Processing
- URL: http://arxiv.org/abs/2406.10268v1
- Date: Tue, 11 Jun 2024 15:30:26 GMT
- Title: Autograding Mathematical Induction Proofs with Natural Language Processing
- Authors: Chenyan Zhao, Mariana Silva, Seth Poulsen,
- Abstract summary: We present a set of training methods and models capable of autograding freeform mathematical proofs.
The models are trained using proof data collected from four different proof by induction problems.
We recruit human graders to grade the same proofs as the training data, and find that the best grading model is also more accurate than most human graders.
- Score: 0.12289361708127876
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In mathematical proof education, there remains a need for interventions that help students learn to write mathematical proofs. Research has shown that timely feedback can be very helpful to students learning new skills. While for many years natural language processing models have struggled to perform well on tasks related to mathematical texts, recent developments in natural language processing have created the opportunity to complete the task of giving students instant feedback on their mathematical proofs. In this paper, we present a set of training methods and models capable of autograding freeform mathematical proofs by leveraging existing large language models and other machine learning techniques. The models are trained using proof data collected from four different proof by induction problems. We use four different robust large language models to compare their performances, and all achieve satisfactory performances to various degrees. Additionally, we recruit human graders to grade the same proofs as the training data, and find that the best grading model is also more accurate than most human graders. With the development of these grading models, we create and deploy an autograder for proof by induction problems and perform a user study with students. Results from the study shows that students are able to make significant improvements to their proofs using the feedback from the autograder, but students still do not trust the AI autograders as much as they trust human graders. Future work can improve on the autograder feedback and figure out ways to help students trust AI autograders.
Related papers
- LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z) - Toward In-Context Teaching: Adapting Examples to Students' Misconceptions [54.82965010592045]
We introduce a suite of models and evaluation methods we call AdapT.
AToM is a new probabilistic model for adaptive teaching that jointly infers students' past beliefs and optimize for the correctness of future beliefs.
Our results highlight both the difficulty of the adaptive teaching task and the potential of learned adaptive models for solving it.
arXiv Detail & Related papers (2024-05-07T17:05:27Z) - Autonomous Data Selection with Language Models for Mathematical Texts [13.789739307267952]
We introduce a novel strategy that leverages base language models for autonomous data selection.
Our approach utilizes meta-prompted language models as zero-shot verifiers to evaluate and select high-quality mathematical content autonomously.
Our method showcases a 2 times increase in pretraining token efficiency compared to state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-12T13:09:21Z) - YODA: Teacher-Student Progressive Learning for Language Models [82.0172215948963]
This paper introduces YODA, a teacher-student progressive learning framework.
It emulates the teacher-student education process to improve the efficacy of model fine-tuning.
Experiments show that training LLaMA2 with data from YODA improves SFT with significant performance gain.
arXiv Detail & Related papers (2024-01-28T14:32:15Z) - Baldur: Whole-Proof Generation and Repair with Large Language Models [8.100054850290507]
We use large language models, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once.
We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power.
We evaluate our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs.
arXiv Detail & Related papers (2023-03-08T22:00:15Z) - MOCHA: A Multi-Task Training Approach for Coherent Text Generation from
Cognitive Perspective [22.69509556890676]
We propose a novel multi-task training strategy for coherent text generation grounded on the cognitive theory of writing.
We extensively evaluate our model on three open-ended generation tasks including story generation, news article writing and argument generation.
arXiv Detail & Related papers (2022-10-26T11:55:41Z) - NaturalProver: Grounded Mathematical Proof Generation with Language
Models [84.2064569475095]
Theorem proving in natural mathematical language plays a central role in mathematical advances and education.
We develop NaturalProver, a language model that generates proofs by conditioning on background references.
NaturalProver is capable of proving some theorems that require short (2-6 step) proofs, and providing next-step suggestions that are rated as correct and useful over 40% of the time.
arXiv Detail & Related papers (2022-05-25T17:01:18Z) - Towards Trustworthy AutoGrading of Short, Multi-lingual, Multi-type
Answers [2.2000998828262652]
This study uses a large dataset consisting of about 10 million question-answer pairs from multiple languages.
We show how to improve the accuracy of automatically graded answers, achieving accuracy equivalent to that of teaching assistants.
arXiv Detail & Related papers (2022-01-02T12:17:24Z) - Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model in the whole distillation.
Most of the existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of student models distilled.
arXiv Detail & Related papers (2020-12-11T08:56:39Z) - Learning to Reweight with Deep Interactions [104.68509759134878]
We propose an improved data reweighting algorithm, in which the student model provides its internal states to the teacher model.
Experiments on image classification with clean/noisy labels and neural machine translation empirically demonstrate that our algorithm makes significant improvement over previous methods.
arXiv Detail & Related papers (2020-07-09T09:06:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.