Related papers: Autograding Mathematical Induction Proofs with Natural Language Processing

URL: http://arxiv.org/abs/2406.10268v1
Date: Tue, 11 Jun 2024 15:30:26 GMT
Title: Autograding Mathematical Induction Proofs with Natural Language Processing
Authors: Chenyan Zhao, Mariana Silva, Seth Poulsen
Abstract summary: We present a set of training methods and models capable of autograding freeform mathematical proofs. The models are trained using proof data collected from four different proof by induction problems. We recruit human graders to grade the same proofs as the training data, and find that the best grading model is also more accurate than most human graders.
Score: 0.12289361708127876
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In mathematical proof education, there remains a need for interventions that help students learn to write mathematical proofs. Research has shown that timely feedback can be very helpful to students learning new skills. While for many years natural language processing models have struggled to perform well on tasks related to mathematical texts, recent developments in natural language processing have created the opportunity to complete the task of giving students instant feedback on their mathematical proofs. In this paper, we present a set of training methods and models capable of autograding freeform mathematical proofs by leveraging existing large language models and other machine learning techniques. The models are trained using proof data collected from four different proof by induction problems. We use four different robust large language models to compare their performances, and all achieve satisfactory performances to various degrees. Additionally, we recruit human graders to grade the same proofs as the training data, and find that the best grading model is also more accurate than most human graders. With the development of these grading models, we create and deploy an autograder for proof by induction problems and perform a user study with students. Results from the study show that students are able to make significant improvements to their proofs using the feedback from the autograder, but students still do not trust the AI autograders as much as they trust human graders. Future work can improve on the autograder feedback and figure out ways to help students trust AI autograders.

Related papers

Proof-RM: A Scalable and Generalizable Reward Model for Math Proof [67.53066972145183]
Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with *Verifiable Rewards* (RLVR). Many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required.
arXiv Detail & Related papers (2026-02-02T17:42:53Z)
Turning Language Model Training from Black Box into a Sandbox [2.8821062918162146]
Browser-based tool allows students to train a small transformer language model entirely on their own device. In a CS1 course, 162 students completed pre- and post-test explanations of why language models sometimes produce incorrect or strange output.
arXiv Detail & Related papers (2026-01-29T12:30:55Z)
Learning to Make MISTAKEs: Modeling Incorrect Student Thinking And Key Errors [58.65143578052761]
This paper presents a new method, MISTAKE, that constructs high-quality synthetic examples of reasoning errors. We evaluate MISTAKE on three educational tasks and find that it results in (1) higher accuracy when simulating incorrect student answers.
arXiv Detail & Related papers (2025-10-13T15:10:38Z)
MathEDU: Towards Adaptive Feedback for Student Mathematical Problem-Solving [3.2962799070467432]
This paper explores the capabilities of large language models (LLMs) to assess students' math problem-solving processes and provide adaptive feedback. We evaluate the model's ability to support personalized learning in two scenarios: one where the model has access to students' prior answer histories, and another simulating a cold-start context.
arXiv Detail & Related papers (2025-05-23T15:59:39Z)
Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora [84.03928547166873]
Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. The BabyLM Challenge is a communal effort in which participants compete to optimize language model training on a fixed data budget.
arXiv Detail & Related papers (2025-04-10T23:22:43Z)
LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier. Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z)
Toward In-Context Teaching: Adapting Examples to Students' Misconceptions [54.82965010592045]
We introduce a suite of models and evaluation methods we call AdapT. AToM is a new probabilistic model for adaptive teaching that jointly infers students' past beliefs and optimizes for the correctness of future beliefs. Our results highlight both the difficulty of the adaptive teaching task and the potential of learned adaptive models for solving it.
arXiv Detail & Related papers (2024-05-07T17:05:27Z)
Autonomous Data Selection with Language Models for Mathematical Texts [13.789739307267952]
We introduce a novel strategy that leverages base language models for autonomous data selection. Our approach utilizes meta-prompted language models as zero-shot verifiers to evaluate and select high-quality mathematical content autonomously. Our method showcases a 2 times increase in pretraining token efficiency compared to state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-12T13:09:21Z)
YODA: Teacher-Student Progressive Learning for Language Models [82.0172215948963]
This paper introduces YODA, a teacher-student progressive learning framework. It emulates the teacher-student education process to improve the efficacy of model fine-tuning. Experiments show that training LLaMA2 with data from YODA improves SFT with significant performance gain.
arXiv Detail & Related papers (2024-01-28T14:32:15Z)
Baldur: Whole-Proof Generation and Repair with Large Language Models [8.100054850290507]
We use large language models, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once. We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power. We evaluate our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs.
arXiv Detail & Related papers (2023-03-08T22:00:15Z)
Context Matters: A Strategy to Pre-train Language Model for Science Education [4.053049694533914]
BERT-based language models have shown significant superiority over traditional NLP models in various language-related tasks. The language used by students is different from the language in journals and Wikipedia, which are training sources of BERT. Our study confirms the effectiveness of continual pre-training on domain-specific data in the education domain.
arXiv Detail & Related papers (2023-01-27T23:50:16Z)
MOCHA: A Multi-Task Training Approach for Coherent Text Generation from Cognitive Perspective [22.69509556890676]
We propose a novel multi-task training strategy for coherent text generation grounded on the cognitive theory of writing. We extensively evaluate our model on three open-ended generation tasks including story generation, news article writing and argument generation.
arXiv Detail & Related papers (2022-10-26T11:55:41Z)
NaturalProver: Grounded Mathematical Proof Generation with Language Models [84.2064569475095]
Theorem proving in natural mathematical language plays a central role in mathematical advances and education. We develop NaturalProver, a language model that generates proofs by conditioning on background references. NaturalProver is capable of proving some theorems that require short (2-6 step) proofs, and providing next-step suggestions that are rated as correct and useful over 40% of the time.
arXiv Detail & Related papers (2022-05-25T17:01:18Z)
Towards Trustworthy AutoGrading of Short, Multi-lingual, Multi-type Answers [2.2000998828262652]
This study uses a large dataset consisting of about 10 million question-answer pairs from multiple languages. We show how to improve the accuracy of automatically graded answers, achieving accuracy equivalent to that of teaching assistants.
arXiv Detail & Related papers (2022-01-02T12:17:24Z)
Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression. Current methods assign a fixed weight to a teacher model in the whole distillation. Most of the existing methods allocate an equal weight to every teacher model. In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
Learning to Reweight with Deep Interactions [104.68509759134878]
We propose an improved data reweighting algorithm, in which the student model provides its internal states to the teacher model. Experiments on image classification with clean/noisy labels and neural machine translation empirically demonstrate that our algorithm makes significant improvement over previous methods.
arXiv Detail & Related papers (2020-07-09T09:06:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.