Chat-Based Support Alone May Not Be Enough: Comparing Conversational and Embedded LLM Feedback for Mathematical Proof Learning
- URL: http://arxiv.org/abs/2602.18807v1
- Date: Sat, 21 Feb 2026 11:52:25 GMT
- Title: Chat-Based Support Alone May Not Be Enough: Comparing Conversational and Embedded LLM Feedback for Mathematical Proof Learning
- Authors: Eason Chen, Sophia Judicke, Kayla Beigh, Xinyi Tang, Isabel Wang, Nina Yuan, Zimo Xiao, Chuangji Li, Shizhuo Li, Reed Luttmer, Shreya Singh, Maria Yampolsky, Naman Parikh, Yvonne Zhao, Meiyi Chen, Scarlett Huang, Anishka Mohanty, Gregory Johnson, John Mackey, Jionghao Lin, Ken Koedinger
- Abstract summary: GPTutor is an LLM-powered tutoring system for an undergraduate discrete mathematics course. It integrates two tools: a structured proof-review tool that provides embedded feedback on students' written proof attempts, and a chatbot for math questions. In a staggered-access study with 148 students, earlier access was associated with higher homework performance during the interval when only the experimental group could use the system. Usage logs show that students with lower self-efficacy and prior exam performance used both components more frequently.
- Score: 4.7092577379077
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We evaluate GPTutor, an LLM-powered tutoring system for an undergraduate discrete mathematics course. It integrates two LLM-supported tools: a structured proof-review tool that provides embedded feedback on students' written proof attempts, and a chatbot for math questions. In a staggered-access study with 148 students, earlier access was associated with higher homework performance during the interval when only the experimental group could use the system, while we did not observe this performance increase transfer to exam scores. Usage logs show that students with lower self-efficacy and prior exam performance used both components more frequently. Session-level behavioral labels, produced by human coding and scaled using an automated classifier, characterize how students engaged with the chatbot (e.g., answer-seeking or help-seeking). In models controlling for prior performance and self-efficacy, higher chatbot usage and answer-seeking behavior were negatively associated with subsequent midterm performance, whereas proof-review usage showed no detectable independent association. Together, the findings suggest that chatbot-based support alone may not reliably support transfer to independent assessment of math proof-learning outcomes, whereas work-anchored, structured feedback appears less associated with reduced learning.
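The abstract's claim that "higher chatbot usage and answer-seeking behavior were negatively associated with subsequent midterm performance" in models "controlling for prior performance and self-efficacy" describes a standard covariate-adjusted regression. A minimal sketch of that analysis shape, on entirely synthetic data (all variable names, effect sizes, and distributions here are illustrative assumptions, not the paper's actual data or model):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 148  # matches the study's sample size; everything else is synthetic

# Synthetic covariates and outcome (illustrative only)
prior = rng.normal(70.0, 10.0, n)      # prior exam performance
self_eff = rng.normal(0.0, 1.0, n)     # self-efficacy scale
# Weaker students use the tool more, mirroring the reported usage pattern
usage = np.clip(
    5.0 - 0.05 * (prior - 70.0) - 0.5 * self_eff + rng.normal(0.0, 1.0, n),
    0.0, None,
)
# Midterm score with an assumed negative usage effect of -1.5
midterm = 0.8 * prior + 2.0 * self_eff - 1.5 * usage + rng.normal(0.0, 5.0, n)

# OLS with usage plus the two controls: the usage coefficient is the
# "independent association" after adjusting for prior performance
# and self-efficacy.
X = np.column_stack([np.ones(n), usage, prior, self_eff])
beta, *_ = np.linalg.lstsq(X, midterm, rcond=None)
print(beta[1])  # estimated usage coefficient; negative on this synthetic data
```

The key design point is that the usage coefficient is only interpretable after the controls are included; without them, usage would partly proxy for low prior performance, inflating the apparent negative effect.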
Related papers
- Automated Feedback Generation for Undergraduate Mathematics: Development and Evaluation of an AI Teaching Assistant [0.0]
We present a system that processes free-form natural language input, handles a wide range of edge cases, and comments on the technical correctness of submitted proofs. We show that by the metrics we evaluate, the quality of the feedback generated is comparable to that produced by human experts. A version of our tool is deployed on the Imperial mathematics homework platform Lambda.
arXiv Detail & Related papers (2026-01-06T23:02:22Z) - Evaluating the Effectiveness of Large Language Models in Solving Simple Programming Tasks: A User-Centered Study [1.0467092641687232]
This study investigates how different interaction styles with ChatGPT-4o affect user performance on simple programming tasks. I conducted a within-subjects experiment where fifteen high school students completed three problems under three distinct versions of the model.
arXiv Detail & Related papers (2025-07-05T13:52:31Z) - Can Large Language Models Help Students Prove Software Correctness? An Experimental Study with Dafny [75.55915044740566]
Students in computing education increasingly use large language models (LLMs) such as ChatGPT. This paper investigates how students interact with an LLM when solving formal verification exercises in Dafny.
arXiv Detail & Related papers (2025-06-27T16:34:13Z) - PyEvalAI: AI-assisted evaluation of Jupyter Notebooks for immediate personalized feedback [43.56788158589046]
PyEvalAI scores Jupyter notebooks using a combination of unit tests and a locally hosted language model to preserve privacy. A case study demonstrates its effectiveness in improving feedback speed and grading efficiency for exercises in a university-level course on numerics.
arXiv Detail & Related papers (2025-02-25T18:20:20Z) - "My Grade is Wrong!": A Contestable AI Framework for Interactive Feedback in Evaluating Student Essays [6.810086342993699]
This paper introduces CAELF, a Contestable AI Empowered LLM Framework for automating interactive feedback.
CAELF allows students to query, challenge, and clarify their feedback by integrating a multi-agent system with computational argumentation.
A case study on 500 critical thinking essays with user studies demonstrates that CAELF significantly improves interactive feedback.
arXiv Detail & Related papers (2024-09-11T17:59:01Z) - Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals [91.59906995214209]
We propose a new evaluation method, the Counterfactual Attentiveness Test (CAT).
CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction.
We show that GPT-3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves.
arXiv Detail & Related papers (2023-11-16T06:27:35Z) - Empowering Private Tutoring by Chaining Large Language Models [87.76985829144834]
This work explores the development of a full-fledged intelligent tutoring system powered by state-of-the-art large language models (LLMs).
The system is divided into three interconnected core processes: interaction, reflection, and reaction.
Each process is implemented by chaining LLM-powered tools along with dynamically updated memory modules.
arXiv Detail & Related papers (2023-09-15T02:42:03Z) - Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z) - Evaluating Language Models for Mathematics through Interactions [116.67206980096513]
We introduce CheckMate, a prototype platform for humans to interact with and evaluate large language models (LLMs).
We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics.
We derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness.
arXiv Detail & Related papers (2023-06-02T17:12:25Z) - Distilling ChatGPT for Explainable Automated Student Answer Assessment [19.604476650824516]
We introduce a novel framework that explores using ChatGPT, a cutting-edge large language model, for the concurrent tasks of student answer scoring and rationale generation.
Our experiments show that the proposed method improves the overall QWK score by 11% compared to ChatGPT.
arXiv Detail & Related papers (2023-05-22T12:11:39Z) - Active Teacher for Semi-Supervised Object Detection [80.10937030195228]
We propose a novel algorithm called Active Teacher for semi-supervised object detection (SSOD).
Active Teacher extends the teacher-student framework to an iterative version, where the label set is partially and gradually augmented by evaluating three key factors of unlabeled examples.
With this design, Active Teacher can maximize the effect of limited label information while improving the quality of pseudo-labels.
arXiv Detail & Related papers (2023-03-15T03:59:27Z) - Plagiarism deterrence for introductory programming [11.612194979331179]
A class-wide statistical characterization can be clearly shared with students via an intuitive new p-value.
A pairwise, compression-based similarity detection algorithm captures relationships between assignments more accurately.
An unbiased scoring system aids students and the instructor in understanding true independence of effort.
arXiv Detail & Related papers (2022-06-06T18:47:25Z)
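The "pairwise, compression-based similarity detection" mentioned in the last entry can be illustrated with the classic normalized compression distance (NCD); this is a minimal sketch of the general technique, not necessarily the paper's exact algorithm, and the example submissions below are hypothetical:

```python
import zlib

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance: near 0 for near-identical
    inputs, closer to 1 for unrelated ones. If b largely repeats a,
    compressing the concatenation costs little more than compressing
    the larger input alone."""
    ca = len(zlib.compress(a, 9))
    cb = len(zlib.compress(b, 9))
    cab = len(zlib.compress(a + b, 9))
    return (cab - min(ca, cb)) / max(ca, cb)

# Hypothetical student submissions
original = b"def mean(xs):\n    return sum(xs) / len(xs)\n" * 5
copied = original.replace(b"xs", b"values")  # lightly renamed copy
unrelated = b"import os\nfor f in os.listdir('.'):\n    print(f)\n" * 5

# The renamed copy sits much closer to the original than unrelated code
print(ncd(original, copied) < ncd(original, unrelated))  # True
```

Because NCD needs no tokenizer or language model of the submission, it captures structural similarity that survives superficial edits such as renaming, which is what makes compression-based detectors attractive for plagiarism screening.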