Related papers: Beyond Final Answers: Evaluating Large Language Models for Math Tutoring

Beyond Final Answers: Evaluating Large Language Models for Math Tutoring

URL: http://arxiv.org/abs/2503.16460v1
Date: Sun, 23 Feb 2025 15:43:45 GMT
Title: Beyond Final Answers: Evaluating Large Language Models for Math Tutoring
Authors: Adit Gupta, Jennifer Reddig, Tommaso Calo, Daniel Weitekamp, Christopher J. MacLellan,
Abstract summary: We present two approaches to evaluate the correctness and quality of Large Language Models (LLMs) in math tutoring contexts.<n>The first approach uses an intelligent tutoring system for college algebra as a testbed to assess LLM problem-solving capabilities.<n>The second approach evaluates LLM as tutors rather than problem solvers.
Score: 0.24197860834245388
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Researchers have made notable progress in applying Large Language Models (LLMs) to solve math problems, as demonstrated through efforts like GSM8k, ProofNet, AlphaGeometry, and MathOdyssey. This progress has sparked interest in their potential use for tutoring students in mathematics. However, the reliability of LLMs in tutoring contexts -- where correctness and instructional quality are crucial -- remains underexplored. Moreover, LLM problem-solving capabilities may not necessarily translate into effective tutoring support for students. In this work, we present two novel approaches to evaluate the correctness and quality of LLMs in math tutoring contexts. The first approach uses an intelligent tutoring system for college algebra as a testbed to assess LLM problem-solving capabilities. We generate benchmark problems using the tutor, prompt a diverse set of LLMs to solve them, and compare the solutions to those generated by the tutor. The second approach evaluates LLM as tutors rather than problem solvers. We employ human evaluators, who act as students seeking tutoring support from each LLM. We then assess the quality and correctness of the support provided by the LLMs via a qualitative coding process. We applied these methods to evaluate several ChatGPT models, including 3.5 Turbo, 4, 4o, o1-mini, and o1-preview. Our findings show that when used as problem solvers, LLMs generate correct final answers for 85.5% of the college algebra problems tested. When employed interactively as tutors, 90% of LLM dialogues show high-quality instructional support; however, many contain errors -- only 56.6% are entirely correct. We conclude that, despite their potential, LLMs are not yet suitable as intelligent tutors for math without human oversight or additional mechanisms to ensure correctness and quality.

Related papers

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs [2.2330469342127577]
We introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials.<n>It is balanced across six core subjects, with 20% of multimodal problems.<n>Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions.<n>Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks, with even lower 45% on visual problems.
arXiv Detail & Related papers (2024-12-04T10:44:50Z)
Embracing AI in Education: Understanding the Surge in Large Language Model Use by Secondary Students [53.20318273452059]
Large language models (LLMs) like OpenAI's ChatGPT have opened up new avenues in education.<n>Despite school restrictions, our survey of over 300 middle and high school students revealed that a remarkable 70% of students have utilized LLMs.<n>We propose a few ideas to address such issues, including subject-specific models, personalized learning, and AI classrooms.
arXiv Detail & Related papers (2024-11-27T19:19:34Z)
Towards the Pedagogical Steering of Large Language Models for Tutoring: A Case Study with Modeling Productive Failure [36.83786872708736]
One-to-one tutoring is one of the most efficient methods of teaching. We develop StratL, an algorithm to optimize LLM prompts and steer it to follow a predefined multi-turn tutoring plan represented as a transition graph. As a case study, we create a prototype tutor for high school math following Productive Failure (PF), an advanced and effective learning design.
arXiv Detail & Related papers (2024-10-03T16:15:41Z)
Not All LLM Reasoners Are Created Equal [58.236453890457476]
We study the depth of grade-school math problem-solving capabilities of LLMs. We evaluate their performance on pairs of existing math word problems together.
arXiv Detail & Related papers (2024-10-02T17:01:10Z)
Exploring Knowledge Tracing in Tutor-Student Dialogues using LLMs [49.18567856499736]
We investigate whether large language models (LLMs) can be supportive of open-ended dialogue tutoring.<n>We apply a range of knowledge tracing (KT) methods on the resulting labeled data to track student knowledge levels over an entire dialogue.<n>We conduct experiments on two tutoring dialogue datasets, and show that a novel yet simple LLM-based method, LLMKT, significantly outperforms existing KT methods in predicting student response correctness in dialogues.
arXiv Detail & Related papers (2024-09-24T22:31:39Z)
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? [99.0305256706604]
We introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.
arXiv Detail & Related papers (2024-03-21T17:59:50Z)
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly. This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
Adversarial Math Word Problem Generation [6.92510069380188]
We propose a new paradigm for ensuring fair evaluation of large language models (LLMs) We generate adversarial examples which preserve the structure and difficulty of the original questions aimed for assessment, but are unsolvable by LLMs. We conduct experiments on various open- and closed-source LLMs, quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability.
arXiv Detail & Related papers (2024-02-27T22:07:52Z)
Novice Learner and Expert Tutor: Evaluating Math Reasoning Abilities of Large Language Models with Misconceptions [28.759189115877028]
We propose novel evaluations for mathematical reasoning capabilities of Large Language Models (LLMs) based on mathematical misconceptions. Our primary approach is to simulate LLMs as a novice learner and an expert tutor, aiming to identify the incorrect answer to math question resulted from a specific misconception.
arXiv Detail & Related papers (2023-10-03T21:19:50Z)
Evaluating Language Models for Mathematics through Interactions [116.67206980096513]
We introduce CheckMate, a prototype platform for humans to interact with and evaluate large language models (LLMs) We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics. We derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness.
arXiv Detail & Related papers (2023-06-02T17:12:25Z)
MathPrompter: Mathematical Reasoning using Large Language Models [7.953723258038284]
Large Language Models (LLMs) have limited performance when solving arithmetic reasoning tasks. MathPrompter uses the Zero-shot chain-of-thought prompting technique to generate multiple Algebraic expressions or Python functions to solve the same math problem in different ways.
arXiv Detail & Related papers (2023-03-04T04:43:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.