Related papers: MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring

URL: http://arxiv.org/abs/2510.23477v1
Date: Mon, 27 Oct 2025 16:11:49 GMT
Title: MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring
Authors: Tengchao Yang, Sichen Guo, Mengzhao Jia, Jiaming Su, Yuanyang Liu, Zhihan Zhang, Meng Jiang
Abstract summary: MMTutorBench is the first benchmark for AI math tutoring. It consists of 685 problems built around pedagogically significant key-steps. Each problem is paired with problem-specific rubrics that enable fine-grained evaluation across six dimensions.
Score: 20.95651273361851
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Effective math tutoring requires not only solving problems but also diagnosing students' difficulties and guiding them step by step. While multimodal large language models (MLLMs) show promise, existing benchmarks largely overlook these tutoring skills. We introduce MMTutorBench, the first benchmark for AI math tutoring, consisting of 685 problems built around pedagogically significant key-steps. Each problem is paired with problem-specific rubrics that enable fine-grained evaluation across six dimensions, and structured into three tasks: Insight Discovery, Operation Formulation, and Operation Execution. We evaluate 12 leading MLLMs and find clear performance gaps between proprietary and open-source systems, substantial room compared to human tutors, and consistent trends across input variants: OCR pipelines degrade tutoring quality, few-shot prompting yields limited gains, and our rubric-based LLM-as-a-Judge proves highly reliable. These results highlight both the difficulty and diagnostic value of MMTutorBench for advancing AI tutoring.

Related papers

TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models [10.963195858672627]
TutorBench is a dataset and evaluation benchmark designed to rigorously evaluate the core tutoring skills of large language models (LLMs). Samples are drawn from three common tutoring tasks: (i) generating adaptive explanations tailored to a student's confusion, (ii) providing actionable feedback on a student's work, and (iii) promoting active learning through effective hint generation. We evaluate 16 frontier LLMs on TutorBench and present a detailed analysis of their performance and behavior.
arXiv Detail & Related papers (2025-10-03T01:41:09Z)
Mathematical Computation and Reasoning Errors by Large Language Models [3.0309252269809264]
Large Language Models (LLMs) are increasingly utilized in AI-driven educational instruction and assessment. This study focuses on evaluating the accuracy of four LLMs solving three categories of math tasks, including arithmetic, algebra, and number theory. It is observed that the reasoning-enhanced OpenAI o1 model consistently achieved higher or nearly perfect accuracy across all three math task categories.
arXiv Detail & Related papers (2025-08-13T16:33:02Z)
From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning [82.50157695987558]
Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy. We propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors.
arXiv Detail & Related papers (2025-05-21T15:00:07Z)
Is your multimodal large language model a good science tutor? [14.505855717011725]
Multimodal large language models (MLLMs) demonstrate impressive performance on scientific reasoning tasks. We propose a framework that evaluates MLLMs as science tutors using a comprehensive educational rubric and a simulated student model.
arXiv Detail & Related papers (2025-05-09T20:38:23Z)
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
MathAgent: Leveraging a Mixture-of-Math-Agent Framework for Real-World Multimodal Mathematical Error Detection [53.325457460187046]
We introduce MathAgent, a novel Mixture-of-Math-Agent framework designed specifically to address these challenges. MathAgent decomposes error detection into three phases, each handled by a specialized agent. We evaluate MathAgent on real-world educational data, demonstrating approximately 5% higher accuracy in error step identification.
arXiv Detail & Related papers (2025-03-23T16:25:08Z)
Why Do Multi-Agent LLM Systems Fail? [91.39266556855513]
We present MAST (Multi-Agent System Failure taxonomy), the first empirically grounded taxonomy designed to understand MAS failures. We analyze seven popular MAS frameworks across over 200 tasks, involving six expert human annotators. We identify 14 unique failure modes, organized into 3 overarching categories: (i) specification issues, (ii) inter-agent misalignment, and (iii) task verification.
arXiv Detail & Related papers (2025-03-17T19:04:38Z)
Beyond Final Answers: Evaluating Large Language Models for Math Tutoring [0.24197860834245388]
We present two approaches to evaluate the correctness and quality of Large Language Models (LLMs) in math tutoring contexts. The first approach uses an intelligent tutoring system for college algebra as a testbed to assess LLM problem-solving capabilities. The second approach evaluates LLMs as tutors rather than problem solvers.
arXiv Detail & Related papers (2025-02-23T15:43:45Z)
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection. ErrorRadar evaluates two sub-tasks: error step identification and error categorization. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions. Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation.
arXiv Detail & Related papers (2024-10-06T14:59:09Z)
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models. It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks. CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors. We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
arXiv Detail & Related papers (2024-04-23T12:16:05Z)