Benchmarking Large Language Models for Calculus Problem-Solving: A Comparative Analysis
- URL: http://arxiv.org/abs/2504.13187v1
- Date: Mon, 31 Mar 2025 00:39:40 GMT
- Title: Benchmarking Large Language Models for Calculus Problem-Solving: A Comparative Analysis
- Authors: In Hak Moon
- Abstract summary: Five leading large language models (LLMs) were evaluated on their performance in solving calculus differentiation problems. ChatGPT 4o achieved the highest success rate (94.71%), followed by Claude Pro (85.74%), Gemini Advanced (84.42%), Copilot Pro (76.30%), and Meta AI (56.75%).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study presents a comprehensive evaluation of five leading large language models (LLMs) - ChatGPT 4o, Copilot Pro, Gemini Advanced, Claude Pro, and Meta AI - on their performance in solving calculus differentiation problems. The investigation assessed these models across 13 fundamental problem types, employing a systematic cross-evaluation framework where each model solved problems generated by all models. Results revealed significant performance disparities, with ChatGPT 4o achieving the highest success rate (94.71%), followed by Claude Pro (85.74%), Gemini Advanced (84.42%), Copilot Pro (76.30%), and Meta AI (56.75%). All models excelled at procedural differentiation tasks but showed varying limitations with conceptual understanding and algebraic manipulation. Notably, problems involving increasing/decreasing intervals and optimization word problems proved most challenging across all models. The cross-evaluation matrix revealed that Claude Pro generated the most difficult problems, suggesting that problem generation and problem solving draw on distinct capabilities. These findings have significant implications for educational applications, highlighting both the potential and limitations of LLMs as calculus learning tools. While they demonstrate impressive procedural capabilities, their conceptual understanding remains limited compared to human mathematical reasoning, emphasizing the continued importance of human instruction for developing deeper mathematical comprehension.
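To make the cross-evaluation framework concrete, the sketch below shows how a solver-by-generator success-rate matrix of the kind described in the abstract could be assembled. The `generate` and `solve` callables are hypothetical stand-ins for calls to each chatbot (the paper does not publish code); this is an illustrative sketch of the evaluation design, not the authors' implementation.

```python
from typing import Callable, Dict, List

# Hypothetical interface: `generate` and `solve` are illustrative stand-ins
# for prompting each chatbot, not the paper's actual tooling.
MODELS = ["ChatGPT 4o", "Copilot Pro", "Gemini Advanced", "Claude Pro", "Meta AI"]

def cross_evaluation_matrix(
    generate: Callable[[str], List[str]],   # generator model -> its problem set
    solve: Callable[[str, str], bool],      # (solver model, problem) -> solved correctly?
) -> Dict[str, Dict[str, float]]:
    """Success rate of each solver on the problems written by each generator."""
    problems_by_generator = {g: generate(g) for g in MODELS}
    matrix: Dict[str, Dict[str, float]] = {}
    for solver in MODELS:
        matrix[solver] = {}
        for generator, problems in problems_by_generator.items():
            solved = sum(solve(solver, p) for p in problems)
            matrix[solver][generator] = solved / len(problems) if problems else 0.0
    return matrix
```

In such a matrix, a row average corresponds to a model's overall success rate (e.g., the 94.71% reported for ChatGPT 4o), while a low column average flags a generator whose problems are hard for every solver, which is how Claude Pro is identified as producing the most difficult problem set.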
Related papers
- Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead [33.011660907969706]
Inference-time scaling can enhance the reasoning capabilities of large language models.
We investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks.
arXiv Detail & Related papers (2025-03-31T23:40:28Z)
- Performance Comparison of Large Language Models on Advanced Calculus Problems [0.0]
The study evaluates the accuracy, reliability, and problem-solving capabilities of models including ChatGPT 4o, Gemini Advanced with 1.5 Pro, Copilot Pro, Claude 3.5 Sonnet, Meta AI, Mistral AI, and Perplexity.
The results highlight significant trends and patterns in the models' performance, revealing both their strengths and weaknesses.
arXiv Detail & Related papers (2025-03-05T23:26:12Z)
- Large Language Models and Mathematical Reasoning Failures [1.6114012813668932]
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems.
We rigorously analyze both final answers and solution steps to identify reasoning failures.
We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
arXiv Detail & Related papers (2025-02-17T09:07:32Z)
- MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models.
We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
- ProcessBench: Identifying Process Errors in Mathematical Reasoning [62.80402845414901]
We introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning.
ProcessBench consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems.
We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models.
arXiv Detail & Related papers (2024-12-09T15:11:40Z)
- LLM4DS: Evaluating Large Language Models for Data Science Code Generation [0.0]
This paper empirically assesses the performance of four leading AI assistants on data science code generation tasks: Microsoft Copilot (GPT-4 Turbo), ChatGPT (o1-preview), Claude (3.5 Sonnet), and Perplexity Labs (Llama-3.1-70b-instruct).
All models exceeded a 50% success rate, confirming their capability beyond random chance.
ChatGPT demonstrated consistent performance across varying difficulty levels, while Claude's success rate fluctuated with task complexity.
arXiv Detail & Related papers (2024-11-16T18:43:26Z)
- Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths [69.39559168050923]
We introduce Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths.
Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model's overall problem-solving performance.
We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions.
arXiv Detail & Related papers (2024-10-07T06:37:25Z)
- ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation.
arXiv Detail & Related papers (2024-10-06T14:59:09Z)
- MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification [41.53026834367054]
This paper introduces a novel benchmark, MM-MATH, for evaluating multimodal math reasoning.
MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification across difficulty, grade level, and knowledge points.
The best-performing model achieves only 31% accuracy on MM-MATH, compared to 82% for humans.
arXiv Detail & Related papers (2024-04-07T22:16:50Z)
- Examining the Potential and Pitfalls of ChatGPT in Science and Engineering Problem-Solving [1.3628066756509705]
The study explores the capabilities of OpenAI's ChatGPT in solving different types of physics problems.
ChatGPT successfully solved 62.5% of the well-specified problems, but its accuracy dropped to 8.3% on under-specified problems.
arXiv Detail & Related papers (2023-10-12T23:39:28Z)
- ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving [170.7899683843177]
ToRA is a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems.
ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales.
ToRA-Code-34B is the first open-source model that achieves an accuracy exceeding 50% on MATH.
arXiv Detail & Related papers (2023-09-29T17:59:38Z)
- Let's Verify Step by Step [73.58107073356732]
We show that process supervision significantly outperforms outcome supervision for training models to solve problems.
Our model solves 78% of problems from a representative subset of the MATH test set.
We also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
arXiv Detail & Related papers (2023-05-31T17:24:00Z)
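For the last entry above ("Let's Verify Step by Step"), the sketch below illustrates a common way a process reward model is used at inference time: each step of a candidate solution receives a correctness probability, the product of those probabilities scores the solution, and the highest-scoring candidate is selected (best-of-N reranking). The `step_prob` callable is a hypothetical stand-in for a trained PRM; this is an illustration of the technique, not the paper's released code.

```python
import math
from typing import Callable, List

def solution_score(steps: List[str], step_prob: Callable[[str], float]) -> float:
    """Score a stepwise solution as the product of per-step correctness
    probabilities, computed in log space for numerical stability."""
    return math.exp(sum(math.log(max(step_prob(step), 1e-12)) for step in steps))

def best_of_n(candidates: List[List[str]], step_prob: Callable[[str], float]) -> List[str]:
    """Return the candidate solution ranked highest by the process reward model."""
    return max(candidates, key=lambda steps: solution_score(steps, step_prob))
```

Step-level labels such as those in PRM800K are the kind of supervision a scorer like `step_prob` would be trained on.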