Performance Comparison of Large Language Models on Advanced Calculus Problems
- URL: http://arxiv.org/abs/2503.03960v1
- Date: Wed, 05 Mar 2025 23:26:12 GMT
- Title: Performance Comparison of Large Language Models on Advanced Calculus Problems
- Authors: In Hak Moon
- Abstract summary: The study evaluates the accuracy, reliability, and problem-solving capabilities of seven models: ChatGPT 4o, Gemini Advanced with 1.5 Pro, Copilot Pro, Claude 3.5 Sonnet, Meta AI, Mistral AI, and Perplexity. The results highlight significant trends and patterns in the models' performance, revealing both their strengths and weaknesses.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an in-depth analysis of the performance of seven Large Language Models (LLMs) in solving a diverse set of advanced calculus problems. The study evaluates the accuracy, reliability, and problem-solving capabilities of ChatGPT 4o, Gemini Advanced with 1.5 Pro, Copilot Pro, Claude 3.5 Sonnet, Meta AI, Mistral AI, and Perplexity. The assessment was conducted through a series of thirty-two test problems worth a total of 320 points, covering topics from vector calculations and geometric interpretations to integral evaluations and optimization tasks. The results reveal significant trends and patterns in the models' performance, highlighting both strengths and weaknesses. For instance, ChatGPT 4o and Mistral AI demonstrated consistent accuracy across problem types, indicating robustness and reliability in mathematical problem-solving, while Gemini Advanced with 1.5 Pro and Meta AI exhibited specific weaknesses, particularly in complex problems involving integrals and optimization, suggesting areas for targeted improvement. The study also underscores the importance of re-prompting: in several instances, models initially provided incorrect answers but corrected them upon re-prompting. Overall, this research offers insight into the current capabilities and limitations of LLMs in calculus, and the detailed analysis of each model's performance on specific problems provides a comprehensive picture of their strengths and areas for improvement, contributing to the ongoing development and refinement of LLM technology. The findings are particularly relevant for educators, researchers, and developers seeking to leverage LLMs for educational and practical applications in mathematics.
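To make the setup concrete, here is a minimal sketch of the scoring protocol described in the abstract, assuming each of the thirty-two problems is worth 10 points (32 × 10 = 320) and that a single re-prompt is issued when a model's first answer is wrong. The `query_model` and `is_correct` helpers are hypothetical placeholders, not the paper's actual evaluation harness.

```python
# Illustrative sketch only: 32 problems worth 10 points each (320 total),
# with one re-prompt allowed when the first answer is incorrect.
# query_model() and is_correct() are hypothetical placeholders.
from typing import Callable

POINTS_PER_PROBLEM = 10  # 32 problems x 10 points = 320 points total


def evaluate_model(
    model_name: str,
    problems: list[str],
    reference_answers: list[str],
    query_model: Callable[[str, str], str],
    is_correct: Callable[[str, str], bool],
) -> dict:
    """Score one model, allowing a single re-prompt per problem."""
    score = 0
    corrected_on_reprompt = 0
    for problem, reference in zip(problems, reference_answers):
        answer = query_model(model_name, problem)
        if is_correct(answer, reference):
            score += POINTS_PER_PROBLEM
            continue
        # Re-prompt once when the first response is wrong, mirroring the
        # re-prompting behaviour discussed in the abstract.
        retry = query_model(model_name, "Please re-check your solution.\n" + problem)
        if is_correct(retry, reference):
            score += POINTS_PER_PROBLEM
            corrected_on_reprompt += 1
    return {
        "model": model_name,
        "score": score,
        "max_score": POINTS_PER_PROBLEM * len(problems),
        "corrected_on_reprompt": corrected_on_reprompt,
    }
```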
Related papers
- Benchmarking Large Language Models for Calculus Problem-Solving: A Comparative Analysis [0.0]
Five leading large language models (LLMs) were evaluated on their performance in solving calculus differentiation problems.
ChatGPT 4o achieved the highest success rate (94.71%), followed by Claude Pro (85.74%), Gemini Advanced (84.42%), Copilot Pro (76.30%), and Meta AI (56.75%).
arXiv Detail & Related papers (2025-03-31T00:39:40Z) - Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs.
OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z) - MathAgent: Leveraging a Mixture-of-Math-Agent Framework for Real-World Multimodal Mathematical Error Detection [53.325457460187046]
We introduce MathAgent, a novel Mixture-of-Math-Agent framework designed specifically to address these challenges.
MathAgent decomposes error detection into three phases, each handled by a specialized agent.
We evaluate MathAgent on real-world educational data, demonstrating approximately 5% higher accuracy in error step identification.
arXiv Detail & Related papers (2025-03-23T16:25:08Z) - Large Language Models and Mathematical Reasoning Failures [1.6114012813668932]
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems.
We rigorously analyze both final answers and solution steps to identify reasoning failures.
We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
arXiv Detail & Related papers (2025-02-17T09:07:32Z) - Improving Multimodal LLMs Ability In Geometry Problem Solving, Reasoning, And Multistep Scoring [34.37450586634531]
This paper presents GPSM4K, a comprehensive geometry multimodal dataset tailored to augment the problem-solving capabilities of Large Vision Language Models (LVLMs).
GPSM4K encompasses 2,157 multimodal question-answer pairs manually extracted from mathematics textbooks spanning grades 7-12.
This dataset serves as an excellent benchmark for assessing the geometric reasoning capabilities of LVLMs.
arXiv Detail & Related papers (2024-12-01T15:19:23Z) - Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark [53.61633384281524]
PolyMATH is a benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs.
The best scores achieved on PolyMATH are 41%, 36%, and 27%, obtained by Claude-3.5 Sonnet, GPT-4o, and Gemini-1.5 Pro, respectively.
A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning.
arXiv Detail & Related papers (2024-10-06T20:35:41Z) - ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate that significant challenges remain, as the best-performing model, GPT-4o, still lags around 10% behind human evaluation.
arXiv Detail & Related papers (2024-10-06T14:59:09Z) - Evaluation of OpenAI o1: Opportunities and Challenges of AGI [112.0812059747033]
o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance.
The model excelled in tasks requiring intricate reasoning and knowledge integration across various fields.
Overall results indicate significant progress towards artificial general intelligence.
arXiv Detail & Related papers (2024-09-27T06:57:00Z) - Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning [11.63133816413199]
Large Language Models (LLMs) have been applied to Math Word Problems (MWPs).
We introduce a novel dataset MWP-MISTAKE, incorporating MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models.
We highlight GPT-4o's superior performance in mistake detection and rectification and the persistent challenges faced by smaller models.
arXiv Detail & Related papers (2024-06-16T08:06:05Z) - MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification [41.53026834367054]
This paper introduces a novel benchmark, MM-MATH, for evaluating multimodal math reasoning.
MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification across difficulty, grade level, and knowledge points.
The best-performing model achieves only 31% accuracy on MM-MATH, compared to 82% for humans.
arXiv Detail & Related papers (2024-04-07T22:16:50Z) - GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving [40.46491587796371]
We introduce the GeoEval benchmark, a comprehensive collection that includes a main subset of 2,000 problems, a 750-problem subset focusing on backward reasoning, an augmented subset of 2,000 problems, and a hard subset of 300 problems.
Our evaluation of ten LLMs and MMs across these varied subsets reveals that the WizardMath model excels, achieving a 55.67% accuracy rate on the main subset but only a 6.00% accuracy on the hard subset.
arXiv Detail & Related papers (2024-02-15T16:59:41Z) - ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving [170.7899683843177]
ToRA is a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems.
ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales.
ToRA-Code-34B is the first open-source model that achieves an accuracy exceeding 50% on MATH.
arXiv Detail & Related papers (2023-09-29T17:59:38Z)