Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange
- URL: http://arxiv.org/abs/2404.00344v1
- Date: Sat, 30 Mar 2024 12:48:31 GMT
- Title: Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange
- Authors: Ankit Satpute, Noah Giessing, Andre Greiner-Petter, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Bela Gipp,
- Abstract summary: Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks.
This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving.
- Score: 25.419977967846144
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performances that surpass those of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this study, we adopted a two-step approach for investigating the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from the Math Stack Exchange (MSE). Second, a case analysis is conducted on the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We found that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) amongst existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ArqMATH3 Task1, considering P@10. Our Case analysis indicates that while the GPT-4 can generate relevant responses in certain instances, it does not consistently answer all questions accurately. This paper explores the current limitations of LLMs in navigating complex mathematical problem-solving. Through case analysis, we shed light on the gaps in LLM capabilities within mathematics, thereby setting the stage for future research and advancements in AI-driven mathematical reasoning. We make our code and findings publicly available for research: \url{https://github.com/gipplab/LLM-Investig-MathStackExchange}
Related papers
- HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics [1.5716764919736026]
We introduce HARDMath, a dataset featuring challenging applied mathematics problems that require analytical approximation techniques.
Our framework auto-generates a large number of problems with solutions validated against numerical ground truths.
We evaluate both open- and closed-source LLMs on HARDMath-mini, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts.
arXiv Detail & Related papers (2024-10-13T20:09:41Z) - MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs [61.74749961334557]
MathHay is an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs.
We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing models.
arXiv Detail & Related papers (2024-10-07T02:30:07Z) - AI-Assisted Generation of Difficult Math Questions [78.7547836422727]
Current training positions mathematical reasoning as a core capability.
There is unmet demand for diverse and challenging math questions.
We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach.
arXiv Detail & Related papers (2024-07-30T17:55:36Z) - SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading [100.02175403852253]
One common use of Large Language Models (LLMs) is performing tasks on scientific topics.
Inspired by the way university students are evaluated on such tasks, we propose SciEx - a benchmark consisting of university computer science exam questions.
We evaluate the performance of various state-of-the-art LLMs on our new benchmark.
arXiv Detail & Related papers (2024-06-14T21:52:21Z) - LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs [8.89259409245068]
Large language models (LLMs) demonstrate impressive capabilities in mathematical reasoning.
We present the Mathematical Topics Tree (MaTT) benchmark, a challenging and structured benchmark that offers 1,958 questions.
We find that the most advanced model, GPT-4, achieved a mere 54% accuracy in a multiple-choice scenario.
arXiv Detail & Related papers (2024-06-07T18:21:26Z) - MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models.
MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z) - ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline [42.61538071832468]
Large language models (LLMs) have shown excellent mastering of human language, but still struggle in real-world applications that require mathematical problem-solving.
We tailor the Self-Critique pipeline, which addresses the challenge in the feedback learning stage of LLM alignment.
arXiv Detail & Related papers (2024-04-03T17:51:18Z) - MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? [99.0305256706604]
We introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs.
We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources.
This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.
arXiv Detail & Related papers (2024-03-21T17:59:50Z) - GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.