MathPrompter: Mathematical Reasoning using Large Language Models
- URL: http://arxiv.org/abs/2303.05398v1
- Date: Sat, 4 Mar 2023 04:43:49 GMT
- Title: MathPrompter: Mathematical Reasoning using Large Language Models
- Authors: Shima Imani, Liang Du, Harsh Shrivastava
- Abstract summary: Large Language Models (LLMs) have limited performance when solving arithmetic reasoning tasks.
MathPrompter uses zero-shot chain-of-thought prompting to generate multiple algebraic expressions or Python functions that solve the same math problem in different ways.
- Score: 7.953723258038284
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) have limited performance when solving arithmetic reasoning tasks and often provide incorrect answers. Unlike natural language understanding, math problems typically have a single correct answer, which makes generating accurate solutions more challenging for LLMs. To the best of our knowledge, no LLM indicates its level of confidence in its responses, which fuels a trust deficit in these models and impedes their adoption. To address this deficiency, we propose `MathPrompter', a technique that improves the performance of LLMs on arithmetic problems while also increasing reliability of the predictions. MathPrompter uses zero-shot chain-of-thought prompting to generate multiple algebraic expressions or Python functions that solve the same math problem in different ways, thereby raising the confidence level in the output results. This is in contrast to other prompt-based CoT methods, where there is no check on the validity of the intermediate steps followed. Our technique improves over the state of the art on the MultiArith dataset ($78.7\%\rightarrow92.5\%$), evaluated using a 175B-parameter GPT-based LLM.
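The procedure described in the abstract (generate several independent solutions, cross-check them, and only then trust the answer) can be illustrated in a few lines of Python. This is a minimal sketch, not the authors' implementation: `llm_generate` is a hypothetical placeholder for a call to the underlying model (the paper prompts a 175B-parameter GPT-based LLM), and the template/values split stands in for the paper's step of replacing the question's numbers with variables.

```python
import random
import statistics

def llm_generate(prompt: str) -> str:
    # Hypothetical stand-in for the LLM call; plug in your own client here.
    raise NotImplementedError

def mathprompter(template: str, values: dict, n_samples: int = 5, n_checks: int = 5):
    """Sketch of the MathPrompter loop: ask for an algebraic expression and a
    Python function answering the same templated question, cross-check them on
    random variable assignments, then substitute the real values."""
    verified = []
    for _ in range(n_samples):
        expr = llm_generate(f"Algebraic expression answering: {template}")
        func_src = llm_generate(
            f"Python function solve({', '.join(values)}) answering: {template}"
        )
        scope: dict = {}
        exec(func_src, scope)  # defines scope["solve"]
        solve = scope["solve"]
        # Validity check on the intermediate steps: both solutions must agree
        # on several random assignments of the template variables.
        agree = all(
            eval(expr, dict(zip(values, rand))) == solve(*rand)
            for rand in ([random.randint(1, 50) for _ in values] for _ in range(n_checks))
        )
        if agree:
            verified.append(eval(expr, dict(values)))
    # Majority vote over the verified candidates; more agreement means more
    # confidence in the reported answer.
    return statistics.mode(verified) if verified else None
```

For example, a question like "each adult meal costs $5, a group of 15 people came in, and 8 were kids" would be templated with `values = {"A": 5, "B": 15, "C": 8}`; a generated expression such as `"A * (B - C)"` is checked against the generated function on random inputs before the answer 35 is reported.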
Related papers
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-time search method for large language models.
It formulates reasoning tasks as search problems and proposes two search ideas to identify optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- MathDivide: Improved mathematical reasoning by large language models [0.0]
We propose a prompting technique called MathDivide that breaks down the mathematical problem into simpler subproblems.
The results demonstrate that MathDivide significantly outperforms the leading prompting technique, MathPrompter.
arXiv Detail & Related papers (2024-05-12T20:21:15Z)
- Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [86.03285157412839]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks.
CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors and step-missing errors.
We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
arXiv Detail & Related papers (2024-04-23T12:16:05Z)
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? [99.0305256706604]
We introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs.
We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources.
This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.
arXiv Detail & Related papers (2024-03-21T17:59:50Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed phenomenon is that LLMs can behave incorrectly when the math questions are slightly changed.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs).
We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer.
We evaluate SatLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
arXiv Detail & Related papers (2023-05-16T17:55:51Z)
- PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PAL), which read natural language problems and generate programs as intermediate reasoning steps.
PAL offloads the solution step to a programmatic runtime such as a Python interpreter, as illustrated in the sketch below this entry.
We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
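In contrast to MathPrompter's self-checking, PAL's division of labor is simpler: the model writes the reasoning as a program and the interpreter computes the final answer. The following is a minimal sketch, again assuming a hypothetical `llm_generate` helper rather than PAL's actual release; SatLM follows the same pattern but emits a declarative specification that an automated theorem prover solves.

```python
def llm_generate(prompt: str) -> str:
    # Hypothetical stand-in for the LLM call.
    raise NotImplementedError

def pal_answer(question: str):
    """Sketch of the program-aided idea: the LLM writes the reasoning as code,
    and the Python runtime, not the model, performs the arithmetic."""
    program = llm_generate(
        "Write Python code that solves the problem step by step and stores "
        f"the final result in a variable named `answer`.\n\nProblem: {question}"
    )
    scope: dict = {}
    exec(program, scope)  # offload the solution step to the interpreter
    return scope["answer"]
```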
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.