How well do Large Language Models perform in Arithmetic tasks?
- URL: http://arxiv.org/abs/2304.02015v1
- Date: Thu, 16 Mar 2023 09:28:15 GMT
- Title: How well do Large Language Models perform in Arithmetic tasks?
- Authors: Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang
- Abstract summary: Large language models have emerged abilities including chain-of-thought to answer math word problems step by step.
To the best of our knowledge, there is no work to focus on evaluating the arithmetic ability of large language models.
In this work, we propose an arithmetic dataset MATH 401 to test the latest large language models.
- Score: 25.638682874990206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have emerged abilities including chain-of-thought to
answer math word problems step by step. Solving math word problems not only
requires abilities to disassemble problems via chain-of-thought but also needs
to calculate arithmetic expressions correctly for each step. To the best of our
knowledge, there is no work to focus on evaluating the arithmetic ability of
large language models. In this work, we propose an arithmetic dataset MATH 401
to test the latest large language models including GPT-4, ChatGPT, InstrctGPT,
Galactica, and LLaMA with various arithmetic expressions and provide a detailed
analysis of the ability of large language models. MATH 401 and evaluation codes
are released at \url{https://github.com/GanjinZero/math401-llm}.
Related papers
- Lean Workbook: A large-scale Lean problem set formalized from natural language math problems [50.22847430754973]
Large language models are not good at math theorem proving using formal languages like Lean.
A significant challenge in this area is the scarcity of training data available in these formal languages.
We propose a novel pipeline that iteratively generates and filters synthetic data to translate natural language mathematical problems into Lean 4 statements.
arXiv Detail & Related papers (2024-06-06T08:25:43Z) - OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step [7.7168728919692855]
We propose a framework that enables exact arithmetic in a single autoregressive step.
We use the hidden states of a LLM to control a symbolic architecture that performs arithmetic.
Our implementation using Llama 3 with OccamNet as a symbolic model (OccamLlama) achieves 100% accuracy on single arithmetic operations.
arXiv Detail & Related papers (2024-06-04T04:17:40Z) - MathScale: Scaling Instruction Tuning for Mathematical Reasoning [70.89605383298331]
Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving.
However, their proficiency in solving mathematical problems remains inadequate.
We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data.
arXiv Detail & Related papers (2024-03-05T11:42:59Z) - MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical
Reasoning [52.97768001837269]
We present a method to fine-tune open-source language models, enabling them to use code for modeling and deriving math equations.
We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions.
This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems.
arXiv Detail & Related papers (2023-10-05T17:52:09Z) - MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models [91.66694225955872]
We propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning.
Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge.
We release all the MetaMathQA dataset, the MetaMath models with different model sizes and the training code for public use.
arXiv Detail & Related papers (2023-09-21T17:45:42Z) - GPT Can Solve Mathematical Problems Without a Calculator [24.114064917059565]
We show that a large language model can accurately perform arithmetic operations with almost 100% accuracy without data leakage.
We also demonstrate that our MathGLM, fine-tuned from GLM-10B, achieves similar performance to GPT-4 on a 5,000-samples Chinese math problem test set.
arXiv Detail & Related papers (2023-09-06T06:18:16Z) - WizardMath: Empowering Mathematical Reasoning for Large Language Models
via Reinforced Evol-Instruct [128.89645483139236]
We present WizardMath, which enhances the mathematical reasoning abilities of Llama-2, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math.
Our model even surpasses ChatGPT-3.5, Claude Instant-1, PaLM-2 and Minerva on GSM8k, simultaneously surpasses Text-davinci, PaLM-1 and GPT-3 on MATH.
arXiv Detail & Related papers (2023-08-18T14:23:21Z) - MathPrompter: Mathematical Reasoning using Large Language Models [7.953723258038284]
Large Language Models (LLMs) have limited performance when solving arithmetic reasoning tasks.
MathPrompter uses the Zero-shot chain-of-thought prompting technique to generate multiple Algebraic expressions or Python functions to solve the same math problem in different ways.
arXiv Detail & Related papers (2023-03-04T04:43:49Z) - PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PaL) to understand natural language problems.
PaL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.