Mathematical Capabilities of ChatGPT
- URL: http://arxiv.org/abs/2301.13867v2
- Date: Thu, 20 Jul 2023 17:59:14 GMT
- Title: Mathematical Capabilities of ChatGPT
- Authors: Simon Frieder, Luca Pinchetti, Alexis Chevalier, Ryan-Rhys Griffiths,
Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Julius
Berner
- Abstract summary: We release two new datasets: GHOSTS and miniGHOSTS.
These are the first natural-language datasets curated by working researchers in mathematics.
We benchmark the models on a range of fine-grained performance metrics.
- Score: 35.71603158908465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the mathematical capabilities of two iterations of ChatGPT
(released 9-January-2023 and 30-January-2023) and of GPT-4 by testing them on
publicly available datasets, as well as hand-crafted ones, using a novel
methodology. In contrast to formal mathematics, where large databases of formal
proofs are available (e.g., the Lean Mathematical Library), current datasets of
natural-language mathematics, used to benchmark language models, either cover
only elementary mathematics or are very small. We address this by publicly
releasing two new datasets: GHOSTS and miniGHOSTS. These are the first
natural-language datasets curated by working researchers in mathematics that
(1) aim to cover graduate-level mathematics, (2) provide a holistic overview of
the mathematical capabilities of language models, and (3) distinguish multiple
dimensions of mathematical reasoning. These datasets also test whether ChatGPT
and GPT-4 can be helpful assistants to professional mathematicians by emulating
use cases that arise in the daily professional activities of mathematicians. We
benchmark the models on a range of fine-grained performance metrics. For
advanced mathematics, this is the most detailed evaluation effort to date. We
find that ChatGPT can be used most successfully as a mathematical assistant for
querying facts, acting as a mathematical search engine and knowledge base
interface. GPT-4 can additionally be used for undergraduate-level mathematics
but fails on graduate-level difficulty. Contrary to many positive reports in
the media about GPT-4 and ChatGPT's exam-solving abilities (a potential case of
selection bias), their overall mathematical performance is well below the level
of a graduate student. Hence, if your goal is to use ChatGPT to pass a
graduate-level math exam, you would be better off copying from your average
peer!
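To make the benchmarking setup concrete, below is a minimal, hypothetical sketch of a GHOSTS-style evaluation loop. The file layout, field names, and helper functions (`query_model`, `rate_answer`) are assumptions for illustration, not the authors' released pipeline, and the paper's fine-grained rubric is richer than the single averaged score shown here.

```python
# Hypothetical sketch of a GHOSTS-style evaluation loop (not the authors'
# released code): each prompt is sent to the model, the answer is graded,
# and scores are aggregated per subdataset.
import json
from collections import defaultdict
from statistics import mean

def query_model(prompt: str) -> str:
    """Placeholder for an API call to ChatGPT / GPT-4."""
    raise NotImplementedError

def rate_answer(answer: str, reference: str) -> int:
    """Placeholder for the grading step (low = nonsense, high = correct)."""
    raise NotImplementedError

def evaluate(path: str) -> dict[str, float]:
    # Assumed file layout: one JSON object per line with
    # "subdataset", "prompt", and "reference" fields.
    scores: dict[str, list[int]] = defaultdict(list)
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            answer = query_model(item["prompt"])
            scores[item["subdataset"]].append(rate_answer(answer, item["reference"]))
    return {name: mean(vals) for name, vals in scores.items()}
```

Aggregating per subdataset mirrors the paper's design of probing multiple dimensions of mathematical reasoning separately rather than reporting one global number.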
Related papers
- MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models.
MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z)
- MathScale: Scaling Instruction Tuning for Mathematical Reasoning [70.89605383298331]
Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving.
However, their proficiency in solving mathematical problems remains inadequate.
We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data.
arXiv Detail & Related papers (2024-03-05T11:42:59Z)
- MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning [2.9104279358536647]
We present MathSensei, a tool-augmented large language model for mathematical reasoning.
We study the complementary benefits of the tools: a knowledge retriever (Bing Web Search), a program generator + executor (Python), and a symbolic equation solver (the Wolfram-Alpha API); a toy dispatch sketch appears after this list.
arXiv Detail & Related papers (2024-02-27T05:50:35Z)
- InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning [98.53491178426492]
We open-source our math reasoning LLMs InternLM-Math, which are continually pre-trained from InternLM2.
We unify chain-of-thought reasoning, reward modeling, formal reasoning, data augmentation, and a code interpreter in a single seq2seq format.
Our pre-trained model achieves 30.3 on the MiniF2F test set without fine-tuning.
arXiv Detail & Related papers (2024-02-09T11:22:08Z)
- MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning [52.97768001837269]
We present a method to fine-tune open-source language models, enabling them to use code for modeling and deriving math equations.
We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions.
This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems.
arXiv Detail & Related papers (2023-10-05T17:52:09Z)
- WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct [128.89645483139236]
We present WizardMath, which enhances the mathematical reasoning abilities of Llama-2 by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math.
Our model even surpasses ChatGPT-3.5, Claude Instant-1, PaLM-2, and Minerva on GSM8k, and simultaneously surpasses Text-davinci, PaLM-1, and GPT-3 on MATH.
arXiv Detail & Related papers (2023-08-18T14:23:21Z)
- Math Agents: Computational Infrastructure, Mathematical Embedding, and Genomics [0.0]
Beyond human-AI chat, large language models (LLMs) are emerging in programming, algorithm discovery, and theorem proving.
This project introduces Math Agents and mathematical embedding as fresh entries to the "Moore's Law of Mathematics".
The project aims to use Math Agents and mathematical embeddings to address the ageing issue in information systems biology.
arXiv Detail & Related papers (2023-07-04T20:16:32Z)
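As referenced in the MATHSENSEI entry above, here is a minimal, hypothetical sketch of how a tool-augmented pipeline might dispatch to the three tools that abstract names. The routing logic, function names, and plan format are all invented for illustration; the paper's actual architecture may differ.

```python
# Hypothetical sketch of a MathSensei-style tool-dispatch loop (illustrative
# only: the tool names mirror the abstract, but the routing logic, function
# names, and plan format are invented here).
from typing import Callable

def web_search(query: str) -> str:
    return f"[search results for: {query}]"      # stand-in for Bing Web Search

def run_python(program: str) -> str:
    return f"[output of executing: {program}]"   # stand-in for a Python executor

def solve_symbolic(expr: str) -> str:
    return f"[solution of: {expr}]"              # stand-in for the Wolfram-Alpha API

TOOLS: dict[str, Callable[[str], str]] = {
    "search": web_search,
    "python": run_python,
    "solver": solve_symbolic,
}

def answer(problem: str, plan: list[tuple[str, str]]) -> str:
    """Run a plan (e.g., produced by the LLM): each step names a tool and
    its input, and every tool output is appended to the running context."""
    context = problem
    for tool_name, tool_input in plan:
        context += "\n" + TOOLS[tool_name](tool_input)
    return context
```

For example, `answer("Integrate x^2.", [("solver", "integrate x^2 dx")])` would route the problem to the symbolic solver and fold the result back into the context for a final model answer.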