Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with
Code-based Self-Verification
- URL: http://arxiv.org/abs/2308.07921v1
- Date: Tue, 15 Aug 2023 17:58:45 GMT
- Title: Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with
Code-based Self-Verification
- Authors: Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin,
Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, Hongsheng Li
- Abstract summary: OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets.
We propose a novel and effective prompting method, explicit code-based self-verification (CSV).
We achieve an impressive zero-shot accuracy on the MATH dataset (53.9% → 84.3%).
- Score: 40.83776920225375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in large language models (LLMs) like GPT-4 and PaLM-2 has
brought significant advancements in addressing math reasoning problems. In
particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter,
shows remarkable performance on challenging math datasets. In this paper, we
explore the effect of code on enhancing LLMs' reasoning capability by
introducing different constraints on the Code Usage Frequency of GPT-4
Code Interpreter. We found that its success can be largely attributed to its
powerful skills in generating and executing code, evaluating the output of code
execution, and rectifying its solution when receiving unreasonable outputs.
Based on this insight, we propose a novel and effective prompting method,
explicit code-based self-verification (CSV), to further
boost the mathematical reasoning potential of GPT-4 Code Interpreter. This
method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to
use code to self-verify its answers. In instances where the verification state
registers as "False", the model shall automatically amend its solution,
analogous to our approach of rectifying errors during a mathematics
examination. Furthermore, we recognize that the states of the verification
result indicate the confidence of a solution, which can improve the
effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we
achieve an impressive zero-shot accuracy on the MATH dataset (53.9% → 84.3%).
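The abstract above describes two mechanisms: a zero-shot prompt that asks GPT-4 Code Interpreter to verify its own answer with code and to amend the solution when verification comes back "False", and a majority-voting scheme in which the verification state acts as a confidence weight on each sampled answer. The sketch below illustrates both ideas in Python; the exact prompt wording, the set of verification states, and the vote weights are illustrative assumptions based only on the abstract, not the authors' released prompts or implementation.

```python
from collections import Counter

# (1) A zero-shot prompt in the spirit of explicit code-based self-verification (CSV).
# The model is asked to solve with code, verify its answer with more code, and revise
# whenever verification returns False. The wording is an assumption, not the paper's prompt.
CSV_PROMPT = (
    "Solve the following math problem. You may write and execute code.\n"
    "After you reach an answer, write additional code that checks it against the\n"
    "problem's conditions and report the verification result as True or False.\n"
    "If the result is False, revise your solution and verify again before giving\n"
    "the final answer.\n\n"
    "Problem: {problem}"
)

# (2) Verification-aware majority voting: answers whose self-verification succeeded
# count for more than unverified or failed ones. The states and weights below are
# illustrative; the abstract only says the verification state indicates confidence.
def weighted_majority_vote(samples, weights=None):
    """samples: list of (final_answer, verification_state) pairs from repeated CSV runs."""
    if weights is None:
        weights = {"True": 1.0, "Uncertain": 0.5, "False": 0.1}
    scores = Counter()
    for answer, state in samples:
        scores[answer] += weights.get(state, 0.5)
    return scores.most_common(1)[0][0]

# Example: three sampled solutions; the two verified ones outvote the failed one.
print(weighted_majority_vote([("84", "True"), ("84", "True"), ("48", "False")]))  # -> 84
```

Weighting by verification state rather than filtering keeps unverified samples in the vote while still letting verified answers dominate, which matches the abstract's claim that verification states improve the effectiveness of majority voting.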
Related papers
- LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that a small set (30k) of natural language feedback examples can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z)
- JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models [110.45794710162241]
Existing work either collects large-scale math-related texts for pre-training, or relies on stronger LLMs to synthesize massive math problems.
We propose an efficient way that trains a small LLM for math problem synthesis, to efficiently generate sufficient high-quality pre-training data.
We leverage it to synthesize 6 million math problems for pre-training our JiuZhang3.0 model, which only needs to invoke the GPT-4 API 9.3k times and pre-train on 4.6B data.
arXiv Detail & Related papers (2024-05-23T09:43:19Z)
- Feedback-Generation for Programming Exercises With GPT-4 [0.0]
This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input.
The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material.
arXiv Detail & Related papers (2024-03-07T12:37:52Z)
- Whodunit: Classifying Code as Human Authored or GPT-4 Generated -- A case study on CodeChef problems [0.13124513975412253]
We use code stylometry and machine learning to distinguish between GPT-4 generated and human-authored code.
Our dataset comprises human-authored solutions from CodeChef and AI-authored solutions generated by GPT-4.
Our study shows that code stylometry is a promising approach for distinguishing between GPT-4 generated code and human-authored code.
arXiv Detail & Related papers (2024-03-06T19:51:26Z)
- Towards AI-Assisted Synthesis of Verified Dafny Methods [1.0187122752343796]
Existing large language models show a severe lack of proficiency in verified programming.
We show how to improve two pretrained models' proficiency in the Dafny verification-aware language.
arXiv Detail & Related papers (2024-02-01T00:07:23Z)
- Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [69.99031792995348]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code.
We find that code prompting exhibits a high-performance boost for multiple LLMs.
Our analysis of GPT-3.5 reveals that the code formatting of the input problem is essential for performance improvement.
arXiv Detail & Related papers (2024-01-18T15:32:24Z)
- Leveraging Print Debugging to Improve Code Generation in Large Language Models [63.63160583432348]
Large language models (LLMs) have made significant progress in code generation tasks.
However, their performance in tackling programming problems with complex data structures and algorithms remains suboptimal.
We propose an in-context learning approach that guides LLMs to debug by using a "print debug" method.
arXiv Detail & Related papers (2024-01-10T18:37:59Z)
- MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning [52.97768001837269]
We present a method to fine-tune open-source language models, enabling them to use code for modeling and deriving math equations.
We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions.
This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems.
arXiv Detail & Related papers (2023-10-05T17:52:09Z)
- Reformulating Domain Adaptation of Large Language Models as Adapt-Retrieve-Revise [34.4546877502907]
GPT-4 can generate content with hallucinations in specific domains such as Chinese law, hindering its application in these areas.
This paper introduces a simple and effective domain adaptation framework for GPT-4 by reformulating generation as an adapt-retrieve-revise process.
In the zero-shot setting of four Chinese legal tasks, our method improves accuracy by 33.3% compared to the direct generation by GPT-4.
arXiv Detail & Related papers (2023-10-05T05:55:06Z)
- AI-assisted coding: Experiments with GPT-4 [0.22366638308792727]
GPT-4 can generate tests with substantial coverage, but many of the tests fail when applied to the associated code.
These findings suggest that while AI coding tools are very powerful, they still require humans in the loop to ensure validity and accuracy of the results.
arXiv Detail & Related papers (2023-04-25T22:59:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.