Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
- URL: http://arxiv.org/abs/2211.12588v4
- Date: Mon, 23 Oct 2023 01:27:38 GMT
- Title: Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
- Authors: Wenhu Chen, Xueguang Ma, Xinyi Wang, William W. Cohen
- Abstract summary: Chain-of-thoughts prompting (CoT) is by far the state-of-the-art method for these tasks.
We propose 'Program of Thoughts' (PoT), which uses language models to express the reasoning process as a program.
PoT shows an average performance gain of around 12% over CoT across all the evaluated datasets.
- Score: 108.4568236569645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been significant progress in teaching language models to
perform step-by-step reasoning to solve complex numerical reasoning tasks.
Chain-of-thoughts prompting (CoT) is by far the state-of-the-art method for these
tasks. CoT uses language models to perform both reasoning and computation in
the multi-step 'thought' process. To disentangle computation from reasoning, we
propose 'Program of Thoughts' (PoT), which uses language models (mainly Codex)
to express the reasoning process as a program. The computation is relegated to
an external computer, which executes the generated programs to derive the
answer. We evaluate PoT on five math word problem datasets (GSM, AQuA, SVAMP,
TabMWP, MultiArith) and three financial-QA datasets (FinQA, ConvFinQA, TATQA)
for both few-shot and zero-shot setups. Under both few-shot and zero-shot
settings, PoT shows an average performance gain of around 12% over CoT
across all the evaluated datasets. By combining PoT with self-consistency
decoding, we can achieve SoTA performance on all math problem datasets and
near-SoTA performance on financial datasets. All of our data and code are
released on GitHub at https://github.com/wenhuchen/Program-of-Thoughts.
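To make the idea concrete, below is a minimal sketch of how a PoT-style pipeline can be wired up. It is illustrative only: the few-shot exemplar, the generate_programs sampling helper, and the convention that each generated program stores its result in a variable named ans are assumptions for this sketch, not the paper's released prompts or code.

```python
# Minimal PoT-style sketch (illustrative; not the paper's released prompts or code).
# Assumptions: `generate_programs` stands in for any LM sampling call, and each
# generated program binds its final result to a variable named `ans`.
from collections import Counter
from typing import Callable, List, Optional

FEW_SHOT_PROMPT = """\
# Question: Tom has 3 boxes with 12 apples each. He gives away 7 apples. How many are left?
apples_total = 3 * 12
apples_left = apples_total - 7
ans = apples_left

# Question: {question}
"""

def run_program(program: str) -> Optional[float]:
    """Execute a generated program in an isolated namespace and read `ans`."""
    namespace: dict = {}
    try:
        exec(program, namespace)   # computation happens in the interpreter, not the LM
        return namespace.get("ans")
    except Exception:
        return None                # discard programs that fail to execute

def pot_answer(question: str,
               generate_programs: Callable[[str, int], List[str]],
               num_samples: int = 8) -> Optional[float]:
    """PoT with self-consistency: sample several programs, execute each,
    and return the most common executed answer (majority vote)."""
    prompt = FEW_SHOT_PROMPT.format(question=question)
    programs = generate_programs(prompt, num_samples)   # hypothetical LM call
    results = [r for p in programs if (r := run_program(p)) is not None]
    if not results:
        return None
    return Counter(results).most_common(1)[0][0]
```

A single greedy sample corresponds to plain PoT; sampling several programs and voting over their executed results mirrors the self-consistency decoding the abstract reports combining with PoT.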
Related papers
- ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning [54.70811660561151]
Existing math datasets evaluate the reasoning abilities of large language models (LLMs) by either using the final answer or the intermediate reasoning steps derived from static examples.
We seek to use symbolic programs as a means of automatically evaluating whether a model can consistently produce correct final answers across various inputs to the program.
We observe significant accuracy drops using our proposed evaluation compared with original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.
arXiv Detail & Related papers (2024-10-24T18:02:37Z)
- To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning [55.52872152909785]
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs).
We show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks.
arXiv Detail & Related papers (2024-09-18T17:55:00Z)
- How Do Humans Write Code? Large Models Do It the Same Way Too [14.954886191356342]
Program-of-Thought (PoT) replaces natural language-based Chain-of-Thought (CoT) as the most popular method in Large Language Models.
Using PoT introduces more reasoning errors, such as incorrect formulas or flawed logic, compared to CoT.
We propose Human-Think Language (HTL), which leverages a suite of strategies that help integrate PoT and CoT.
arXiv Detail & Related papers (2024-02-24T05:40:01Z)
- Design of Chain-of-Thought in Math Problem Solving [8.582686316167973]
Chain-of-Thought (CoT) plays a crucial role in reasoning for math problem solving.
We compare conventional natural language CoT with various program CoTs, including the self-describing program, the comment-describing program, and the non-describing program.
We find that program CoTs often have superior effectiveness in math problem solving.
arXiv Detail & Related papers (2023-09-20T04:17:28Z)
- Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning [75.74103236299477]
Chain-of-thought prompting (CoT) and tool augmentation have been validated as effective practices for improving large language models.
We propose a new approach, DELI, that deliberates over the reasoning steps with tool interfaces.
Experimental results on CARP and six other datasets show that the proposed DELI mostly outperforms competitive baselines.
arXiv Detail & Related papers (2023-06-04T17:02:59Z)
- Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models [23.805926737723603]
A few manually crafted step-by-step reasoning demonstrations can be used to generate reasoning steps for large language models (LLMs).
Zero-shot-CoT appends "Let's think step by step" to the target problem statement as the input prompt to LLMs.
We show that our proposed zero-shot prompting consistently outperforms Zero-shot-CoT across all datasets by a large margin.
arXiv Detail & Related papers (2023-05-06T16:34:37Z)
- ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering [70.6359636116848]
We propose a new large-scale dataset, ConvFinQA, to study the chain of numerical reasoning in conversational question answering.
Our dataset poses a great challenge in modeling long-range, complex numerical reasoning paths in real-world conversations.
arXiv Detail & Related papers (2022-10-07T23:48:50Z)
- Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning [150.17907456113537]
We present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 grade-level problems that require mathematical reasoning.
We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting.
We propose a novel approach, PromptPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data.
arXiv Detail & Related papers (2022-09-29T08:01:04Z)