Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
- URL: http://arxiv.org/abs/2211.12588v4
- Date: Mon, 23 Oct 2023 01:27:38 GMT
- Title: Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
- Authors: Wenhu Chen, Xueguang Ma, Xinyi Wang, William W. Cohen
- Abstract summary: Chain-of-thoughts prompting (CoT) is by far the state-of-the-art method for these tasks.
We propose 'Program of Thoughts' (PoT), which uses language models to express the reasoning process as a program.
PoT shows an average performance gain of around 12% over CoT across all the evaluated datasets.
- Score: 108.4568236569645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been significant progress in teaching language models to
perform step-by-step reasoning to solve complex numerical reasoning tasks.
Chain-of-thoughts prompting (CoT) is by far the state-of-the-art method for these
tasks. CoT uses language models to perform both reasoning and computation in
the multi-step 'thought' process. To disentangle computation from reasoning, we
propose 'Program of Thoughts' (PoT), which uses language models (mainly Codex)
to express the reasoning process as a program. The computation is relegated to
an external computer, which executes the generated programs to derive the
answer. We evaluate PoT on five math word problem datasets (GSM, AQuA, SVAMP,
TabMWP, MultiArith) and three financial-QA datasets (FinQA, ConvFinQA, TATQA)
for both few-shot and zero-shot setups. Under both few-shot and zero-shot
settings, PoT shows an average performance gain of around 12% over CoT
across all the evaluated datasets. By combining PoT with self-consistency
decoding, we can achieve SoTA performance on all math problem datasets and
near-SoTA performance on financial datasets. All of our data and code are
released on GitHub at https://github.com/wenhuchen/Program-of-Thoughts.
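To make the idea concrete, below is a minimal sketch of how a PoT-style pipeline can be wired up. It is illustrative only: the few-shot exemplar, the generate_programs sampling helper, and the convention that each generated program stores its result in a variable named ans are assumptions for this sketch, not the paper's released prompts or code.

```python
# Minimal PoT-style sketch (illustrative; not the paper's released prompts or code).
# Assumptions: `generate_programs` stands in for any LM sampling call, and each
# generated program binds its final result to a variable named `ans`.
from collections import Counter
from typing import Callable, List, Optional

FEW_SHOT_PROMPT = """\
# Question: Tom has 3 boxes with 12 apples each. He gives away 7 apples. How many are left?
apples_total = 3 * 12
apples_left = apples_total - 7
ans = apples_left

# Question: {question}
"""

def run_program(program: str) -> Optional[float]:
    """Execute a generated program in an isolated namespace and read `ans`."""
    namespace: dict = {}
    try:
        exec(program, namespace)   # computation happens in the interpreter, not the LM
        return namespace.get("ans")
    except Exception:
        return None                # discard programs that fail to execute

def pot_answer(question: str,
               generate_programs: Callable[[str, int], List[str]],
               num_samples: int = 8) -> Optional[float]:
    """PoT with self-consistency: sample several programs, execute each,
    and return the most common executed answer (majority vote)."""
    prompt = FEW_SHOT_PROMPT.format(question=question)
    programs = generate_programs(prompt, num_samples)   # hypothetical LM call
    results = [r for p in programs if (r := run_program(p)) is not None]
    if not results:
        return None
    return Counter(results).most_common(1)[0][0]
```

A single greedy sample corresponds to plain PoT; sampling several programs and voting over their executed results mirrors the self-consistency decoding the abstract reports combining with PoT.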
Related papers
- ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning [54.70811660561151]
Existing math datasets evaluate the reasoning abilities of large language models (LLMs) by either using the final answer or the intermediate reasoning steps derived from static examples.
We seek to use symbolic programs as a means of automatically evaluating whether a model can consistently produce correct final answers across various inputs to the program.
We observe significant accuracy drops using our proposed evaluation compared with original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.
arXiv Detail & Related papers (2024-10-24T18:02:37Z)
- To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning [55.52872152909785]
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs).
We show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks.
arXiv Detail & Related papers (2024-09-18T17:55:00Z)
- How Do Humans Write Code? Large Models Do It the Same Way Too [14.954886191356342]
Program-of-Thought (PoT) replaces natural language-based Chain-of-Thought (CoT) as the most popular method in Large Language Models.
Using PoT introduces more reasoning errors, such as incorrect formulas or flawed logic, compared to CoT.
We propose Human-Think Language (HTL), which leverages a suite of strategies that help integrate PoT and CoT.
arXiv Detail & Related papers (2024-02-24T05:40:01Z)
- Design of Chain-of-Thought in Math Problem Solving [8.582686316167973]
Chain-of-Thought (CoT) plays a crucial role in reasoning for math problem solving.
We compare conventional natural language CoT with various program CoTs, including the self-describing program, the comment-describing program, and the non-describing program.
We find that program CoTs often have superior effectiveness in math problem solving.
arXiv Detail & Related papers (2023-09-20T04:17:28Z)
- Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning [75.74103236299477]
Chain-of-thought prompting (CoT) and tool augmentation have been validated as effective practices for improving large language models.
We propose a new approach, DELI, that deliberates over the reasoning steps with tool interfaces.
Experimental results on CARP and six other datasets show that the proposed DELI mostly outperforms competitive baselines.
arXiv Detail & Related papers (2023-06-04T17:02:59Z)
- Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models [23.805926737723603]
A few manually crafted step-by-step reasoning demonstrations can be used to generate reasoning steps for large language models (LLMs).
Zero-shot-CoT appends "Let's think step by step" to the target problem statement as the input prompt to LLMs.
We show that our proposed zero-shot prompting consistently outperforms Zero-shot-CoT across all datasets by a large margin.
arXiv Detail & Related papers (2023-05-06T16:34:37Z)
- ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering [70.6359636116848]
We propose a new large-scale dataset, ConvFinQA, to study the chain of numerical reasoning in conversational question answering.
Our dataset poses a great challenge in modeling long-range, complex numerical reasoning paths in real-world conversations.
arXiv Detail & Related papers (2022-10-07T23:48:50Z)
- Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning [150.17907456113537]
We present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 grade-level problems that require mathematical reasoning.
We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting.
We propose a novel approach, PromptPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data.
arXiv Detail & Related papers (2022-09-29T08:01:04Z)