PAL: Program-aided Language Models
- URL: http://arxiv.org/abs/2211.10435v1
- Date: Fri, 18 Nov 2022 18:56:13 GMT
- Title: PAL: Program-aided Language Models
- Authors: Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming
Yang, Jamie Callan, Graham Neubig
- Abstract summary: We present Program-Aided Language models (PAL), which use an LLM to understand natural language problems and generate programs as intermediate reasoning steps.
PAL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
- Score: 112.94785609781503
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Large language models (LLMs) have recently demonstrated an impressive ability
to perform arithmetic and symbolic reasoning tasks when provided with a few
examples at test time (few-shot prompting). Much of this success can be
attributed to prompting methods for reasoning, such as chain-of-thought, that
employ LLMs for both understanding the problem description by decomposing it
into steps, as well as solving each step of the problem. While LLMs seem to be
adept at this sort of step-by-step decomposition, LLMs often make logical and
arithmetic mistakes in the solution part, even when the problem is correctly
decomposed. We present Program-Aided Language models (PAL): a new method that
uses the LLM to understand natural language problems and generate programs as
the intermediate reasoning steps, but offloads the solution step to a
programmatic runtime such as a Python interpreter. With PAL, decomposing the
natural language problem into runnable steps remains the only learning task for
the LLM, while solving is delegated to the interpreter. We experiment with 12
reasoning tasks from BIG-Bench Hard and other benchmarks, including
mathematical reasoning, symbolic reasoning, and algorithmic problems. In all
these natural language reasoning tasks, generating code using an LLM and
reasoning using a Python interpreter leads to more accurate results than much
larger models, and we set new state-of-the-art results in all 12 benchmarks.
For example, PAL using Codex achieves state-of-the-art few-shot accuracy on the
GSM benchmark of math word problems when the model is allowed only a single
decoding, surpassing PaLM-540B with chain-of-thought prompting by an absolute
8%. In three reasoning tasks from the BIG-Bench Hard benchmark, PAL outperforms
CoT by 11%. On GSM-hard, a more challenging version of GSM that we create, PAL
outperforms chain-of-thought by an absolute 40%.
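For intuition, here is a minimal PAL-style sketch, an illustration under assumed interfaces rather than the authors' released code: the few-shot prompt asks the model to answer with a short Python program, `generate_program` stands in for whatever LLM completion API is available, and the generated code is assumed to leave its result in a variable named `answer`.

```python
# Minimal PAL-style sketch (illustrative; interfaces are assumptions, not the paper's code).

FEW_SHOT_PROMPT = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis
balls. Each can has 3 tennis balls. How many tennis balls does he have now?

# solution in Python:
tennis_balls = 5
bought_balls = 2 * 3
answer = tennis_balls + bought_balls

Q: {question}

# solution in Python:
"""


def solve_with_pal(question: str, generate_program) -> object:
    """Ask the LLM for a Python solution, then let the interpreter evaluate it.

    `generate_program` is any callable mapping a prompt string to generated
    Python source (e.g. a thin wrapper around a code-generation model).
    """
    program = generate_program(FEW_SHOT_PROMPT.format(question=question))
    namespace: dict = {}
    # The interpreter, not the LLM, performs the arithmetic; in practice the
    # generated code should run in a sandboxed environment.
    exec(program, namespace)
    return namespace.get("answer")
```

The division of labor is the point: the LLM only has to decompose the question into runnable steps, while the interpreter guarantees that the arithmetic in those steps is carried out exactly.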
Related papers
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks.
Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%.
We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions containing only the essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z)
- OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step [7.7168728919692855]
We propose a framework that enables exact arithmetic in a single autoregressive step.
We use the hidden states of an LLM to control a symbolic architecture that performs arithmetic.
Our implementation using Llama 3 with OccamNet as a symbolic model (OccamLlama) achieves 100% accuracy on single arithmetic operations.
arXiv Detail & Related papers (2024-06-04T04:17:40Z)
- Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks.
CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors.
We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
arXiv Detail & Related papers (2024-04-23T12:16:05Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One frequently observed problem is that when the math questions are slightly changed, LLMs can answer them incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? [140.9751389452011]
We study the biases of large language models (LLMs) in relation to those known in children when solving arithmetic word problems.
We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features.
arXiv Detail & Related papers (2024-01-31T18:48:20Z)
- Zero-Shot Question Answering over Financial Documents using Large Language Models [0.18749305679160366]
We introduce a large language model (LLM) based approach to answer complex questions requiring multi-hop numerical reasoning over financial reports.
We use novel zero-shot prompts that guide the LLM to encode the required reasoning into a Python program or a domain-specific language.
arXiv Detail & Related papers (2023-11-19T16:23:34Z)
- Coupling Large Language Models with Logic Programming for Robust and General Reasoning from Text [5.532477732693001]
We show that a large language model can serve as a highly effective few-shot semantic parser.
It can convert natural language sentences into a logical form that serves as input for answer set programs.
We demonstrate that this method achieves state-of-the-art performance on several benchmarks, including bAbI, StepGame, CLUTRR, and gSCAN.
arXiv Detail & Related papers (2023-07-15T03:29:59Z)
- SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs).
We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer.
We evaluate SATLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
arXiv Detail & Related papers (2023-05-16T17:55:51Z)
- MathPrompter: Mathematical Reasoning using Large Language Models [7.953723258038284]
Large Language Models (LLMs) have limited performance when solving arithmetic reasoning tasks.
MathPrompter uses the zero-shot chain-of-thought prompting technique to generate multiple algebraic expressions or Python functions to solve the same math problem in different ways.
arXiv Detail & Related papers (2023-03-04T04:43:49Z)
- Large Language Models are Zero-Shot Reasoners [28.6899375595088]
Chain of thought (CoT) prompting is a technique for eliciting complex multi-step reasoning through step-by-step answer examples.
We show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer.
Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms standard zero-shot prompting (a minimal sketch of the prompting pattern follows this entry).
arXiv Detail & Related papers (2022-05-24T09:22:26Z)
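For concreteness, here is a minimal sketch of that zero-shot prompting pattern, illustrative rather than the original paper's code: a first call appends the trigger phrase to elicit free-form reasoning, and a second call appends an answer-extraction cue ("Therefore, the answer is") to read out the final answer. The `llm` argument is a placeholder for any text-completion callable.

```python
def zero_shot_cot(question: str, llm) -> str:
    """Two-stage zero-shot chain-of-thought prompting (sketch).

    `llm` is assumed to be any callable mapping a prompt string to a
    completion string, e.g. a thin wrapper around a text-completion API.
    """
    # Stage 1: elicit step-by-step reasoning with the trigger phrase.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = llm(reasoning_prompt)

    # Stage 2: ask the model to read the final answer out of its own reasoning.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return llm(answer_prompt).strip()
```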