Reasoning Like Program Executors
- URL: http://arxiv.org/abs/2201.11473v1
- Date: Thu, 27 Jan 2022 12:28:24 GMT
- Title: Reasoning Like Program Executors
- Authors: Xinyu Pi, Qian Liu, Bei Chen, Morteza Ziyadi, Zeqi Lin, Yan Gao, Qiang
Fu, Jian-Guang Lou, Weizhu Chen
- Abstract summary: POET empowers language models to harvest the reasoning knowledge possessed by program executors via a data-driven approach.
POET can significantly boost model performance on natural language reasoning.
POET opens a new avenue for reasoning-enhancement pre-training.
- Score: 48.819113224699976
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Reasoning over natural language is a long-standing goal for the research
community. However, studies have shown that existing language models are
inadequate in reasoning. To address the issue, we present POET, a new
pre-training paradigm. Through pre-training language models with programs and
their execution results, POET empowers language models to harvest the reasoning
knowledge possessed by program executors via a data-driven approach. POET is
conceptually simple and can be instantiated by different kinds of programs. In
this paper, we show three empirically powerful instances, i.e., POET-Math,
POET-Logic, and POET-SQL. Experimental results on six benchmarks demonstrate
that POET can significantly boost model performance on natural language
reasoning, such as numerical reasoning, logical reasoning, and multi-hop
reasoning. Taking the DROP benchmark as a representative example, POET improves
the F1 metric of BART from 69.2% to 80.6%. Furthermore, POET also shines with
giant language models, pushing the F1 metric of T5-11B to 87.6% and achieving
new state-of-the-art performance on DROP. POET opens a new avenue for
reasoning-enhancement pre-training, and we hope our analysis sheds light on
future research into reasoning like program executors.
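To make the pre-training recipe concrete, here is a minimal sketch of how program-execution pairs of the kind POET describes might be constructed. The field names, the separator token, and the use of Python's eval and sqlite3 as stand-in executors are illustrative assumptions, not the paper's released data pipeline.

```python
import random
import sqlite3

def make_math_example(rng: random.Random) -> dict:
    """POET-Math-style instance (assumed format): an arithmetic program
    paired with the result produced by executing it."""
    a, b, c = (rng.randint(1, 999) for _ in range(3))
    program = f"{a} + {b} - {c}"
    return {"context": program, "target": str(eval(program))}  # executor = Python eval

def make_sql_example(rng: random.Random) -> dict:
    """POET-SQL-style instance (assumed format): a SQL query over a small
    synthetic table, paired with the answer returned by a real SQL executor."""
    rows = [(f"item{i}", rng.randint(1, 100)) for i in range(5)]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (name TEXT, value INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    query = "SELECT name FROM t ORDER BY value DESC LIMIT 1"
    answer = conn.execute(query).fetchone()[0]
    conn.close()
    table_text = " ; ".join(f"{name} | {value}" for name, value in rows)
    return {"context": f"{query} </s> {table_text}", "target": answer}

if __name__ == "__main__":
    rng = random.Random(0)
    print(make_math_example(rng))  # prints one arithmetic program/result pair
    print(make_sql_example(rng))   # prints one SQL query/answer pair
```

A sequence-to-sequence model pre-trained on many such context/target pairs is then fine-tuned on the downstream natural language reasoning task, which is the transfer POET relies on.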
Related papers
- EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [54.354203142828084]
We present the task of equivalence checking as a new way to evaluate the code reasoning abilities of large language models.
We introduce EquiBench, a dataset of 2400 program pairs spanning four programming languages and six equivalence categories.
Our evaluation of 17 state-of-the-art LLMs shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
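As a rough illustration of the equivalence-checking setup summarized in the EquiBench entry above, the snippet below builds one candidate program pair and a yes/no prompt. The pair, the prompt wording, and the query_llm stub are assumptions for illustration, not the benchmark's actual data or evaluation protocol.

```python
# Two programs that compute the same function by different algorithms; the
# model must judge semantic equivalence without executing them.
PROGRAM_A = '''
def total(n):
    return sum(range(1, n + 1))
'''

PROGRAM_B = '''
def total(n):
    return n * (n + 1) // 2
'''

def build_prompt(prog_a: str, prog_b: str) -> str:
    return (
        "Are the following two programs semantically equivalent "
        "(same output for every valid input)? Answer Yes or No.\n\n"
        f"Program A:\n{prog_a}\nProgram B:\n{prog_b}\nAnswer:"
    )

def query_llm(prompt: str) -> str:
    """Placeholder for a call to whichever LLM is being evaluated."""
    raise NotImplementedError

if __name__ == "__main__":
    print(build_prompt(PROGRAM_A, PROGRAM_B))
```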
- Inductive Linguistic Reasoning with Large Language Models [0.0]
We investigate the abilities of large language models to perform abstract multilingual reasoning through the lens of linguistic puzzles.
We employ a two-stage procedure, first generating analogical exemplars with a language model, and then applying them in-context.
Our results on the modeLing dataset show that analogical prompting is effective in eliciting models' knowledge of language grammar similarities.
arXiv Detail & Related papers (2024-12-09T03:37:11Z)
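The two-stage procedure described in the entry above (first generate analogical exemplars with a language model, then apply them in-context) might look roughly like the sketch below. The generate_text callable stands in for any LLM completion API, and the prompt wording is an assumption rather than the paper's exact templates.

```python
from typing import Callable

def solve_puzzle_with_analogies(puzzle: str, generate_text: Callable[[str], str]) -> str:
    """Two-stage analogical prompting (illustrative sketch).

    Stage 1: ask the model to invent solved puzzles from typologically
    similar languages. Stage 2: place those exemplars in-context before
    the target puzzle and ask for the final answer.
    """
    exemplar_prompt = (
        "Write two short, fully solved translation puzzles from languages "
        "whose grammar resembles the one in this puzzle:\n" + puzzle
    )
    exemplars = generate_text(exemplar_prompt)  # stage 1: analogical exemplars

    final_prompt = (
        "Here are solved puzzles from similar languages:\n"
        f"{exemplars}\n\nNow solve this puzzle step by step:\n{puzzle}"
    )
    return generate_text(final_prompt)  # stage 2: in-context application

if __name__ == "__main__":
    # Trivial echo model, just to show the call flow end to end.
    echo = lambda prompt: f"[model output for: {prompt[:40]}...]"
    print(solve_puzzle_with_analogies("Given the paired example sentences, translate 'the bird sings'.", echo))
```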
- Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance? [26.91104188917787]
Large language models (LLMs) have demonstrated remarkable generalization abilities in mathematics and logical reasoning tasks.
Our research aims to verify which programming languages and features during pre-training affect logical inference performance.
arXiv Detail & Related papers (2024-10-09T10:13:13Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- On the Reliability and Explainability of Language Models for Program Generation [15.569926313298337]
We study the capabilities and limitations of automated program generation approaches.
We employ advanced explainable AI approaches to highlight the tokens that significantly contribute to the code transformation.
Our analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences.
arXiv Detail & Related papers (2023-02-19T14:59:52Z)
- Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models.
We evaluate DIVERSE on the code-davinci language model and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
arXiv Detail & Related papers (2022-06-06T03:38:36Z)
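The step-aware verification idea behind DIVERSE, as summarized above, can be caricatured as verifier-weighted voting over sampled reasoning paths. The data shapes and the toy scoring function below are assumptions for illustration; the paper trains an actual verifier model.

```python
from collections import defaultdict
from typing import Callable, Sequence

def verified_vote(
    reasoning_paths: Sequence[tuple[list[str], str]],
    score_step: Callable[[str], float],
) -> str:
    """Aggregate sampled (steps, answer) pairs by verifier-weighted voting.

    Each candidate answer accumulates the product of per-step verifier scores
    of the paths that reach it; the highest-scoring answer is returned.
    """
    weights: dict[str, float] = defaultdict(float)
    for steps, answer in reasoning_paths:
        path_score = 1.0
        for step in steps:
            path_score *= score_step(step)
        weights[answer] += path_score
    return max(weights, key=weights.get)

if __name__ == "__main__":
    paths = [
        (["3 * 4 = 12", "12 + 5 = 17"], "17"),
        (["3 * 4 = 12", "12 + 5 = 18"], "18"),
        (["4 + 5 = 9", "9 * 3 = 27"], "27"),
    ]
    # Toy verifier that happens to favor steps from the correct chain.
    print(verified_vote(paths, lambda step: 0.9 if "17" in step else 0.5))
```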
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- ANNA: Enhanced Language Representation for Question Answering [5.713808202873983]
We show how these approaches affect performance individually and when they are jointly applied in pre-training models.
We propose an extended pre-training task, and a new neighbor-aware mechanism that attends neighboring tokens more to capture the richness of context for pre-training language modeling.
Our best model achieves new state-of-the-art results of 95.7% F1 and 90.6% EM on SQuAD 1.1 and also outperforms existing pre-trained language models such as RoBERTa, ALBERT, ELECTRA, and XLNet.
arXiv Detail & Related papers (2022-03-28T05:26:52Z)
- Enforcing Consistency in Weakly Supervised Semantic Parsing [68.2211621631765]
We explore the use of consistency between the output programs for related inputs to reduce the impact of spurious programs.
We find that a more consistent formalism leads to improved model performance even without consistency-based training.
arXiv Detail & Related papers (2021-07-13T03:48:04Z)