Reasoning Like Program Executors
- URL: http://arxiv.org/abs/2201.11473v1
- Date: Thu, 27 Jan 2022 12:28:24 GMT
- Title: Reasoning Like Program Executors
- Authors: Xinyu Pi, Qian Liu, Bei Chen, Morteza Ziyadi, Zeqi Lin, Yan Gao, Qiang
Fu, Jian-Guang Lou, Weizhu Chen
- Abstract summary: POET empowers language models to harvest the reasoning knowledge possessed by program executors via a data-driven approach.
POET can significantly boost model performance on natural language reasoning.
POET opens a new direction for reasoning-enhanced pre-training.
- Score: 48.819113224699976
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Reasoning over natural language is a long-standing goal for the research
community. However, studies have shown that existing language models are
inadequate in reasoning. To address the issue, we present POET, a new
pre-training paradigm. Through pre-training language models with programs and
their execution results, POET empowers language models to harvest the reasoning
knowledge possessed by program executors via a data-driven approach. POET is
conceptually simple and can be instantiated by different kinds of programs. In
this paper, we show three empirically powerful instances, i.e., POET-Math,
POET-Logic, and POET-SQL. Experimental results on six benchmarks demonstrate
that POET can significantly boost model performance on natural language
reasoning, such as numerical reasoning, logical reasoning, and multi-hop
reasoning. Taking the DROP benchmark as a representative example, POET improves
the F1 metric of BART from 69.2% to 80.6%. Furthermore, POET also benefits giant
language models, pushing the F1 metric of T5-11B to 87.6% and achieving new
state-of-the-art performance on DROP. POET opens a new direction for
reasoning-enhanced pre-training, and we hope our analysis sheds light on
future research into reasoning like program executors.
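The core recipe is concrete enough to sketch: sample programs, run them through an executor, and pre-train the model to predict the executor's output. Below is a minimal Python sketch of how POET-Math-style (program, result) pairs could be synthesized; the operator set, value ranges, and `source`/`target` field names are illustrative assumptions, not the paper's released pipeline.

```python
import random

OPS = ["+", "-", "*"]  # assumed operator set, for illustration only

def sample_program(n_terms: int = 3) -> str:
    """Sample a small arithmetic 'program' as a flat expression."""
    expr = str(random.randint(0, 99))
    for _ in range(n_terms - 1):
        expr += f" {random.choice(OPS)} {random.randint(0, 99)}"
    return expr

def make_pretraining_pair(n_terms: int = 3) -> dict:
    """Pair a program with its executor's output.

    The executor here is Python's own eval; POET's key idea is that the
    executor supplies the supervision, so the model learns to imitate
    execution rather than memorize surface text.
    """
    program = sample_program(n_terms)
    result = eval(program)  # safe here: we generated the input ourselves
    return {"source": program, "target": str(result)}

if __name__ == "__main__":
    random.seed(0)
    for _ in range(3):
        print(make_pretraining_pair())
```

A seq2seq model such as BART or T5 would then be pre-trained to map `source` to `target` before being fine-tuned on a downstream benchmark like DROP.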
Related papers
- Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance? [26.91104188917787]
Large language models (LLMs) have demonstrated remarkable generalization abilities in mathematics and logical reasoning tasks.
Our research aims to verify which programming languages and language features seen at the pre-training stage affect downstream logical inference performance.
arXiv Detail & Related papers (2024-10-09T10:13:13Z) - Proceedings of the First International Workshop on Next-Generation Language Models for Knowledge Representation and Reasoning (NeLaMKRR 2024) [16.282850445579857]
Reasoning is an essential component of human intelligence as it plays a fundamental role in our ability to think critically.
The recent leap forward in natural language processing, with the emergence of transformer-based language models, hints at the possibility that these models exhibit reasoning abilities.
Despite ongoing discussions about what reasoning is in language models, it remains difficult to pin down the extent to which these models are actually capable of reasoning.
arXiv Detail & Related papers (2024-10-07T02:31:47Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level test sets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - On the Reliability and Explainability of Language Models for Program
Generation [15.569926313298337]
We study the capabilities and limitations of automated program generation approaches.
We employ advanced explainable AI approaches to highlight the tokens that significantly contribute to the code transformation.
Our analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences.
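As one concrete reading of the token-highlighting step, the sketch below uses occlusion-style attribution, a common explainable-AI technique: mask each input token in turn and measure how much the model's score moves. The `score_fn` here is a toy stand-in, not the paper's actual model or method.

```python
def occlusion_attribution(tokens, score_fn, mask="<unk>"):
    """Score each token by how much masking it changes the model output.

    `score_fn` maps a token list to a scalar, e.g. the probability the
    model assigns to the reference code transformation. It is a toy
    placeholder here, not a real code model.
    """
    base = score_fn(tokens)
    deltas = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask] + tokens[i + 1:]
        deltas.append(base - score_fn(occluded))
    return deltas

# Toy stand-in model that rewards the presence of the identifier "sorted".
toy_score = lambda toks: 1.0 if "sorted" in toks else 0.2
print(occlusion_attribution(["return", "sorted", "(", "xs", ")"], toy_score))
# -> [0.0, 0.8, 0.0, 0.0, 0.0]: the prediction hinges on a single token
```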
arXiv Detail & Related papers (2023-02-19T14:59:52Z) - Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models.
We evaluate DIVERSE on the latest language model, code-davinci, and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
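The verifier-plus-voting idea can be illustrated in a few lines: sample multiple reasoning paths, score each with a verifier, and aggregate scores per final answer. The sketch below is a simplified reading; the real DIVERSE verifier scores individual reasoning steps, and the numbers here are invented.

```python
from collections import defaultdict

def verified_vote(paths):
    """Pick an answer by verifier-weighted voting over reasoning paths.

    `paths` is a list of (final_answer, verifier_score) tuples, one per
    sampled chain of thought. Plain majority-vote self-consistency is
    the special case where every score equals 1.0.
    """
    weight = defaultdict(float)
    for answer, score in paths:
        weight[answer] += score
    return max(weight, key=weight.get)

# Hypothetical verifier scores over five sampled reasoning paths.
samples = [("42", 0.91), ("42", 0.88), ("17", 0.35), ("42", 0.79), ("17", 0.41)]
print(verified_vote(samples))  # -> "42"
```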
arXiv Detail & Related papers (2022-06-06T03:38:36Z) - PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z) - ANNA: Enhanced Language Representation for Question Answering [5.713808202873983]
We propose an extended pre-training task and a new neighbor-aware mechanism that attends more to neighboring tokens to capture the richness of context for pre-training language modeling.
We show how each approach affects performance individually and how the approaches perform when jointly applied in pre-training models.
Our best model achieves new state-of-the-art results of 95.7% F1 and 90.6% EM on SQuAD 1.1 and also outperforms existing pre-trained language models such as RoBERTa, ALBERT, ELECTRA, and XLNet.
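One plausible reading of the "neighbor-aware mechanism" is an additive attention bias toward nearby tokens. The NumPy sketch below implements that reading; the window size, bias strength, and single-head setup are illustrative assumptions rather than the paper's architecture.

```python
import numpy as np

def neighbor_biased_attention(q, k, v, window=2, bias=1.0):
    """Scaled dot-product attention with an additive bonus for keys
    within `window` positions of the query (one reading of
    'attending neighboring tokens more')."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                         # (n, n) logits
    n = scores.shape[0]
    idx = np.arange(n)
    near = np.abs(idx[:, None] - idx[None, :]) <= window  # local mask
    scores = scores + bias * near                         # boost local keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))     # 6 tokens, 8-dim representations
print(neighbor_biased_attention(x, x, x).shape)  # (6, 8)
```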
arXiv Detail & Related papers (2022-03-28T05:26:52Z) - Enforcing Consistency in Weakly Supervised Semantic Parsing [68.2211621631765]
We explore the use of consistency between the output programs for related inputs to reduce the impact of spurious programs.
We find that a more consistent formalism leads to improved model performance even without consistency-based training.
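The consistency intuition can be sketched directly: among candidate programs that already reach the correct denotation, prefer those whose abstracted form recurs across related inputs, since spurious programs rarely generalize. The abstraction function and examples below are invented for illustration, not the paper's formalism.

```python
from collections import Counter

def abstract_form(program: str) -> str:
    """Abstract away constants so related inputs can share a template.
    Toy abstraction: every digit becomes a placeholder."""
    return "".join("#" if ch.isdigit() else ch for ch in program)

def pick_consistent(candidates_per_input):
    """For each input, keep the candidate whose abstracted form is most
    frequent across all inputs, discouraging spurious one-off programs.

    `candidates_per_input`: one list of candidate programs per input,
    where every candidate already reaches the correct denotation.
    """
    counts = Counter(
        abstract_form(p) for cands in candidates_per_input for p in cands
    )
    return [
        max(cands, key=lambda p: counts[abstract_form(p)])
        for cands in candidates_per_input
    ]

related = [
    ["count(rows > 3)", "sum(col_2) - 7"],  # both hit the right answer...
    ["count(rows > 5)", "first(col_9)"],    # ...but only one generalizes
]
print(pick_consistent(related))  # prefers the shared count(...) template
```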
arXiv Detail & Related papers (2021-07-13T03:48:04Z)