Testing the General Deductive Reasoning Capacity of Large Language
Models Using OOD Examples
- URL: http://arxiv.org/abs/2305.15269v3
- Date: Fri, 3 Nov 2023 18:45:56 GMT
- Title: Testing the General Deductive Reasoning Capacity of Large Language
Models Using OOD Examples
- Authors: Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish
Joshi, Seyed Mehran Kazemi, Najoung Kim, He He
- Abstract summary: Large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts.
We test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations.
Experiments on four LLMs of various sizes and training objectives show that they generalize to compositional proofs, but struggle to generalize to longer proofs and to produce hypothetical subproofs without explicit demonstrations.
- Score: 36.63316546586304
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Given the intractably large size of the space of proofs, any model that is
capable of general deductive reasoning must generalize to proofs of greater
complexity. Recent studies have shown that large language models (LLMs) possess
some abstract deductive reasoning ability given chain-of-thought prompts.
However, they have primarily been tested on proofs using modus ponens or of a
specific size, and from the same distribution as the in-context examples. To
measure the general deductive reasoning ability of LLMs, we test on a broad set
of deduction rules and measure their ability to generalize to more complex
proofs from simpler demonstrations from multiple angles: depth-, width-, and
compositional generalization. To facilitate systematic exploration, we
construct a new synthetic and programmable reasoning dataset that enables
control over deduction rules and proof complexity. Our experiments on four LLMs
of various sizes and training objectives show that they are able to generalize
to compositional proofs. However, they have difficulty generalizing to longer
proofs, and they require explicit demonstrations to produce hypothetical
subproofs, specifically in proof by cases and proof by contradiction.
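The dataset described above is synthetic and programmable, so proof complexity can be dialed up or down at generation time. As a rough illustration only (this is not the authors' released code; the nonsense concept vocabulary, the entity name, and the restriction to a single modus ponens chain are assumptions made for the sketch), a generator with explicit control over proof depth might look like this:

```python
import random

# Minimal sketch of a programmable deduction-example generator in the spirit of
# the dataset described in the abstract. NOT the authors' code: the concept
# vocabulary, entity name, and single modus ponens rule are illustrative only.

CONCEPTS = ["wumpus", "yumpus", "zumpus", "dumpus", "rompus", "numpus", "tumpus"]


def make_modus_ponens_example(depth: int, seed: int = 0) -> dict:
    """Build a question whose gold proof is a modus ponens chain of length `depth`.

    Returns the natural-language context (one fact plus `depth` rules), the
    question, and the gold chain-of-thought, so proof complexity is controlled
    by a single parameter.
    """
    rng = random.Random(seed)
    chain = rng.sample(CONCEPTS, depth + 1)  # depth + 1 concepts -> depth rules
    entity = "Alex"

    fact = f"{entity} is a {chain[0]}."
    rules = [f"Every {a} is a {b}." for a, b in zip(chain, chain[1:])]
    question = f"True or false: {entity} is a {chain[-1]}."

    # Gold chain-of-thought: one modus ponens application per proof step.
    steps = [
        f"{entity} is a {a}. Every {a} is a {b}. So {entity} is a {b}."
        for a, b in zip(chain, chain[1:])
    ]

    return {
        "context": " ".join([fact] + rules),
        "question": question,
        "chain_of_thought": steps,
        "answer": "true",
    }


if __name__ == "__main__":
    example = make_modus_ponens_example(depth=3, seed=42)
    print(example["context"])
    print(example["question"])
    print("\n".join(example["chain_of_thought"]))
```

Depth generalization would then be probed by prompting with shallow examples (say, depth 2) and evaluating on deeper ones; width and compositional generalization would require richer rule sets than the single modus ponens chain sketched here.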
Related papers
- MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs [80.96119560172224]
Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to problems that are more complex than the ones on which they have been trained.
We present a framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP.
arXiv Detail & Related papers (2024-10-17T12:48:14Z)
- Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs [99.76347807139615]
Reasoning encompasses two typical types: deductive reasoning and inductive reasoning.
Despite extensive research into the reasoning capabilities of Large Language Models (LLMs), most studies have failed to rigorously differentiate between inductive and deductive reasoning.
This raises an essential question: In LLM reasoning, which poses a greater challenge - deductive or inductive reasoning?
arXiv Detail & Related papers (2024-07-31T18:47:11Z)
- Lean-STaR: Learning to Interleave Thinking and Proving [53.923617816215774]
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof.
Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment.
arXiv Detail & Related papers (2024-07-14T01:43:07Z)
- Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models [81.01397924280612]
Large language models (LLMs) can achieve strong performance on various reasoning tasks when given step-by-step chain-of-thought (CoT) prompts as demonstrations.
We introduce Iter-CoT (Iterative bootstrapping in Chain-of-Thoughts Prompting), an iterative bootstrapping approach for selecting exemplars and generating reasoning chains.
arXiv Detail & Related papers (2023-04-23T13:54:39Z)
- Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought [10.524051272257614]
Large language models (LLMs) have shown remarkable reasoning capabilities given chain-of-thought prompts.
We present a new synthetic question-answering dataset called PrOntoQA, where each example is generated as a synthetic world model.
This allows us to parse the generated chain-of-thought into symbolic proofs for formal analysis.
arXiv Detail & Related papers (2022-10-03T21:34:32Z)
- multiPRover: Generating Multiple Proofs for Improved Interpretability in Rule Reasoning [73.09791959325204]
We focus on a type of linguistic formal reasoning where the goal is to reason over explicit knowledge in the form of natural language facts and rules.
A recent work, named PRover, performs such reasoning by answering a question and also generating a proof graph that explains the answer.
In our work, we address a new and challenging problem of generating multiple proof graphs for reasoning over natural language rule-bases.
arXiv Detail & Related papers (2021-06-02T17:58:35Z)
- Finding Good Proofs for Description Logic Entailments Using Recursive Quality Measures (Extended Technical Report) [15.150938933215906]
How comprehensible a proof is depends not only on the employed calculus, but also on the properties of the particular proof.
We aim for general results that hold for wide classes of calculi and measures.
arXiv Detail & Related papers (2021-04-27T12:34:13Z)
- ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language [19.917022148887273]
Transformers have been shown to emulate logical deduction over natural language theories.
We show that a generative model, called ProofWriter, can reliably generate both implications of a theory and the natural language proof(s) that support them.
arXiv Detail & Related papers (2020-12-24T00:55:46Z)
- Measuring Systematic Generalization in Neural Proof Generation with Transformers [24.157460902865854]
We investigate how well Transformer language models (TLMs) can perform logical reasoning tasks when trained on knowledge encoded in natural language.
Specifically, we perform soft theorem-proving by leveraging TLMs to generate natural language proofs.
We observe length-generalization issues when evaluated on longer-than-trained sequences.
arXiv Detail & Related papers (2020-09-30T16:54:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.