Testing the General Deductive Reasoning Capacity of Large Language
Models Using OOD Examples
- URL: http://arxiv.org/abs/2305.15269v3
- Date: Fri, 3 Nov 2023 18:45:56 GMT
- Title: Testing the General Deductive Reasoning Capacity of Large Language
Models Using OOD Examples
- Authors: Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish
Joshi, Seyed Mehran Kazemi, Najoung Kim, He He
- Abstract summary: Large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts.
We test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations.
Experiments on four LLMs of various sizes and training objectives show that they generalize to compositional proofs, but struggle to generalize to longer proofs and to produce hypothetical subproofs without explicit demonstrations.
- Score: 36.63316546586304
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Given the intractably large size of the space of proofs, any model that is
capable of general deductive reasoning must generalize to proofs of greater
complexity. Recent studies have shown that large language models (LLMs) possess
some abstract deductive reasoning ability given chain-of-thought prompts.
However, they have primarily been tested on proofs using modus ponens or of a
specific size, and from the same distribution as the in-context examples. To
measure the general deductive reasoning ability of LLMs, we test on a broad set
of deduction rules and measure their ability to generalize to more complex
proofs from simpler demonstrations from multiple angles: depth-, width-, and
compositional generalization. To facilitate systematic exploration, we
construct a new synthetic and programmable reasoning dataset that enables
control over deduction rules and proof complexity. Our experiments on four LLMs
of various sizes and training objectives show that they are able to generalize
to compositional proofs. However, they have difficulty generalizing to longer
proofs, and they require explicit demonstrations to produce hypothetical
subproofs, specifically in proof by cases and proof by contradiction.
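The dataset described above is synthetic and programmable, so proof complexity can be dialed up or down at generation time. As a rough illustration only (this is not the authors' released code; the nonsense concept vocabulary, the entity name, and the restriction to a single modus ponens chain are assumptions made for the sketch), a generator with explicit control over proof depth might look like this:

```python
import random

# Minimal sketch of a programmable deduction-example generator in the spirit of
# the dataset described in the abstract. NOT the authors' code: the concept
# vocabulary, entity name, and single modus ponens rule are illustrative only.

CONCEPTS = ["wumpus", "yumpus", "zumpus", "dumpus", "rompus", "numpus", "tumpus"]


def make_modus_ponens_example(depth: int, seed: int = 0) -> dict:
    """Build a question whose gold proof is a modus ponens chain of length `depth`.

    Returns the natural-language context (one fact plus `depth` rules), the
    question, and the gold chain-of-thought, so proof complexity is controlled
    by a single parameter.
    """
    rng = random.Random(seed)
    chain = rng.sample(CONCEPTS, depth + 1)  # depth + 1 concepts -> depth rules
    entity = "Alex"

    fact = f"{entity} is a {chain[0]}."
    rules = [f"Every {a} is a {b}." for a, b in zip(chain, chain[1:])]
    question = f"True or false: {entity} is a {chain[-1]}."

    # Gold chain-of-thought: one modus ponens application per proof step.
    steps = [
        f"{entity} is a {a}. Every {a} is a {b}. So {entity} is a {b}."
        for a, b in zip(chain, chain[1:])
    ]

    return {
        "context": " ".join([fact] + rules),
        "question": question,
        "chain_of_thought": steps,
        "answer": "true",
    }


if __name__ == "__main__":
    example = make_modus_ponens_example(depth=3, seed=42)
    print(example["context"])
    print(example["question"])
    print("\n".join(example["chain_of_thought"]))
```

Depth generalization would then be probed by prompting with shallow examples (say, depth 2) and evaluating on deeper ones; width and compositional generalization would require richer rule sets than the single modus ponens chain sketched here.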
Related papers
- MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs [80.96119560172224]
Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to problems that are more complex than the ones on which they have been trained.
We present a framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP.
arXiv Detail & Related papers (2024-10-17T12:48:14Z)
- Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs [99.76347807139615]
Reasoning encompasses two typical types: deductive reasoning and inductive reasoning.
Despite extensive research into the reasoning capabilities of Large Language Models (LLMs), most studies have failed to rigorously differentiate between inductive and deductive reasoning.
This raises an essential question: In LLM reasoning, which poses a greater challenge - deductive or inductive reasoning?
arXiv Detail & Related papers (2024-07-31T18:47:11Z)
- Lean-STaR: Learning to Interleave Thinking and Proving [53.923617816215774]
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof.
Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment.
arXiv Detail & Related papers (2024-07-14T01:43:07Z)
- Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models [81.01397924280612]
Large language models (LLMs) can achieve strong performance on various reasoning tasks when given step-by-step chain-of-thought (CoT) prompts as demonstrations.
We introduce Iter-CoT (Iterative bootstrapping in Chain-of-Thoughts Prompting), an iterative bootstrapping approach for selecting exemplars and generating reasoning chains.
arXiv Detail & Related papers (2023-04-23T13:54:39Z)
- Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought [10.524051272257614]
Large language models (LLMs) have shown remarkable reasoning capabilities given chain-of-thought prompts.
We present a new synthetic question-answering dataset called PrOntoQA, where each example is generated as a synthetic world model.
This allows us to parse the generated chain-of-thought into symbolic proofs for formal analysis.
arXiv Detail & Related papers (2022-10-03T21:34:32Z)
- multiPRover: Generating Multiple Proofs for Improved Interpretability in Rule Reasoning [73.09791959325204]
We focus on a type of linguistic formal reasoning where the goal is to reason over explicit knowledge in the form of natural language facts and rules.
A recent work, named PRover, performs such reasoning by answering a question and also generating a proof graph that explains the answer.
In our work, we address a new and challenging problem of generating multiple proof graphs for reasoning over natural language rule-bases.
arXiv Detail & Related papers (2021-06-02T17:58:35Z)
- Finding Good Proofs for Description Logic Entailments Using Recursive Quality Measures (Extended Technical Report) [15.150938933215906]
How comprehensible a proof is depends not only on the employed calculus, but also on the properties of the particular proof.
We aim for general results that hold for wide classes of calculi and measures.
arXiv Detail & Related papers (2021-04-27T12:34:13Z)
- ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language [19.917022148887273]
Transformers have been shown to emulate logical deduction over natural language theories.
We show that a generative model, called ProofWriter, can reliably generate both implications of a theory and the natural language proof(s) that support them.
arXiv Detail & Related papers (2020-12-24T00:55:46Z)
- Measuring Systematic Generalization in Neural Proof Generation with Transformers [24.157460902865854]
We investigate how well Transformer language models (TLMs) can perform logical reasoning tasks when trained on knowledge encoded in natural language.
Specifically, we perform soft theorem-proving by leveraging TLMs to generate natural language proofs.
We observe length-generalization issues when evaluated on longer-than-trained sequences.
arXiv Detail & Related papers (2020-09-30T16:54:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.