Language Models Are Greedy Reasoners: A Systematic Formal Analysis of
Chain-of-Thought
- URL: http://arxiv.org/abs/2210.01240v1
- Date: Mon, 3 Oct 2022 21:34:32 GMT
- Title: Language Models Are Greedy Reasoners: A Systematic Formal Analysis of
Chain-of-Thought
- Authors: Abulhair Saparov and He He
- Abstract summary: Large language models (LLMs) have shown remarkable reasoning capabilities given chain-of-thought prompts.
We present a new synthetic question-answering dataset called PrOntoQA, where each example is generated as a synthetic world model.
This allows us to parse the generated chain-of-thought into symbolic proofs for formal analysis.
- Score: 10.524051272257614
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown remarkable reasoning capabilities
given chain-of-thought prompts (examples with intermediate reasoning steps).
Existing benchmarks measure reasoning ability indirectly, by evaluating
accuracy on downstream tasks such as mathematical reasoning. However, it is
unclear how these models obtain the answers and whether they rely on simple
heuristics rather than the generated chain-of-thought. To enable systematic
exploration of the reasoning ability of LLMs, we present a new synthetic
question-answering dataset called PrOntoQA, where each example is generated
from a synthetic world model represented in first-order logic. This allows us
to parse the generated chain-of-thought into symbolic proofs for formal
analysis. Our analysis on InstructGPT and GPT-3 shows that LLMs are quite
capable of making correct individual deduction steps, and so are generally
capable of reasoning, even in fictional contexts. However, they have difficulty
with proof planning: When multiple valid deduction steps are available, they
are not able to systematically explore the different options.
Related papers
- JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [51.99046112135311]
We introduce JustLogic, a synthetically generated deductive reasoning benchmark for rigorous evaluation of Large Language Models.
JustLogic is highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures.
Our experimental results reveal that most state-of-the-art (SOTA) LLMs perform significantly worse than the human average.
arXiv Detail & Related papers (2025-01-24T15:49:10Z) - Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying [0.3659498819753633]
State-of-the-art Large Language models (LLMs) continue to struggle when performing logical and mathematical reasoning.
This paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin's model of argumentation.
We show that employing these critical questions can improve the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-19T18:51:30Z) - P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains [97.25943550933829]
We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains.
We use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities.
arXiv Detail & Related papers (2024-10-11T19:22:57Z) - Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation [110.71955853831707]
We view LMs as deriving new conclusions by aggregating indirect reasoning paths seen at pre-training time.
We formalize the reasoning paths as random walk paths on the knowledge/reasoning graphs.
Experiments and analysis on multiple KG and CoT datasets reveal the effect of training on random walk paths.
arXiv Detail & Related papers (2024-02-05T18:25:51Z) - Neuro-Symbolic Integration Brings Causal and Reliable Reasoning Proofs [95.07757789781213]
Two lines of approaches are adopted for complex reasoning with LLMs.
One line of work prompts LLMs with various reasoning structures, while the structural outputs can be naturally regarded as intermediate reasoning steps.
The other line of work adopt LLM-free declarative solvers to do the reasoning task, rendering higher reasoning accuracy but lacking interpretability due to the black-box nature of the solvers.
We present a simple extension to the latter line of work. Specifically, we showcase that the intermediate search logs generated by Prolog interpreters can be accessed and interpreted into human-readable reasoning.
arXiv Detail & Related papers (2023-11-16T11:26:21Z) - Are LLMs Rigorous Logical Reasoner? Empowering Natural Language Proof Generation with Contrastive Stepwise Decoding [10.421832675327712]
We introduce contrastive decoding to stepwise proof generation, making use of negative reasoning paths to strengthen the model's capacity for logical deduction.
Experiments on EntailmentBank underscore the success of our method in augmenting the proof planning abilities of language models.
arXiv Detail & Related papers (2023-11-12T05:12:49Z) - Towards a Mechanistic Interpretation of Multi-Step Reasoning
Capabilities of Language Models [107.07851578154242]
Language models (LMs) have strong multi-step (i.e., procedural) reasoning capabilities.
It is unclear whether LMs perform tasks by cheating with answers memorized from pretraining corpus, or, via a multi-step reasoning mechanism.
We show that MechanisticProbe is able to detect the information of the reasoning tree from the model's attentions for most examples.
arXiv Detail & Related papers (2023-10-23T01:47:29Z) - LAMBADA: Backward Chaining for Automated Reasoning in Natural Language [11.096348678079574]
Backward Chaining algorithm, called LAMBADA, decomposes reasoning into four sub-modules.
We show that LAMBADA achieves sizable accuracy boosts over state-of-the-art forward reasoning methods.
arXiv Detail & Related papers (2022-12-20T18:06:03Z) - Evaluating Step-by-Step Reasoning through Symbolic Verification [20.156768135017007]
Pre-trained language models (LMs) have shown remarkable reasoning performance for in-context learning.
LMLP enjoys more than $25%$ higher accuracy than chain-of-thoughts (CoT) on length generalization benchmarks even with smaller model sizes.
arXiv Detail & Related papers (2022-12-16T19:30:01Z) - MetaLogic: Logical Reasoning Explanations with Fine-Grained Structure [129.8481568648651]
We propose a benchmark to investigate models' logical reasoning capabilities in complex real-life scenarios.
Based on the multi-hop chain of reasoning, the explanation form includes three main components.
We evaluate the current best models' performance on this new explanation form.
arXiv Detail & Related papers (2022-10-22T16:01:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.