Steamroller Problems: An Evaluation of LLM Reasoning Capability with Automated Theorem Prover Strategies
- URL: http://arxiv.org/abs/2407.20244v1
- Date: Wed, 17 Jul 2024 22:49:23 GMT
- Title: Steamroller Problems: An Evaluation of LLM Reasoning Capability with Automated Theorem Prover Strategies
- Authors: Lachlan McGinness, Peter Baumgartner
- Abstract summary: We evaluate the performance of GPT4, GPT3.5 Turbo and Google's recent Gemini model on problems from a steamroller domain.
We found that the models' performance when using the ATP reasoning strategies was comparable to one-shot chain of thought.
- Score: 0.18416014644193066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study presents the first examination of the ability of Large Language Models (LLMs) to follow reasoning strategies that are used to guide Automated Theorem Provers (ATPs). We evaluate the performance of GPT4, GPT3.5 Turbo and Google's recent Gemini model on problems from a steamroller domain. In addition to determining accuracy, we make use of the Natural Language Processing library spaCy to explore new methods of investigating LLMs' reasoning capabilities. This led to one alarming result: the low correlation between correct reasoning and correct answers for any of the tested models. We found that the models' performance when using the ATP reasoning strategies was comparable to one-shot chain of thought, and we observe that attention to uncertainty in the accuracy results is critical when drawing conclusions about model performance. Consistent with previous speculation, we confirm that LLMs have a preference for, and are best able to follow, bottom-up reasoning processes. However, the reasoning strategies can still be beneficial for deriving small and relevant sets of formulas for external processing by a trusted inference engine.
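The abstract's point about uncertainty in accuracy results can be illustrated with a small sketch. The abstract does not say how the authors quantified uncertainty, so the Wilson score interval used here is an assumption, and the function names and the counts (14 and 11 correct out of 20) are hypothetical, not figures from the paper.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an accuracy of correct/total.

    Assumption: each problem is an independent pass/fail trial.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Return True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Hypothetical counts for two prompting strategies on the same 20 problems.
    strategy_a = wilson_interval(14, 20)
    strategy_b = wilson_interval(11, 20)
    print("strategy A:", strategy_a)
    print("strategy B:", strategy_b)
    print("intervals overlap:", intervals_overlap(strategy_a, strategy_b))
```

With problem sets this small the intervals are wide, so a difference of a few correct answers between two strategies may not support a claim that one outperforms the other.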
Related papers
- S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference.
Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
- Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving [55.895917967408586]
Existing approaches to mathematical reasoning with large language models rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated Reasoning (TIR) for precise computation.
We propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework that enables LLMs to personalize their reasoning strategy spontaneously.
arXiv Detail & Related papers (2025-02-17T16:56:23Z)
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference.
This paper presents the first comprehensive study on the prevalent issue of overthinking in these models.
We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z)
- Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying [0.3659498819753633]
State-of-the-art Large Language models (LLMs) continue to struggle when performing logical and mathematical reasoning.
This paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin's model of argumentation.
We show that employing these critical questions can improve the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-19T18:51:30Z)
- Lachesis: Predicting LLM Inference Accuracy using Structural Properties of Reasoning Paths [12.377041655669728]
We introduce Lachesis, a predictive model for self-consistency based LLM inferences.
We empirically evaluate it using AutoFL, a recently proposed LLM-based fault localisation technique.
Results suggest that Lachesis can predict the correctness of answers with a precision of up to 0.8136.
arXiv Detail & Related papers (2024-12-11T10:56:47Z)
- Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
arXiv Detail & Related papers (2024-11-06T22:02:30Z)
- Will LLMs Replace the Encoder-Only Models in Temporal Relation Classification? [2.1861408994125253]
Large Language Models (LLM) have recently shown promising performance in temporal reasoning tasks.
Recent studies have tested only closed-source LLMs on the task of detecting temporal relations.
arXiv Detail & Related papers (2024-10-14T13:10:45Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs).
Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models.
We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.