Related papers: LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models

URL: http://arxiv.org/abs/2602.06533v1
Date: Fri, 06 Feb 2026 09:38:44 GMT
Title: LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models
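Authors: Brian Rabern, Philipp Mondorf, Barbara Plank
Abstract summary: We introduce LogicSkills, a benchmark that isolates three fundamental skills in formal reasoning: formal symbolization, countermodel construction, and validity assessment. Items are drawn from the two-variable fragment of first-order logic (without identity) and are presented in both natural English and a Carroll-style language with nonce words. Across leading models, performance is high on validity assessment but substantially lower on symbolization and countermodel construction.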
Score: 37.930280449304696
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models have demonstrated notable performance across various logical reasoning benchmarks. However, it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a unified benchmark designed to isolate three fundamental skills in formal reasoning: (i) $\textit{formal symbolization}$: translating premises into first-order logic; (ii) $\textit{countermodel construction}$: formulating a finite structure in which all premises are true while the conclusion is false; and (iii) $\textit{validity assessment}$: deciding whether a conclusion follows from a given set of premises. Items are drawn from the two-variable fragment of first-order logic (without identity) and are presented in both natural English and a Carroll-style language with nonce words. All examples are verified for correctness and non-triviality using the SMT solver Z3. Across leading models, performance is high on validity but substantially lower on symbolization and countermodel construction, suggesting reliance on surface-level patterns rather than genuine symbolic or rule-based reasoning.
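
To make the countermodel-construction skill concrete, the following is a minimal sketch of a brute-force search for a finite structure in which every premise is true and the conclusion is false. It is not the benchmark's method (the abstract says items are verified with Z3). The function name `find_countermodel`, the example argument, and the restriction to unary predicates are all assumptions made for illustration.

```python
from itertools import product


def find_countermodel(premises, conclusion, predicates, max_size=3):
    """Search structures with 1..max_size elements for a countermodel.

    premises / conclusion: functions (domain, model) -> bool.
    predicates: names of unary predicates (hypothetical simplification;
    the benchmark's fragment also allows binary relations).
    Returns a dict describing the first countermodel found, or None.
    """
    for size in range(1, max_size + 1):
        domain = range(size)
        # Interpret each predicate as a tuple of booleans, one per element.
        truth_rows = product([False, True], repeat=size)
        for extensions in product(truth_rows, repeat=len(predicates)):
            model = dict(zip(predicates, extensions))
            premises_hold = all(p(domain, model) for p in premises)
            if premises_hold and not conclusion(domain, model):
                found = {"domain": list(domain)}
                for name, ext in model.items():
                    found[name] = [d for d in domain if ext[d]]
                return found
    return None


# Illustrative argument (an assumption, not an item from the benchmark):
#   premise 1: every F is G      premise 2: something is G
#   conclusion: something is F
def all_f_are_g(domain, model):
    return all((not model["F"][x]) or model["G"][x] for x in domain)


def some_g(domain, model):
    return any(model["G"][x] for x in domain)


def some_f(domain, model):
    return any(model["F"][x] for x in domain)


result = find_countermodel([all_f_are_g, some_g], some_f, ["F", "G"])
print(result)
```

Here the search returns a one-element structure in which nothing is F and the single element is G, so both premises hold while the conclusion fails. A return value of `None` only means that no countermodel exists up to `max_size`; on its own it does not establish validity.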

Related papers

NL2LOGIC: AST-Guided Translation of Natural Language into First-Order Logic with Large Language Models [5.211983629897431]
We propose NL2LOGIC, a first-order logic translation framework. Experiments on the LogicNLI and ProofWriter benchmarks show that NL2LOGIC achieves 99 percent syntactic accuracy and improves semantic correctness by up to 30 percent over state-of-the-art baselines. Integrating NL2LOGIC into Logic-LM yields near-perfect executability and improves downstream reasoning accuracy by 31 percent compared to Logic-LM's original few-shot unconstrained translation module.
arXiv Detail & Related papers (2026-01-29T14:51:32Z)
From Hypothesis to Premises: LLM-based Backward Logical Reasoning with Selective Symbolic Translation [8.104087344683604]
We propose a novel framework, Hypothesis-driven Backward Logical Reasoning (HBLR). The core idea is to integrate confidence-aware symbolic translation with hypothesis-driven backward reasoning. HBLR consistently outperforms strong baselines in both accuracy and efficiency.
arXiv Detail & Related papers (2025-12-03T01:52:31Z)
DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models [58.439517684779936]
This paper proposes a new classical logic benchmark, DivLogicEval, consisting of natural sentences composed of diverse statements in a counterintuitive way. To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of bias and randomness inherent in Large Language Models.
arXiv Detail & Related papers (2025-09-19T04:40:46Z)
Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification [56.218970738892764]
Chain-of-Thought prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). To mitigate hallucinations in CoT that are notoriously difficult to detect, current methods operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. We propose a retrospective, step-aware formal verification framework $Safe$. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations.
arXiv Detail & Related papers (2025-06-05T03:16:08Z)
Transformer-based Language Models for Reasoning in the Description Logic ALCQ [2.8210912543324658]
We construct the natural language dataset, DELTA$_D$, using the expressive description logic language $\mathcal{ALCQ}$. We investigate the logical reasoning capabilities of a supervised fine-tuned DeBERTa-based model and two large language models. We show that the DeBERTa-based model fine-tuned on our dataset can master the entailment checking task.
arXiv Detail & Related papers (2024-10-12T18:25:34Z)
Transformers in the Service of Description Logic-based Contexts [2.8210912543324658]
We construct the natural language dataset, DELTA$_D$, using the description logic language $\mathcal{ALCQ}$. We investigate the reasoning ability of a supervised fine-tuned DeBERTa-based model and of two large language models (GPT-3.5, GPT-4) with few-shot prompting. Our results demonstrate that the DeBERTa-based model can master the reasoning task and that the performance of GPTs can improve significantly even when a small number of samples is provided.
arXiv Detail & Related papers (2023-11-15T13:23:24Z)
Three Dogmas, a Puzzle and its Solution [0.0]
In this paper we show that those assumptions contradict basic principles of Arabic. The assumptions in question are the logicians' ideas that, within natural language, words refer to objects; that 'ToBe'-constructions represent identity statements; that indefinite descriptions must be replaced by existential quantifiers to form meaningful sentences; and that symbols can have no interpretation-independent meanings.
arXiv Detail & Related papers (2023-10-29T19:20:38Z)
LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers [60.009969929857704]
Logical reasoning is an important task for artificial intelligence with potential impacts on science, mathematics, and society. In this work, we reformulate such tasks as modular neurosymbolic programming, which we call LINC. We observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate.
arXiv Detail & Related papers (2023-10-23T17:58:40Z)
MetaLogic: Logical Reasoning Explanations with Fine-Grained Structure [129.8481568648651]
We propose a benchmark to investigate models' logical reasoning capabilities in complex real-life scenarios. Based on the multi-hop chain of reasoning, the explanation form includes three main components. We evaluate the current best models' performance on this new explanation form.
arXiv Detail & Related papers (2022-10-22T16:01:13Z)
RobustLR: Evaluating Robustness to Logical Perturbation in Deductive Reasoning [25.319674132967553]
Transformers have been shown to be able to perform deductive reasoning on a logical rulebase containing rules and statements written in English natural language. We propose RobustLR to evaluate the robustness of these models to minimal logical edits in rulebases. We find that the models trained in prior works do not perform consistently on the different perturbations in RobustLR.
arXiv Detail & Related papers (2022-05-25T09:23:50Z)
Logic-Driven Context Extension and Data Augmentation for Logical Reasoning of Text [65.24325614642223]
We propose to understand logical symbols and expressions in the text to arrive at the answer. Based on such logical information, we put forward a context extension framework and a data augmentation algorithm. Our method achieves state-of-the-art performance, and both the logic-driven context extension framework and the data augmentation algorithm help improve accuracy.
arXiv Detail & Related papers (2021-05-08T10:09:36Z)
