Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression
- URL: http://arxiv.org/abs/2512.06393v2
- Date: Fri, 12 Dec 2025 09:31:52 GMT
- Title: Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression
- Authors: Qiming Bao, Xiaoxuan Fu,
- Abstract summary: Large language models (LLMs) achieve strong performance on many natural language tasks, yet their generalisation under structured perturbations of logical rule systems remains insufficiently characterised.<n>We present a controlled evaluation framework that probes reasoning reliability through four stress tests.
- Score: 3.3492355863487275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) achieve strong performance on many natural language tasks, yet their generalisation under structured perturbations of logical rule systems remains insufficiently characterised. We present a controlled evaluation framework that probes reasoning reliability through four stress tests: (1) rule deletion, removing redundant versus essential rules from a multi-step inference chain; (2) contradictory evidence injection; (3) logic-preserving rewrites based on equivalence laws (contraposition, double negation, implication-to-disjunction, De Morgan, identity, and commutativity); and (4) multi-law equivalence stacking that composes 2--5 transformations. Across three representative model families -- BERT, Qwen2, and LLaMA-like models -- all models attain Acc$=1.0000$ on the base split and show no degradation under redundant rule deletion. In contrast, essential rule deletion yields a pronounced decrease to near-chance performance, and injecting explicit contradictions reduces accuracy to 0.0000. Under logic-preserving rewrites, accuracy is largely preserved for single-law transformations with only small degradations in a few cases, whereas multi-law stacking exposes model-dependent sensitivity: BERT matches the base condition, TinyLlama shows only marginal degradation, and Qwen2 exhibits a substantial drop. Overall, the results indicate that contemporary LLMs are generally stable under semantic-preserving reformulations, yet remain brittle to missing or inconsistent evidence and may degrade under composed logical transformations depending on the model family. The proposed framework provides a concise diagnostic tool for isolating these failure modes and for evaluating logical generalisation beyond surface-form variation.
Related papers
- Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation [40.210132040677]
This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary Large Language Models (LLMs)<n>Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results.
arXiv Detail & Related papers (2026-02-19T12:24:42Z) - Evaluating Robustness of Reasoning Models on Parameterized Logical Problems [20.78623024814435]
Logic provides a controlled testbed for evaluating LLM-based reasoners.<n>Standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability.<n>We introduce a diagnostic benchmark for 2-SAT built from parameterized families of structured 2--CNF formulas.
arXiv Detail & Related papers (2026-02-13T06:54:25Z) - Same Answer, Different Representations: Hidden instability in VLMs [65.36933543377346]
We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness.<n>We apply this framework to modern Vision Language Models (VLMs) across the SEEDBench, MMMU, and POPE datasets.
arXiv Detail & Related papers (2026-02-06T12:24:26Z) - VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning [4.3414302048068745]
We present a neurosymbolic framework that combines Large Language Models with SMT solvers to produce verification-guided answers.<n>We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking, (2) semantic routing that directs different claim types to appropriate verification strategies, and (3) precise logical error localization via Minimal Correction Subsets.<n>With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.
arXiv Detail & Related papers (2026-01-27T20:59:11Z) - Improving Symbolic Translation of Language Models for Logical Reasoning [14.474630644806723]
Small language models (LMs) often struggle with translating natural language (NL) into first-order logic (FOL)<n>Existing approaches typically rely on self-iteration to correct these errors, but such methods depend heavily on the capabilities of the underlying model.<n>We introduce incremental inference, which divides inference into two stages, predicate generation and FOL translation, providing greater control over model behavior.
arXiv Detail & Related papers (2026-01-14T12:47:14Z) - The Hidden Cost of Approximation in Online Mirror Descent [56.99972253009168]
Online mirror descent (OMD) is a fundamental algorithmic paradigm that underlies many algorithms in optimization, machine learning and sequential decision-making.<n>In this work we initiate a systematic study into inexact OMD, and uncover an intricate relation between regularizer smoothness and robustness to approximation errors.
arXiv Detail & Related papers (2025-11-27T10:09:07Z) - Are Language Models Efficient Reasoners? A Perspective from Logic Programming [109.47572890883248]
Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency.<n>We propose a framework for assessing LM reasoning efficiency through the lens of logic programming.
arXiv Detail & Related papers (2025-10-29T15:30:31Z) - LOGicalThought: Logic-Based Ontological Grounding of LLMs for High-Assurance Reasoning [33.30049437667383]
High-assurance reasoning requires conclusions that are accurate, verifiable, and grounded in evidence.<n>This paper proposes a novel neurosymbolically-grounded architecture called LOGicalThought.<n>It uses an advanced logical language and reasoner in conjunction with an LLM to construct a dual symbolic graph context and logic-based context.
arXiv Detail & Related papers (2025-10-02T00:06:23Z) - From Ambiguity to Verdict: A Semiotic-Grounded Multi-Perspective Agent for LLM Logical Reasoning [16.381034926435074]
LogicAgent is a semiotic-square-guided framework designed to jointly address logical complexity and semantic complexity.<n>To overcome the semantic simplicity and low logical complexity of existing datasets, we introduce RepublicQA, a benchmark that reaches college-level difficulty.<n>Experiments demonstrate that LogicAgent achieves state-of-the-art performance on RepublicQA, with a 6.25% average gain over strong baselines.
arXiv Detail & Related papers (2025-09-29T13:31:22Z) - ReaLM: Reflection-Enhanced Autonomous Reasoning with Small Language Models [76.28894983518164]
Small Language Models (SLMs) are a cost-effective alternative to Large Language Models (LLMs)<n>They often struggle with complex reasoning due to their limited capacity and a tendency to produce mistakes or inconsistent answers.<n>We introduce ReaLM, a reinforcement learning framework for robust and self-sufficient reasoning in vertical domains.
arXiv Detail & Related papers (2025-08-17T14:50:23Z) - Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations [13.485604499678262]
Natural language explanations play a fundamental role in Natural Language Inference (NLI)<n>Recent work has shown that the interaction of large language models (LLMs) with theorem provers (TPs) can help verify and improve the validity of NLI explanations.<n>This paper investigates strategies to alleviate semantic loss during autoformalisation.
arXiv Detail & Related papers (2025-05-30T06:38:39Z) - Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective [59.7140089198992]
We develop a mathematic framework that defines abstract reasoning as the ability to extract essential patterns.<n>We introduce two novel complementary metrics: (scoreGamma) measures basic reasoning accuracy, while (scoreDelta) quantifies a model's reliance on specific symbols.
arXiv Detail & Related papers (2025-05-28T09:02:45Z) - Learning to Reason via Mixture-of-Thought for Logical Reasoning [56.24256916896427]
Mixture-of-Thought (MoT) is a framework that enables LLMs to reason across three complementary modalities: natural language, code, and truth-table.<n>MoT adopts a two-phase design: (1) self-evolving MoT training, which jointly learns from filtered, self-generated rationales across modalities; and (2) MoT inference, which fully leverages the synergy of three modalities to produce better predictions.
arXiv Detail & Related papers (2025-05-21T17:59:54Z) - Benchmarking Gaslighting Negation Attacks Against Multimodal Large Language Models [45.63440666848143]
Multimodal Large Language Models (MLLMs) have exhibited remarkable advancements in integrating different modalities.<n>Despite their success, MLLMs remain vulnerable to conversational adversarial inputs.<n>We study gaslighting negation attacks: a phenomenon where models, despite initially providing correct answers, are persuaded by user-provided negations to reverse their outputs.
arXiv Detail & Related papers (2025-01-31T10:37:48Z) - Aligning with Logic: Measuring, Evaluating and Improving Logical Preference Consistency in Large Language Models [31.558429029429863]
Large Language Models (LLMs) are expected to be predictable and trustworthy to support reliable decision-making systems.<n>This work examines logical preference consistency as a foundational requirement for building more dependable LLM systems.<n>We show that improving consistency leads to better performance in LLM-driven logic-based algorithms.
arXiv Detail & Related papers (2024-10-03T04:34:04Z) - Towards Logically Sound Natural Language Reasoning with Logic-Enhanced Language Model Agents [3.5083201638203154]
Logic-Enhanced Language Model Agents (LELMA) is a framework that integrates large language models with formal logic.<n>LeLMA employs autoformalization to translate reasoning into logic representations, which are then used to assess logical validity.<n>LeLMA achieves high accuracy in error detection and improves reasoning correctness via self-refinement.
arXiv Detail & Related papers (2024-08-28T18:25:35Z) - Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs [87.34281749422756]
Large language models (LLMs) have achieved impressive human-like performance across various reasoning tasks.
However, their mastery of underlying inferential rules still falls short of human capabilities.
We propose a logic scaffolding inferential rule generation framework, to construct an inferential rule base, ULogic.
arXiv Detail & Related papers (2024-02-18T03:38:51Z) - Finding Alignments Between Interpretable Causal Variables and
Distributed Neural Representations [62.65877150123775]
Causal abstraction is a promising theoretical framework for explainable artificial intelligence.
Existing causal abstraction methods require a brute-force search over alignments between the high-level model and the low-level one.
We present distributed alignment search (DAS), which overcomes these limitations.
arXiv Detail & Related papers (2023-03-05T00:57:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.