Related papers: Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression

URL: http://arxiv.org/abs/2512.06393v2
Date: Fri, 12 Dec 2025 09:31:52 GMT
Title: Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression
Authors: Qiming Bao, Xiaoxuan Fu,
Abstract summary: Large language models (LLMs) achieve strong performance on many natural language tasks, yet their generalisation under structured perturbations of logical rule systems remains insufficiently characterised. We present a controlled evaluation framework that probes reasoning reliability through four stress tests.
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) achieve strong performance on many natural language tasks, yet their generalisation under structured perturbations of logical rule systems remains insufficiently characterised. We present a controlled evaluation framework that probes reasoning reliability through four stress tests: (1) rule deletion, removing redundant versus essential rules from a multi-step inference chain; (2) contradictory evidence injection; (3) logic-preserving rewrites based on equivalence laws (contraposition, double negation, implication-to-disjunction, De Morgan, identity, and commutativity); and (4) multi-law equivalence stacking that composes 2 to 5 transformations. Across three representative model families (BERT, Qwen2, and LLaMA-like models), all models attain Acc$=1.0000$ on the base split and show no degradation under redundant rule deletion. In contrast, essential rule deletion yields a pronounced decrease to near-chance performance, and injecting explicit contradictions reduces accuracy to 0.0000. Under logic-preserving rewrites, accuracy is largely preserved for single-law transformations with only small degradations in a few cases, whereas multi-law stacking exposes model-dependent sensitivity: BERT matches the base condition, TinyLlama shows only marginal degradation, and Qwen2 exhibits a substantial drop. Overall, the results indicate that contemporary LLMs are generally stable under semantic-preserving reformulations, yet remain brittle to missing or inconsistent evidence and may degrade under composed logical transformations depending on the model family. The proposed framework provides a concise diagnostic tool for isolating these failure modes and for evaluating logical generalisation beyond surface-form variation.
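
To make stress tests (3) and (4) concrete, the following is a minimal sketch of logic-preserving rewrites and multi-law stacking, checked by truth table. It is not the paper's implementation: the tuple encoding of formulas and every function name (`evaluate`, `equivalent`, `contraposition`, `stack`, and so on) are assumptions introduced here for illustration only.

```python
from itertools import product

# Hypothetical encoding: formulas are nested tuples such as
# ("var", "p"), ("not", f), ("and", f, g), ("or", f, g), ("imp", f, g).

def evaluate(f, env):
    """Evaluate formula f under a truth assignment env (name -> bool)."""
    op = f[0]
    if op == "var":
        return env[f[1]]
    if op == "not":
        return not evaluate(f[1], env)
    if op == "and":
        return evaluate(f[1], env) and evaluate(f[2], env)
    if op == "or":
        return evaluate(f[1], env) or evaluate(f[2], env)
    if op == "imp":
        return (not evaluate(f[1], env)) or evaluate(f[2], env)
    raise ValueError(f"unknown operator: {op}")

def variables(f):
    """Collect the variable names that occur in f."""
    if f[0] == "var":
        return {f[1]}
    return set().union(*(variables(sub) for sub in f[1:]))

def equivalent(f, g):
    """True if f and g agree on every row of the joint truth table."""
    names = sorted(variables(f) | variables(g))
    return all(
        evaluate(f, dict(zip(names, vals))) == evaluate(g, dict(zip(names, vals)))
        for vals in product([False, True], repeat=len(names))
    )

# Each law rewrites the root of the formula and leaves it unchanged
# when the law does not apply.
def contraposition(f):
    # p -> q  becomes  not q -> not p
    if f[0] == "imp":
        return ("imp", ("not", f[2]), ("not", f[1]))
    return f

def implication_to_disjunction(f):
    # p -> q  becomes  (not p) or q
    if f[0] == "imp":
        return ("or", ("not", f[1]), f[2])
    return f

def double_negation(f):
    # f  becomes  not not f
    return ("not", ("not", f))

def de_morgan(f):
    # not (a and b) becomes (not a) or (not b), and dually for "or"
    if f[0] == "not" and f[1][0] in ("and", "or"):
        inner = f[1]
        dual = "or" if inner[0] == "and" else "and"
        return (dual, ("not", inner[1]), ("not", inner[2]))
    return f

def stack(f, laws):
    """Compose several laws in order (multi-law stacking)."""
    for law in laws:
        f = law(f)
    return f

p, q, r = ("var", "p"), ("var", "q"), ("var", "r")
rule = ("imp", ("and", p, q), r)
rewritten = stack(rule, [contraposition, implication_to_disjunction, double_negation])
```

Because every law here preserves truth conditions, `equivalent(rule, rewritten)` holds for any stack of them, whereas a non-equivalent change such as swapping antecedent and consequent is caught by the same check. A harness along these lines could confirm that a perturbed test item still has the same gold label before it is shown to a model.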

Related papers

Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation [40.210132040677]
This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary Large Language Models (LLMs). Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results.
arXiv Detail & Related papers (2026-02-19T12:24:42Z)
Evaluating Robustness of Reasoning Models on Parameterized Logical Problems [20.78623024814435]
Logic provides a controlled testbed for evaluating LLM-based reasoners. Standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability. We introduce a diagnostic benchmark for 2-SAT built from parameterized families of structured 2-CNF formulas.
arXiv Detail & Related papers (2026-02-13T06:54:25Z)
Same Answer, Different Representations: Hidden instability in VLMs [65.36933543377346]
We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness. We apply this framework to modern Vision Language Models (VLMs) across the SEEDBench, MMMU, and POPE datasets.
arXiv Detail & Related papers (2026-02-06T12:24:26Z)
VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning [4.3414302048068745]
We present a neurosymbolic framework that combines Large Language Models with SMT solvers to produce verification-guided answers. We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking, (2) semantic routing that directs different claim types to appropriate verification strategies, and (3) precise logical error localization via Minimal Correction Subsets. With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.
arXiv Detail & Related papers (2026-01-27T20:59:11Z)
Improving Symbolic Translation of Language Models for Logical Reasoning [14.474630644806723]
Small language models (LMs) often struggle with translating natural language (NL) into first-order logic (FOL). Existing approaches typically rely on self-iteration to correct these errors, but such methods depend heavily on the capabilities of the underlying model. We introduce incremental inference, which divides inference into two stages, predicate generation and FOL translation, providing greater control over model behavior.
arXiv Detail & Related papers (2026-01-14T12:47:14Z)
The Hidden Cost of Approximation in Online Mirror Descent [56.99972253009168]
Online mirror descent (OMD) is a fundamental algorithmic paradigm that underlies many algorithms in optimization, machine learning and sequential decision-making. In this work we initiate a systematic study into inexact OMD, and uncover an intricate relation between regularizer smoothness and robustness to approximation errors.
arXiv Detail & Related papers (2025-11-27T10:09:07Z)
Are Language Models Efficient Reasoners? A Perspective from Logic Programming [109.47572890883248]
Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming.
arXiv Detail & Related papers (2025-10-29T15:30:31Z)
LOGicalThought: Logic-Based Ontological Grounding of LLMs for High-Assurance Reasoning [33.30049437667383]
High-assurance reasoning requires conclusions that are accurate, verifiable, and grounded in evidence. This paper proposes a novel neurosymbolically-grounded architecture called LOGicalThought. It uses an advanced logical language and reasoner in conjunction with an LLM to construct a dual symbolic graph context and logic-based context.
arXiv Detail & Related papers (2025-10-02T00:06:23Z)
From Ambiguity to Verdict: A Semiotic-Grounded Multi-Perspective Agent for LLM Logical Reasoning [16.381034926435074]
LogicAgent is a semiotic-square-guided framework designed to jointly address logical complexity and semantic complexity. To overcome the semantic simplicity and low logical complexity of existing datasets, we introduce RepublicQA, a benchmark that reaches college-level difficulty. Experiments demonstrate that LogicAgent achieves state-of-the-art performance on RepublicQA, with a 6.25% average gain over strong baselines.
arXiv Detail & Related papers (2025-09-29T13:31:22Z)
ReaLM: Reflection-Enhanced Autonomous Reasoning with Small Language Models [76.28894983518164]
Small Language Models (SLMs) are a cost-effective alternative to Large Language Models (LLMs). They often struggle with complex reasoning due to their limited capacity and a tendency to produce mistakes or inconsistent answers. We introduce ReaLM, a reinforcement learning framework for robust and self-sufficient reasoning in vertical domains.
arXiv Detail & Related papers (2025-08-17T14:50:23Z)
Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations [13.485604499678262]
Natural language explanations play a fundamental role in Natural Language Inference (NLI). Recent work has shown that the interaction of large language models (LLMs) with theorem provers (TPs) can help verify and improve the validity of NLI explanations. This paper investigates strategies to alleviate semantic loss during autoformalisation.
arXiv Detail & Related papers (2025-05-30T06:38:39Z)
Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective [59.7140089198992]
We develop a mathematical framework that defines abstract reasoning as the ability to extract essential patterns. We introduce two novel complementary metrics: (scoreGamma) measures basic reasoning accuracy, while (scoreDelta) quantifies a model's reliance on specific symbols.
arXiv Detail & Related papers (2025-05-28T09:02:45Z)
Learning to Reason via Mixture-of-Thought for Logical Reasoning [56.24256916896427]
Mixture-of-Thought (MoT) is a framework that enables LLMs to reason across three complementary modalities: natural language, code, and truth-table. MoT adopts a two-phase design: (1) self-evolving MoT training, which jointly learns from filtered, self-generated rationales across modalities; and (2) MoT inference, which fully leverages the synergy of three modalities to produce better predictions.
arXiv Detail & Related papers (2025-05-21T17:59:54Z)
Benchmarking Gaslighting Negation Attacks Against Multimodal Large Language Models [45.63440666848143]
Multimodal Large Language Models (MLLMs) have exhibited remarkable advancements in integrating different modalities. Despite their success, MLLMs remain vulnerable to conversational adversarial inputs. We study gaslighting negation attacks: a phenomenon where models, despite initially providing correct answers, are persuaded by user-provided negations to reverse their outputs.
arXiv Detail & Related papers (2025-01-31T10:37:48Z)
Aligning with Logic: Measuring, Evaluating and Improving Logical Preference Consistency in Large Language Models [31.558429029429863]
Large Language Models (LLMs) are expected to be predictable and trustworthy to support reliable decision-making systems. This work examines logical preference consistency as a foundational requirement for building more dependable LLM systems. We show that improving consistency leads to better performance in LLM-driven logic-based algorithms.
arXiv Detail & Related papers (2024-10-03T04:34:04Z)
Towards Logically Sound Natural Language Reasoning with Logic-Enhanced Language Model Agents [3.5083201638203154]
Logic-Enhanced Language Model Agents (LELMA) is a framework that integrates large language models with formal logic. LELMA employs autoformalization to translate reasoning into logic representations, which are then used to assess logical validity. LELMA achieves high accuracy in error detection and improves reasoning correctness via self-refinement.
arXiv Detail & Related papers (2024-08-28T18:25:35Z)
Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs [87.34281749422756]
Large language models (LLMs) have achieved impressive human-like performance across various reasoning tasks. However, their mastery of underlying inferential rules still falls short of human capabilities. We propose a logic scaffolding inferential rule generation framework, to construct an inferential rule base, ULogic.
arXiv Detail & Related papers (2024-02-18T03:38:51Z)
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations [62.65877150123775]
Causal abstraction is a promising theoretical framework for explainable artificial intelligence. Existing causal abstraction methods require a brute-force search over alignments between the high-level model and the low-level one. We present distributed alignment search (DAS), which overcomes these limitations.
arXiv Detail & Related papers (2023-03-05T00:57:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.