Evaluating Robustness of Reasoning Models on Parameterized Logical Problems
- URL: http://arxiv.org/abs/2602.12665v1
- Date: Fri, 13 Feb 2026 06:54:25 GMT
- Title: Evaluating Robustness of Reasoning Models on Parameterized Logical Problems
- Authors: Naïm Es-sebbani, Esteban Marquer, Yakoub Salhi, Zied Bouraoui
- Abstract summary: Logic provides a controlled testbed for evaluating LLM-based reasoners. Standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability. We introduce a diagnostic benchmark for 2-SAT built from parameterized families of structured 2-CNF formulas.
- Score: 20.78623024814435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability. We introduce a diagnostic benchmark for 2-SAT built from parameterized families of structured 2-CNF formulas, where satisfiability is characterized by the implication graph and can be tuned along interpretable axes. Our generators isolate distinct competencies and failure modes: (i) contradiction-cycle UNSAT cores with controllable size and imbalance, (ii) SAT instances with a prescribed fraction of free variables to control solution multiplicity, (iii) planted backbones that modulate propagation, (iv) late bridge clauses that couple otherwise monotone regions to probe sensitivity to ordering and revision, and (v) symmetry/duplication variants that test abstraction under renaming and redundant structure. We evaluate LLM-based reasoners on decision accuracy and assignment validity, and quantify robustness under semantics-preserving perturbations such as clause reordering, filler clauses, and variable renaming. Across models, we observe sharp performance transitions under targeted structural interventions even when surface statistics are held fixed, revealing brittleness regimes that are invisible to aggregate SAT accuracy.
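The satisfiability criterion the abstract relies on, that a 2-CNF formula is satisfiable exactly when no variable shares a strongly connected component of the implication graph with its own negation, can be sketched as a short decision procedure. The encoding and function names below are illustrative and are not the paper's generator or evaluation code; literals are assumed to be nonzero integers in the DIMACS style (`-x` means NOT `x`).

```python
def solve_2sat(n, clauses):
    """Decide satisfiability of a 2-CNF formula over variables 1..n.

    Each clause is a pair of literals. A clause (a or b) contributes the
    implications -a -> b and -b -> a to the implication graph; the formula
    is UNSAT iff some variable and its negation land in the same SCC.
    """
    def idx(lit):
        # Map literal to a node index: 2*(v-1) for v, 2*(v-1)+1 for -v.
        return 2 * (abs(lit) - 1) + (lit < 0)

    g = [[] for _ in range(2 * n)]   # implication graph
    rg = [[] for _ in range(2 * n)]  # reversed graph (for Kosaraju)
    for a, b in clauses:
        for u, v in ((idx(-a), idx(b)), (idx(-b), idx(a))):
            g[u].append(v)
            rg[v].append(u)

    # Pass 1: iterative DFS on g, recording nodes by finish time.
    seen = [False] * (2 * n)
    order = []
    for s in range(2 * n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, iter(g[s]))]
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                order.append(node)
                stack.pop()
            elif not seen[nxt]:
                seen[nxt] = True
                stack.append((nxt, iter(g[nxt])))

    # Pass 2: label SCCs on the reversed graph in reverse finish order.
    comp = [-1] * (2 * n)
    label = 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = label
        stack = [s]
        while stack:
            u = stack.pop()
            for v in rg[u]:
                if comp[v] == -1:
                    comp[v] = label
                    stack.append(v)
        label += 1

    # SAT iff no variable shares an SCC with its negation.
    return all(comp[2 * i] != comp[2 * i + 1] for i in range(n))
```

A contradiction-cycle UNSAT core of the kind generator (i) produces is any clause set forcing `x -> -x -> x`; for example `solve_2sat(1, [(1, 1), (-1, -1)])` is unsatisfiable, while `solve_2sat(2, [(1, 2), (-1, 2)])` is satisfiable.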
Related papers
- Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation [40.210132040677]
This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary Large Language Models (LLMs). Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results.
arXiv Detail & Related papers (2026-02-19T12:24:42Z) - Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study [0.3437656066916039]
We propose CLUES, a framework that models Text-to-SQL as a two-stage process. It decomposes semantic uncertainty into an ambiguity score and an instability score. CLUES improves failure prediction over state-of-the-art kernel-entropy baselines.
arXiv Detail & Related papers (2026-02-12T14:46:20Z) - Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure [58.89643769707751]
We study latent chain-of-thought as a manipulable causal process in representation space. We find that latent-step budgets behave less like homogeneous extra depth and more like staged functionality with non-local routing. These results motivate mode-conditional and stability-aware analyses as more reliable tools for interpreting and improving latent reasoning systems.
arXiv Detail & Related papers (2026-02-09T15:25:12Z) - VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning [4.3414302048068745]
We present a neurosymbolic framework that combines Large Language Models with SMT solvers to produce verification-guided answers. We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking, (2) semantic routing that directs different claim types to appropriate verification strategies, and (3) precise logical error localization via Minimal Correction Subsets. With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.
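The "formal semantic equivalence checking" step can be illustrated with a toy propositional stand-in; VERGE itself uses SMT solvers, so the brute-force truth-table check below is only a sketch of the idea, and the function name is hypothetical.

```python
from itertools import product

def equivalent(f, g, n_vars):
    """Brute-force semantic equivalence check for two boolean functions.

    Enumerates all 2**n_vars assignments and verifies that f and g agree
    on every one. An SMT solver decides the same question symbolically,
    without enumeration, and scales to richer theories than booleans.
    """
    return all(f(*vals) == g(*vals)
               for vals in product([False, True], repeat=n_vars))
```

For instance, De Morgan's law makes `not (a and b)` equivalent to `(not a) or (not b)`, so two model answers phrased those two ways would be counted as the same claim under this check.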
arXiv Detail & Related papers (2026-01-27T20:59:11Z) - Less Is More for Multi-Step Logical Reasoning of LLM Generalisation Under Rule Removal, Paraphrasing, and Compression [3.3492355863487275]
Large language models (LLMs) achieve strong performance on many natural language tasks, yet their generalisation under structured perturbations of logical rule systems remains insufficiently characterised. We present a controlled evaluation framework that probes reasoning reliability through four stress tests.
arXiv Detail & Related papers (2025-12-06T10:49:50Z) - Statistical Hypothesis Testing for Auditing Robustness in Language Models [49.1574468325115]
We introduce distribution-based perturbation analysis, a framework that reformulates perturbation analysis as a frequentist hypothesis testing problem. We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling. We show how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models.
arXiv Detail & Related papers (2025-06-09T17:11:07Z) - Syntactic Control of Language Models by Posterior Inference [53.823006836309695]
Controlling the syntactic structure of text generated by language models is valuable for applications requiring clarity, stylistic consistency, or interpretability. We argue that sampling algorithms based on posterior inference can effectively enforce a target constituency structure during generation. Our approach combines sequential Monte Carlo, which estimates the posterior distribution by sampling from a proposal distribution, with a syntactic tagger that ensures that each generated token aligns with the desired syntactic structure.
arXiv Detail & Related papers (2025-06-08T14:01:34Z) - Quantifying perturbation impacts for large language models [49.1574468325115]
We introduce Distribution-Based Perturbation Analysis (DBPA), a framework that reformulates perturbation analysis as a frequentist hypothesis testing problem. We demonstrate the effectiveness of DBPA in evaluating perturbation impacts, showing its versatility for perturbation analysis.
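The frequentist flavour of this kind of perturbation test can be sketched with a simple permutation test. This is a simplified stand-in, not DBPA's actual procedure: it assumes model outputs have already been reduced to scalar semantic-similarity scores (DBPA works with distributions in a low-dimensional semantic space), and the function name and parameters are illustrative.

```python
import random

def permutation_pvalue(base, pert, n_resamples=5000, seed=0):
    """Permutation test on similarity scores of outputs under original
    vs. perturbed prompts.

    Null hypothesis: the two score samples come from the same output
    distribution. The p-value estimates how often a random relabelling
    of the pooled scores yields a mean difference at least as large as
    the one observed.
    """
    rng = random.Random(seed)
    observed = abs(sum(base) / len(base) - sum(pert) / len(pert))
    pooled = list(base) + list(pert)
    k = len(base)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:k]) / k
                   - sum(pooled[k:]) / (len(pooled) - k))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

A small p-value indicates the perturbation genuinely shifted the output distribution rather than merely adding sampling noise; identical score samples give a p-value near 1.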
arXiv Detail & Related papers (2024-12-01T16:13:09Z) - Data-light Uncertainty Set Merging with Admissibility [5.140740197135575]
This article introduces a Synthetics, Aggregation, and Test inversion (SAT) approach for merging diverse and potentially dependent uncertainty sets into a single unified set. SAT is motivated by the challenge of integrating uncertainty sets when only the initial sets and their control levels are available. A key theoretical contribution is a rigorous analysis of SAT's properties, including its admissibility in the context of deterministic set merging.
arXiv Detail & Related papers (2024-10-16T03:52:47Z) - Fully Stochastic Trust-Region Sequential Quadratic Programming for Equality-Constrained Optimization Problems [62.83783246648714]
We propose a trust-region stochastic sequential quadratic programming algorithm (TR-StoSQP) to solve nonlinear optimization problems with stochastic objectives and deterministic equality constraints.
The algorithm adaptively selects the trust-region radius and, compared to the existing line-search StoSQP schemes, allows us to utilize indefinite Hessian matrices.
arXiv Detail & Related papers (2022-11-29T05:52:17Z) - Tractable Inference in Credal Sentential Decision Diagrams [116.6516175350871]
Probabilistic sentential decision diagrams are logic circuits where the inputs of disjunctive gates are annotated by probability values.
We develop the credal sentential decision diagrams, a generalisation of their probabilistic counterpart that allows for replacing the local probabilities with credal sets of mass functions.
For a first empirical validation, we consider a simple application based on noisy seven-segment display images.
arXiv Detail & Related papers (2020-08-19T16:04:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.