Related papers: Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study

URL: http://arxiv.org/abs/2506.04810v2
Date: Thu, 09 Oct 2025 12:32:49 GMT
Title: Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study
Authors: Yujun Zhou, Jiayi Ye, Zipeng Ling, Yufei Han, Yue Huang, Haomin Zhuang, Zhenwen Liang, Kehan Guo, Taicheng Guo, Xiangqi Wang, Xiangliang Zhang
Abstract summary: We introduce FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions. We study how different supervision formats in fine-tuning shape reasoning abilities. We find a key trade-off: natural language supervision excels at generalization, whereas symbolic supervision is superior at instilling structurally sound, atomic reasoning steps.
Score: 40.143148197878354
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Logical reasoning is a core capability for large language models (LLMs), yet existing benchmarks that rely solely on final-answer accuracy fail to capture the quality of the reasoning process. To address this, we introduce FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall accuracy, stepwise soundness, and representation-level probing. Leveraging this framework, we conduct a comprehensive study on how different supervision formats in fine-tuning shape reasoning abilities. We fine-tune LLMs on four supervision styles: one in natural language and three symbolic variants. We find a key trade-off: natural language supervision excels at generalization to out-of-distribution and long-chain problems, whereas symbolic supervision is superior at instilling structurally sound, atomic reasoning steps. Furthermore, our probing analysis indicates that fine-tuning primarily refines the model's step-by-step generation process, rather than improving its ability to converge on an answer early. Together, our framework and analysis provide a more rigorous lens for evaluating and improving logical reasoning in LLMs. The code is available at https://github.com/YujunZhou/FineLogic.

Related papers

Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning [11.255428720705204]
We propose a framework that decomposes the process of identifying significant distinctions between cases into three-stage reasoning tasks. Our framework models cases using factual predicates called factors, organizes them into a legal knowledge hierarchy, and defines verifiable rules for identifying distinctions. We find that models consistently expend more computational resources on incorrect responses than correct ones, suggesting that "thinking longer" does not always mean "thinking smarter".
arXiv Detail & Related papers (2025-10-09T18:15:28Z)
Implicit Reasoning in Large Language Models: A Comprehensive Survey [67.53966514728383]
Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks. Recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning. This survey introduces a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies.
arXiv Detail & Related papers (2025-09-02T14:16:02Z)
From Language to Logic: A Bi-Level Framework for Structured Reasoning [6.075080928704587]
Structured reasoning over natural language inputs remains a core challenge in artificial intelligence. We propose a novel framework that maps language to logic through a two-stage process: high-level task abstraction and low-level logic generation. Our approach significantly outperforms existing baselines, with accuracy gains of up to 40%.
arXiv Detail & Related papers (2025-07-11T11:24:09Z)
CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection [60.98964268961243]
We propose that guiding models to perform a systematic and comprehensive reasoning process allows models to execute much finer-grained and accurate entailment decisions. We define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection.
arXiv Detail & Related papers (2025-06-05T17:02:52Z)
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs [34.2218892593144]
MME-Reasoning is a benchmark designed to evaluate the reasoning ability of multimodal large language models (MLLMs). Our evaluation reveals substantial limitations of state-of-the-art MLLMs when subjected to holistic assessments of logical reasoning capabilities. In addition, we conducted an in-depth analysis of approaches such as "thinking mode" and rule-based RL, which are commonly believed to enhance reasoning abilities.
arXiv Detail & Related papers (2025-05-27T15:23:23Z)
P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains [97.25943550933829]
We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains. We use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities.
arXiv Detail & Related papers (2024-10-11T19:22:57Z)
Proof of Thought : Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning [1.3003982724617653]
Large Language Models (LLMs) have revolutionized natural language processing, yet they struggle with inconsistent reasoning. This research introduces Proof of Thought, a framework that enhances the reliability and transparency of LLM outputs. Key contributions include a robust type system with sort management for enhanced logical integrity, explicit representation of rules for clear distinction between factual and inferential knowledge.
arXiv Detail & Related papers (2024-09-25T18:35:45Z)
LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z)
Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs [87.34281749422756]
Large language models (LLMs) have achieved impressive human-like performance across various reasoning tasks. However, their mastery of underlying inferential rules still falls short of human capabilities. We propose a logic scaffolding framework for generating inferential rules and use it to construct an inferential rule base, ULogic.
arXiv Detail & Related papers (2024-02-18T03:38:51Z)
LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs). Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models. We leverage these findings to construct targeted demonstration examples and fine-tuning data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems. LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning. We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning to assess model performance.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)
Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond [46.75497042978449]
Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). We aim to bridge this gap and provide comprehensive evaluations in this paper. Considering the comprehensiveness of evaluations, we include 3 early-era representative LLMs and 4 trending LLMs.
arXiv Detail & Related papers (2023-06-16T13:39:35Z)
Exploring Self-supervised Logic-enhanced Training for Large Language Models [59.227222647741094]
In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training. We devise an auto-regressive objective variant of MERIt and integrate it with two LLM series, i.e., FLAN-T5 and LLaMA, with parameter size ranging from 3 billion to 13 billion. The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM.
arXiv Detail & Related papers (2023-05-23T06:13:10Z)
Improved Logical Reasoning of Language Models via Differentiable Symbolic Programming [12.984852480664378]
Pre-trained large language models (LMs) struggle to perform logical reasoning reliably despite advances in scale and compositionality. We propose DSR-LM, a Differentiable Symbolic Reasoning framework where pre-trained LMs govern the perception of factual knowledge, and a symbolic module performs deductive reasoning.
arXiv Detail & Related papers (2023-05-05T07:24:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.