Measuring Reasoning in LLMs: a New Dialectical Angle
- URL: http://arxiv.org/abs/2510.18134v1
- Date: Mon, 20 Oct 2025 22:08:59 GMT
- Title: Measuring Reasoning in LLMs: a New Dialectical Angle
- Authors: Soheil Abbasloo,
- Abstract summary: We present SIEV, a structured framework that evaluates reasoning of language models through dialectics.<n>Unlike conventional evaluations, SIEV assesses not only the conclusion a model reaches, but how it gets there.<n>For instance, GPT-5-chat, a recent model, loses over 40 points (out of 100) when evaluated with SIEV on GSM.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: What does it truly mean for a language model to "reason"? Most current evaluations and benchmarks reward models' correct standalone answers--but correctness alone reveals little about the process that produced them. In this work, we explore a different perspective: reasoning is not a static chain of steps, but a dynamic trajectory where ideas interact, clash, and evolve into deeper insights. To capture this dynamic, we draw on a well-established philosophical tradition: \textit{dialectics}, where reasoning unfolds through thesis, antithesis, and synthesis. Building on this, we present SIEV, a structured framework that evaluates reasoning of LLMs through dialectics. Unlike conventional evaluations, SIEV assesses not only the conclusion a model reaches, but how it gets there: its ability to resolve tension, integrate distinct ideas, and synthesize higher-order reasoning. This lens uncovers significant reasoning gaps in state-of-the-art models even under saturated benchmarks like GSM and MMLU. For instance, GPT-5-chat, a recent model, loses over 40 points (out of 100) when evaluated with SIEV on GSM. Our findings highlight that adopting a process-oriented, philosophically grounded approach enables a deeper, more rigorous, and more discriminative assessment of LLM reasoning.
Related papers
- ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction [70.53044880892196]
We introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT)<n>To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints.<n> Evaluations on 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none are yet able to extract a complete and standard reasoning chain.
arXiv Detail & Related papers (2025-11-16T07:37:09Z) - RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation [5.080252830507515]
Reasoning Process Tree Score (RPTS) is a tree structure-based metric to assess reasoning processes.<n>To validate RPTS in real-world multimodal scenarios, we construct a new benchmark, RPTS-Eval, comprising 374 images and 390 reasoning instances.
arXiv Detail & Related papers (2025-11-10T09:48:07Z) - MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables [50.29407048003165]
We present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature.<n>The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering.<n>Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning.
arXiv Detail & Related papers (2025-09-15T19:06:10Z) - Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs [28.556628696390767]
Large Language Models (LLMs) demonstrate impressive reasoning capabilities.<n>Evidence suggests much of their success stems from memorized answer-reasoning patterns rather than genuine inference.<n>We propose a five-level answer-visibility prompt framework that systematically manipulates answer cues and probes model behavior through indirect, behavioral analysis.
arXiv Detail & Related papers (2025-06-21T08:15:45Z) - Assessing LLMs in Art Contexts: Critique Generation and Theory of Mind Evaluation [0.3359875577705537]
This study explores how large language models (LLMs) perform in two areas related to art.<n>For the critique generation part, we built a system that combines Noel Carroll's evaluative framework with a broad selection of art criticism theories.<n>These critiques were compared with those written by human experts in a Turing test-style evaluation.<n>In the second part, we introduced new simple ToM tasks based on situations involving interpretation, emotion, and moral tension.
arXiv Detail & Related papers (2025-04-17T10:10:25Z) - Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment [54.62926010621013]
We introduce a novel task, code reasoning, to provide a new perspective for the reasoning abilities of large language models.<n>We summarize three meta-benchmarks based on established forms of logical reasoning, and instantiate these into eight specific benchmark tasks.<n>We present a new pathway exploration pipeline inspired by human intricate problem-solving methods.
arXiv Detail & Related papers (2025-02-17T10:39:58Z) - Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying [0.3659498819753633]
State-of-the-art Large Language models (LLMs) continue to struggle when performing logical and mathematical reasoning.<n>This paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin's model of argumentation.<n>We show that employing these critical questions can improve the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-19T18:51:30Z) - On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks [17.329365493094542]
We present a principled empirical study of the performance of GPT-4 in three domains: Game of 24, Graph Coloring, and STRIPS planning.
We observe significant performance collapse with self-critique and significant performance gains with sound external verification.
arXiv Detail & Related papers (2024-02-12T23:11:01Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
We propose a reasoning with refinement objective called ART: Ask, Refine, and Trust.
It asks necessary questions to decide when an LLM should refine its output.
It achieves a performance gain of +5 points over self-refinement baselines.
arXiv Detail & Related papers (2023-11-14T07:26:32Z) - Sentiment Analysis through LLM Negotiations [58.67939611291001]
A standard paradigm for sentiment analysis is to rely on a singular LLM and makes the decision in a single round.
This paper introduces a multi-LLM negotiation framework for sentiment analysis.
arXiv Detail & Related papers (2023-11-03T12:35:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.