Related papers: Measuring Reasoning in LLMs: a New Dialectical Angle

Measuring Reasoning in LLMs: a New Dialectical Angle

URL: http://arxiv.org/abs/2510.18134v1
Date: Mon, 20 Oct 2025 22:08:59 GMT
Title: Measuring Reasoning in LLMs: a New Dialectical Angle
Authors: Soheil Abbasloo,
Abstract summary: We present SIEV, a structured framework that evaluates reasoning of language models through dialectics.<n>Unlike conventional evaluations, SIEV assesses not only the conclusion a model reaches, but how it gets there.<n>For instance, GPT-5-chat, a recent model, loses over 40 points (out of 100) when evaluated with SIEV on GSM.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: What does it truly mean for a language model to "reason"? Most current evaluations and benchmarks reward models' correct standalone answers--but correctness alone reveals little about the process that produced them. In this work, we explore a different perspective: reasoning is not a static chain of steps, but a dynamic trajectory where ideas interact, clash, and evolve into deeper insights. To capture this dynamic, we draw on a well-established philosophical tradition: \textit{dialectics}, where reasoning unfolds through thesis, antithesis, and synthesis. Building on this, we present SIEV, a structured framework that evaluates reasoning of LLMs through dialectics. Unlike conventional evaluations, SIEV assesses not only the conclusion a model reaches, but how it gets there: its ability to resolve tension, integrate distinct ideas, and synthesize higher-order reasoning. This lens uncovers significant reasoning gaps in state-of-the-art models even under saturated benchmarks like GSM and MMLU. For instance, GPT-5-chat, a recent model, loses over 40 points (out of 100) when evaluated with SIEV on GSM. Our findings highlight that adopting a process-oriented, philosophically grounded approach enables a deeper, more rigorous, and more discriminative assessment of LLM reasoning.

Related papers

ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction [70.53044880892196]
We introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT)<n>To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints.<n> Evaluations on 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none are yet able to extract a complete and standard reasoning chain.
arXiv Detail & Related papers (2025-11-16T07:37:09Z)
RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation [5.080252830507515]
Reasoning Process Tree Score (RPTS) is a tree structure-based metric to assess reasoning processes.<n>To validate RPTS in real-world multimodal scenarios, we construct a new benchmark, RPTS-Eval, comprising 374 images and 390 reasoning instances.
arXiv Detail & Related papers (2025-11-10T09:48:07Z)
MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables [50.29407048003165]
We present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature.<n>The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering.<n>Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning.
arXiv Detail & Related papers (2025-09-15T19:06:10Z)
Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs [28.556628696390767]
Large Language Models (LLMs) demonstrate impressive reasoning capabilities.<n>Evidence suggests much of their success stems from memorized answer-reasoning patterns rather than genuine inference.<n>We propose a five-level answer-visibility prompt framework that systematically manipulates answer cues and probes model behavior through indirect, behavioral analysis.
arXiv Detail & Related papers (2025-06-21T08:15:45Z)
Assessing LLMs in Art Contexts: Critique Generation and Theory of Mind Evaluation [0.3359875577705537]
This study explores how large language models (LLMs) perform in two areas related to art.<n>For the critique generation part, we built a system that combines Noel Carroll's evaluative framework with a broad selection of art criticism theories.<n>These critiques were compared with those written by human experts in a Turing test-style evaluation.<n>In the second part, we introduced new simple ToM tasks based on situations involving interpretation, emotion, and moral tension.
arXiv Detail & Related papers (2025-04-17T10:10:25Z)
Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment [54.62926010621013]
We introduce a novel task, code reasoning, to provide a new perspective for the reasoning abilities of large language models.<n>We summarize three meta-benchmarks based on established forms of logical reasoning, and instantiate these into eight specific benchmark tasks.<n>We present a new pathway exploration pipeline inspired by human intricate problem-solving methods.
arXiv Detail & Related papers (2025-02-17T10:39:58Z)
Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying [0.3659498819753633]
State-of-the-art Large Language models (LLMs) continue to struggle when performing logical and mathematical reasoning.<n>This paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin's model of argumentation.<n>We show that employing these critical questions can improve the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-19T18:51:30Z)
On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks [17.329365493094542]
We present a principled empirical study of the performance of GPT-4 in three domains: Game of 24, Graph Coloring, and STRIPS planning. We observe significant performance collapse with self-critique and significant performance gains with sound external verification.
arXiv Detail & Related papers (2024-02-12T23:11:01Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
We propose a reasoning with refinement objective called ART: Ask, Refine, and Trust. It asks necessary questions to decide when an LLM should refine its output. It achieves a performance gain of +5 points over self-refinement baselines.
arXiv Detail & Related papers (2023-11-14T07:26:32Z)
Sentiment Analysis through LLM Negotiations [58.67939611291001]
A standard paradigm for sentiment analysis is to rely on a singular LLM and makes the decision in a single round. This paper introduces a multi-LLM negotiation framework for sentiment analysis.
arXiv Detail & Related papers (2023-11-03T12:35:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.