Your thoughts tell who you are: Characterize the reasoning patterns of LRMs
- URL: http://arxiv.org/abs/2509.24147v1
- Date: Mon, 29 Sep 2025 00:52:07 GMT
- Title: Your thoughts tell who you are: Characterize the reasoning patterns of LRMs
- Authors: Yida Chen, Yuning Mao, Xianjun Yang, Suyu Ge, Shengjie Bi, Lijuan Liu, Saghar Hosseini, Liang Tan, Yixin Nie, Shaoliang Nie,
- Abstract summary: We use a generative language model to compare reasoning traces from two LRMs and articulate their distinctive features in words. Iterating this process over a dataset of reasoning traces yields a human-language taxonomy that characterizes how models think. LOT identifies systematic differences in their thoughts, achieving 80-100% accuracy in distinguishing reasoning traces from LRMs that differ in scale, base model family, or objective domain.
- Score: 31.313418571838152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current comparisons of large reasoning models (LRMs) focus on macro-level statistics such as task accuracy or reasoning length. Whether different LRMs reason differently remains an open question. To address this gap, we introduce the LLM-proposed Open Taxonomy (LOT), a classification method that uses a generative language model to compare reasoning traces from two LRMs and articulate their distinctive features in words. LOT then models how these features predict the source LRM of a reasoning trace based on their empirical distributions across LRM outputs. Iterating this process over a dataset of reasoning traces yields a human-readable taxonomy that characterizes how models think. We apply LOT to compare the reasoning of 12 open-source LRMs on tasks in math, science, and coding. LOT identifies systematic differences in their thoughts, achieving 80-100% accuracy in distinguishing reasoning traces from LRMs that differ in scale, base model family, or objective domain. Beyond classification, LOT's natural-language taxonomy provides qualitative explanations of how LRMs think differently. Finally, in a case study, we link the reasoning differences to performance: aligning the reasoning style of smaller Qwen3 models with that of the largest Qwen3 during test time improves their accuracy on GPQA by 3.3-5.7%.
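The abstract says LOT models how features predict the source LRM from their empirical distributions across LRM outputs, but it does not say which classifier is used. As a rough illustration only, the sketch below assumes a Bernoulli naive Bayes model with Laplace smoothing over binary "feature present" annotations. The feature names, the model names `model_a` and `model_b`, the toy annotations, and the functions `fit` and `predict` are all hypothetical; in LOT the features would be proposed and annotated by a generative language model rather than hardcoded.

```python
import math
from collections import defaultdict

# Hypothetical taxonomy features; in LOT these would be proposed by an LLM.
FEATURES = ["verifies_answer", "enumerates_cases", "backtracks"]

# Hypothetical annotations: (source model, features observed in one trace).
TRACES = [
    ("model_a", {"verifies_answer", "backtracks"}),
    ("model_a", {"verifies_answer"}),
    ("model_a", {"verifies_answer", "enumerates_cases"}),
    ("model_b", {"enumerates_cases"}),
    ("model_b", {"enumerates_cases", "backtracks"}),
    ("model_b", set()),
]


def fit(traces, features, alpha=1.0):
    """Estimate P(feature present | model) with Laplace smoothing (assumed)."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for model, present in traces:
        totals[model] += 1
        for f in present:
            counts[model][f] += 1
    probs = {
        m: {f: (counts[m][f] + alpha) / (totals[m] + 2 * alpha) for f in features}
        for m in totals
    }
    n = sum(totals.values())
    priors = {m: totals[m] / n for m in totals}
    return priors, probs


def predict(present, priors, probs):
    """Return the model with the highest log-posterior for one trace."""
    scores = {}
    for m, prior in priors.items():
        score = math.log(prior)
        for f, p in probs[m].items():
            # Bernoulli likelihood: p if the feature is present, 1 - p otherwise.
            score += math.log(p if f in present else 1.0 - p)
        scores[m] = score
    return max(scores, key=scores.get)


priors, probs = fit(TRACES, FEATURES)
print(predict({"verifies_answer"}, priors, probs))
print(predict({"enumerates_cases"}, priors, probs))
```

Under these assumptions, a feature that appears at similar rates in both models' traces (here `backtracks`) contributes nothing to the decision, while features with skewed rates drive the prediction, which is one way a taxonomy can double as an explanation of how the two models differ.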
Related papers
- Step-Tagging: Toward controlling the generation of Language Reasoning Models through step monitoring [5.190961793309368]
A growing body of studies show that Language Reasoning Models (LRMs) are still inefficient, over-generating verification and reflection steps.
We introduce the Step-Tagging framework, a lightweight sentence-classifier enabling real-time annotation of the type of reasoning steps that an LRM is generating.
Online monitoring of the count of specific steps can produce effective interpretable early stopping criteria of LRM inferences.
arXiv Detail & Related papers (2025-12-16T12:01:16Z)
- ReJump: A Tree-Jump Representation for Analyzing and Improving LLM Reasoning [29.544265034647434]
ReJump represents a reasoning trace as a visitation order over nodes in a tree of intermediate problem-solving steps.
We evaluate state-of-the-art Large Language Models (LRMs) on two tasks and find that models with similar accuracy can exhibit distinct reasoning behaviors.
To further understand how learning strategies shape reasoning, we use ReJump to compare distilled LRMs with their teachers, CoT-prompted LLMs with LRMs, and to examine how the number of reasoning examples and reinforcement learning affect reasoning behavior.
arXiv Detail & Related papers (2025-11-30T10:39:53Z)
- ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction [70.53044880892196]
We introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT).
To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints.
Evaluations on 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none are yet able to extract a complete and standard reasoning chain.
arXiv Detail & Related papers (2025-11-16T07:37:09Z)
- Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank [71.09032766271493]
Large language models (LLMs) are prone to errors and hallucinations.
How to check their outputs effectively and efficiently has become a critical problem in their applications.
arXiv Detail & Related papers (2025-10-28T11:01:10Z)
- A Study on Thinking Patterns of Large Reasoning Models in Code Generation [14.138043269602074]
Large language models (LLMs) are utilized for software engineering tasks such as code generation.
This paper presents a comprehensive study aimed at investigating and uncovering the reasoning behavior of LRMs during code generation.
We derive a taxonomy of LRM reasoning behaviors, encompassing 15 reasoning actions across four phases.
arXiv Detail & Related papers (2025-09-17T07:13:12Z)
- FairReason: Balancing Reasoning and Social Bias in MLLMs [54.26091556079722]
Multimodal Large Language Models (MLLMs) already achieve state-of-the-art results across a wide range of tasks and modalities.
Recent studies explore advanced prompting schemes and post-training fine-tuning to push their reasoning ability further.
arXiv Detail & Related papers (2025-07-30T19:57:22Z)
- Towards Evaluting Fake Reasoning Bias in Language Models [47.482898076525494]
We show that models favor the surface structure of reasoning even when the logic is flawed.
We introduce THEATER, a benchmark that systematically investigates Fake Reasoning Bias (FRB).
We evaluate 17 advanced Large Language Models (LRMs) on both subjective DPO and factual datasets.
arXiv Detail & Related papers (2025-07-18T09:06:10Z)
- What makes Reasoning Models Different? Follow the Reasoning Leader for Efficient Decoding [84.42056293290015]
We analyze the token-level misalignment between reasoning and non-reasoning models.
Motivated by the Local Misalignment Diminish, we propose FoReaL-Decoding.
On four popular math-reasoning benchmarks, FoReaL-Decoding reduces theoretical FLOPs by 30 to 50% and trims CoT length by up to 40%.
arXiv Detail & Related papers (2025-06-08T05:08:32Z)
- Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns [79.42805969325036]
Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks.
PRMs are required to identify errors under various reasoning patterns during the reasoning process.
Existing benchmarks mainly focus on evaluating PRMs with stepwise correctness.
We introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns.
arXiv Detail & Related papers (2025-05-29T14:26:53Z)
- Generalizable Process Reward Models via Formally Verified Training Data [13.781401358802462]
FoVer is an approach to synthesize PRM training data with accurate step-level error labels automatically annotated by formal verification tools.
Experiments show that PRMs trained with FoVer exhibit cross-task generalization, enabling a single PRM to effectively perform verification across diverse reasoning tasks.
arXiv Detail & Related papers (2025-05-21T19:23:45Z)
- Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models [33.547353090281284]
We propose a novel reward model approach called the Hierarchical Reward Model.
It evaluates both individual and consecutive reasoning steps at both fine-grained and coarse-grained levels.
It excels at assessing multi-step reasoning coherence, especially when flawed steps are later corrected through self-reflection.
arXiv Detail & Related papers (2025-03-16T15:18:40Z)
- P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains [97.25943550933829]
We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains.
We use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities.
arXiv Detail & Related papers (2024-10-11T19:22:57Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)