Related papers: ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction

ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction

URL: http://arxiv.org/abs/2511.12485v1
Date: Sun, 16 Nov 2025 07:37:09 GMT
Title: ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction
Authors: Pengze Li, Jiaqi Liu, Junchi Yu, Lihao Liu, Mingyu Ding, Wanli Ouyang, Shixiang Tang, Xi Chen,
Abstract summary: We introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT)<n>To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints.<n> Evaluations on 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none are yet able to extract a complete and standard reasoning chain.
Score: 70.53044880892196
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly used in scientific domains. While they can produce reasoning-like content via methods such as chain-of-thought prompting, these outputs are typically unstructured and informal, obscuring whether models truly understand the fundamental reasoning paradigms that underpin scientific inference. To address this, we introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT). In RLT, all reasoning steps are explicitly categorized as one of three variants of Peirce's fundamental inference modes: deduction, induction, or abduction. To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints. We propose two logic-aware evaluation metrics: Entity Coverage (EC) for content completeness and Reasoning Edge Accuracy (REA) for step-by-step logical validity. Evaluations on 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none are yet able to extract a complete and standard reasoning chain. These findings highlight a substantial gap between the abilities of current reasoning models and the rigor required for scientific argumentation.

Related papers

Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models [2.172419551358714]
Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret.<n>We introduce Theorem-of-Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents.<n> Experiments on symbolic (WebOfLies) and numerical (MultiArithm) reasoning benchmarks show that ToTh consistently outperforms CoT, Self-Consistency, and CoT-Decoding.
arXiv Detail & Related papers (2025-06-08T12:28:38Z)
CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection [60.98964268961243]
We propose that guiding models to perform a systematic and comprehensive reasoning process allows models to execute much finer-grained and accurate entailment decisions.<n>We define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection.
arXiv Detail & Related papers (2025-06-05T17:02:52Z)
Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment [54.62926010621013]
We introduce a novel task, code reasoning, to provide a new perspective for the reasoning abilities of large language models.<n>We summarize three meta-benchmarks based on established forms of logical reasoning, and instantiate these into eight specific benchmark tasks.<n>We present a new pathway exploration pipeline inspired by human intricate problem-solving methods.
arXiv Detail & Related papers (2025-02-17T10:39:58Z)
Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying [0.3659498819753633]
State-of-the-art Large Language models (LLMs) continue to struggle when performing logical and mathematical reasoning.<n>This paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin's model of argumentation.<n>We show that employing these critical questions can improve the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-19T18:51:30Z)
Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning [74.90592233107712]
We propose a Direct-Indirect Reasoning (DIR) method, which considers Direct Reasoning (DR) and Indirect Reasoning (IR) as multiple parallel reasoning paths that are merged to derive the final answer.<n>Our DIR method is simple yet effective and can be straightforwardly integrated with existing variants of CoT methods.
arXiv Detail & Related papers (2024-02-06T03:41:12Z)
DetermLR: Augmenting LLM-based Logical Reasoning from Indeterminacy to Determinacy [76.58614128865652]
We propose DetermLR, a novel perspective that rethinks the reasoning process as an evolution from indeterminacy to determinacy. First, we categorize known conditions into two types: determinate and indeterminate premises This provides an oveall direction for the reasoning process and guides LLMs in converting indeterminate data into progressively determinate insights. We automate the storage and extraction of available premises and reasoning paths with reasoning memory, preserving historical reasoning details for subsequent reasoning steps.
arXiv Detail & Related papers (2023-10-28T10:05:51Z)
Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems. LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning. We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning technique to assess the performance of model.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)
Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought [10.524051272257614]
Large language models (LLMs) have shown remarkable reasoning capabilities given chain-of-thought prompts. We present a new synthetic question-answering dataset called PrOntoQA, where each example is generated as a synthetic world model. This allows us to parse the generated chain-of-thought into symbolic proofs for formal analysis.
arXiv Detail & Related papers (2022-10-03T21:34:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.