Related papers: Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance

URL: http://arxiv.org/abs/2510.27544v1
Date: Fri, 31 Oct 2025 15:17:55 GMT
Title: Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance
Authors: Nikolaus Holzer, William Fishell, Baishakhi Ray, Mark Santolucito
Abstract summary: Large Language Models (LLMs) are outpacing human performance on many tasks. We introduce TempoBench, the first formally grounded and verifiable diagnostic benchmark. We find that models score 65.6% on TCE-normal, and 7.5% on TCE-hard.
Score: 10.26577135499472
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large Language Models (LLMs) are increasingly excelling and outpacing human performance on many tasks. However, to improve LLM reasoning, researchers either rely on ad-hoc generated datasets or formal mathematical proof systems such as the Lean proof assistant. Whilst ad-hoc generated methods can capture the decision chains of real-world reasoning processes, they may encode some inadvertent bias in the space of reasoning they cover; they also cannot be formally verified. On the other hand, systems like Lean can guarantee verifiability, but are not well-suited to capture the nature of agentic decision chain-based tasks. This creates a gap both in performance for functions such as business agents or code assistants, and in the usefulness of LLM reasoning benchmarks, whereby these fall short in reasoning structure or real-world alignment. We introduce TempoBench, the first formally grounded and verifiable diagnostic benchmark that parametrizes difficulty to systematically analyze how LLMs perform reasoning. TempoBench uses two evaluation benchmarks to break down reasoning ability. First, temporal trace evaluation (TTE) tests the ability of an LLM to understand and simulate the execution of a given multi-step reasoning system. Subsequently, temporal causal evaluation (TCE) tests an LLM's ability to perform multi-step causal reasoning and to distill cause-and-effect relations from complex systems. We find that models score 65.6% on TCE-normal, and 7.5% on TCE-hard. This shows that state-of-the-art LLMs clearly understand the TCE task but perform poorly as system complexity increases. Our code is available at our GitHub repository: https://github.com/nik-hz/tempobench

Related papers

Enhancing Mathematical Problem Solving in LLMs through Execution-Driven Reasoning Augmentation [18.636244209466266]
Iteratively Improved Program Construction (IIPC) is a reasoning method that iteratively refines programmatic reasoning chains and combines execution feedback with the native Chain-of-thought abilities of the base LLM. IIPC surpasses competing approaches in the majority of reasoning benchmarks on multiple base LLMs.
arXiv Detail & Related papers (2026-02-03T19:13:31Z)
A State-Transition Framework for Efficient LLM Reasoning [58.18141262230392]
Long Chain-of-Thought (CoT) reasoning significantly improves Large Language Models (LLMs) performance on complex reasoning tasks. Existing studies usually enhance the reasoning efficiency of LLMs by compressing CoT sequences. We propose an efficient reasoning framework that models the reasoning process of LLMs as a state-transition process.
arXiv Detail & Related papers (2026-02-01T12:40:40Z)
Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank [71.09032766271493]
Large language models (LLMs) are prone to errors and hallucinations. How to check their outputs effectively and efficiently has become a critical problem in their applications.
arXiv Detail & Related papers (2025-10-28T11:01:10Z)
Can Past Experience Accelerate LLM Reasoning? [7.481959757090105]
Humans can perform tasks faster and better with increased experience and exposure. LLMs can generally reason faster with past experience, achieving up to a 56% reduction in compute cost.
arXiv Detail & Related papers (2025-05-27T02:44:00Z)
Reasoning LLMs are Wandering Solution Explorers [5.3795217858078805]
This paper formalizes what constitutes systematic problem solving and identifies common failure modes that reveal reasoning LLMs to be wanderers rather than systematic explorers. Our findings suggest that current models' performance can appear to be competent on simple tasks yet degrade sharply as complexity increases.
arXiv Detail & Related papers (2025-05-26T17:59:53Z)
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [49.61246073215651]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in OpenAI o1 and DeepSeek-R1 have further improved performance in System-2 reasoning domains. However, they also introduce significant computational overhead due to verbose and redundant outputs.
arXiv Detail & Related papers (2025-03-20T17:59:38Z)
From System 1 to System 2: A Survey of Reasoning Large Language Models [72.87412996793957]
Foundational Large Language Models excel at fast decision-making but lack depth for complex reasoning. OpenAI's o1/o3 and DeepSeek's R1 have demonstrated expert-level performance in fields such as mathematics and coding.
arXiv Detail & Related papers (2025-02-24T18:50:52Z)
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [48.28847964704554]
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks. We propose a novel approach for continuous-space reasoning that does not require modifying the LLM.
arXiv Detail & Related papers (2025-02-17T18:52:29Z)
CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models [5.409370027524351]
We evaluate the performance of large language models (LLMs) in counterfactual reasoning. We introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions.
arXiv Detail & Related papers (2025-02-16T06:19:37Z)
Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning. LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors. We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
Cofca: A Step-Wise Counterfactual Multi-hop QA benchmark [39.64489055580211]
We introduce a Step-wise Counterfactual benchmark (CofCA), a novel evaluation benchmark consisting of factual data and counterfactual data. Our experimental results reveal a significant performance gap between Wikipedia-based factual data and counterfactual data, suggesting data contamination issues in existing benchmarks.
arXiv Detail & Related papers (2024-02-19T08:12:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.