Related papers: Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces

URL: http://arxiv.org/abs/2602.14404v1
Date: Mon, 16 Feb 2026 02:20:37 GMT
Title: Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces
Authors: William L. Tong, Ege Cakar, Cengiz Pehlevan
Abstract summary: We use PITA, a dataset of over 23 million statements in propositional logic and their corresponding proofs. We find that RT models generalize well on broad and shallow subsets, while deteriorating on narrow and deep subsets relative to non-RT baselines. Our resulting theory suggests fundamental scalings that limit how well RT models perform on deep tasks, and highlights their generalization strengths on broad tasks.
Score: 31.37944377681284
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent years have witnessed meteoric progress in reasoning models: neural networks that generate intermediate reasoning traces (RTs) before producing a final output. Despite the rapid advancement, our understanding of how RTs support reasoning, and of the limits of this paradigm, remains incomplete. To promote greater clarity, we introduce PITA: a novel large-scale dataset of over 23 million statements in propositional logic and their corresponding proofs. As a benchmark for robust reasoning, we focus on length generalization: if a model is trained to determine truth or falsity on statements with proofs up to a fixed length, how well does it generalize to statements requiring longer proofs? We propose notions of (1) task depth and (2) task breadth, which measure respectively (1) the number of steps required to solve an example from a task and (2) the number of unique examples across a task. We vary these quantities across subsets of PITA, and find that RT models generalize well on broad and shallow subsets, while deteriorating on narrow and deep subsets relative to non-RT baselines. To determine whether our results are idiosyncratic to PITA or indicative of general phenomena, we compare our results to a simple synthetic task based on syllogisms. Our resulting theory suggests fundamental scalings that limit how well RT models perform on deep tasks, and highlights their generalization strengths on broad tasks. Our findings overall identify fundamental benefits and limitations inherent in using reasoning traces.

Related papers

On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks [56.98385132295952]
We evaluate how well chain-of-thought approaches generalize on a simple planning task. We find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization. Purely text-based models consistently outperform those utilizing image-based inputs.
arXiv Detail & Related papers (2026-02-17T09:51:40Z)
Structured Reasoning for Large Language Models [59.215789462977206]
We propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. SCR substantially improves reasoning efficiency and self-verification. Compared with existing reasoning paradigms, it reduces output token length by up to 50%.
arXiv Detail & Related papers (2026-01-12T04:04:01Z)
ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction [70.53044880892196]
We introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT). To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints. Evaluations on 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none are yet able to extract a complete and standard reasoning chain.
arXiv Detail & Related papers (2025-11-16T07:37:09Z)
Consistency Is Not Always Correct: Towards Understanding the Role of Exploration in Post-Training Reasoning [75.79451512757844]
Foundation models exhibit broad knowledge but limited task-specific reasoning. This motivates post-training strategies such as RLVR and inference scaling. We show that RLVR induces a squeezing effect, reducing reasoning entropy and forgetting some correct paths.
arXiv Detail & Related papers (2025-11-10T18:25:26Z)
The Challenge of Teaching Reasoning to LLMs Without RL or Distillation [31.973226821366325]
Reasoning-capable language models achieve state-of-the-art performance in diverse complex tasks by generating long, explicit Chain-of-Thought traces. We ask whether long CoT can be induced in a base model using only prompting or minimal tuning. The resulting model outperforms the much larger Qwen2.5-Math-72B-Instruct, showing that a handful of high-quality examples can unlock strong reasoning capabilities.
arXiv Detail & Related papers (2025-07-14T01:14:50Z)
NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks [65.70224757972068]
We select reasoning traces from a strong teacher model based on a large pool of questions from NaturalReasoning. We find that simply scaling up data size with random sampling is a strong baseline with steady performance gains. We find that selecting difficult examples that require more diverse reasoning strategies is a more sample-efficient way to transfer the teacher model's reasoning skills.
arXiv Detail & Related papers (2025-07-02T17:30:24Z)
Are Large Language Models Capable of Deep Relational Reasoning? Insights from DeepSeek-R1 and Benchmark Comparisons [11.429641860623143]
We evaluate and compare the reasoning capabilities of three cutting-edge Large Language Models (LLMs). DeepSeek-R1 consistently achieves the highest F1-scores across multiple tasks and problem sizes. A detailed analysis of DeepSeek-R1's long Chain-of-Thought responses uncovers its unique planning and verification strategies.
arXiv Detail & Related papers (2025-06-29T07:37:49Z)
ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning [1.0416697066889342]
We propose a simple yet effective reinforcement learning method that enables reasoning models to learn their own optimal CoT lengths without manual supervision. ShorterBetter achieves 50%-80% reduction in output lengths in both in-domain and out-of-domain reasoning tasks. Our reasoning trace analysis shows that ShorterBetter refines the structure of the reasoning traces by reducing unnecessary repetition, excessive self-verification, and over-exploration of alternatives.
arXiv Detail & Related papers (2025-04-30T07:04:19Z)
Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length? [72.70486097967124]
We formalize a framework using deterministic finite automata (DFAs). We show that there exists an optimal amount of reasoning tokens such that the probability of producing a correct solution is maximized. We then demonstrate an implication of these findings: being able to predict the optimal number of reasoning tokens for new problems and filtering out non-optimal length answers results in consistent accuracy improvements.
arXiv Detail & Related papers (2025-04-02T17:45:58Z)
When More is Less: Understanding Chain-of-Thought Length in LLMs [51.631483479081645]
Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. Longer CoTs are often presumed superior; this paper argues that longer is not always better.
arXiv Detail & Related papers (2025-02-11T05:28:59Z)
Distilling Reasoning Ability from Large Language Models with Adaptive Thinking [54.047761094420174]
Chain of thought finetuning (cot-finetuning) aims to endow small language models (SLM) with reasoning ability to improve their performance towards specific tasks. Most existing cot-finetuning methods adopt a pre-thinking mechanism, allowing the SLM to generate a rationale before providing an answer. This mechanism enables SLM to analyze and think about complex questions, but it also makes answer correctness highly sensitive to minor errors in rationale. We propose a robust post-thinking mechanism to generate answers before rationale.
arXiv Detail & Related papers (2024-04-14T07:19:27Z)
The Impact of Reasoning Step Length on Large Language Models [40.546685248243534]
Chain of Thought (CoT) is significant in improving the reasoning abilities of large language models. We investigate the correlation between the effectiveness of CoT and the length of reasoning steps in prompts.
arXiv Detail & Related papers (2024-01-10T04:37:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.