LongReasonArena: A Long Reasoning Benchmark for Large Language Models
- URL: http://arxiv.org/abs/2508.19363v1
- Date: Tue, 26 Aug 2025 18:41:53 GMT
- Title: LongReasonArena: A Long Reasoning Benchmark for Large Language Models
- Authors: Jiayu Ding, Shuming Ma, Lei Cui, Nanning Zheng, Furu Wei
- Abstract summary: LongReasonArena is a benchmark designed to assess the long reasoning capabilities of Large Language Models (LLMs). By controlling the inputs, the required reasoning length can be arbitrarily scaled, reaching up to 1 million tokens of reasoning for the most challenging tasks.
- Score: 94.58086553970101
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark specifically designed to assess the long reasoning capabilities of LLMs. Our tasks require models to solve problems by executing multi-step algorithms that reflect key aspects of long reasoning, such as retrieval and backtracking. By controlling the inputs, the required reasoning length can be arbitrarily scaled, reaching up to 1 million tokens of reasoning for the most challenging tasks. Extensive evaluation results demonstrate that LongReasonArena presents a significant challenge for both open-source and proprietary LLMs. For instance, Deepseek-R1 achieves only 7.5% accuracy on our task. Further analysis also reveals that the accuracy exhibits a linear decline with respect to the logarithm of the expected number of reasoning steps. Our code and data are available at https://github.com/LongReasonArena/LongReasonArena.
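To make the task design concrete, below is a minimal hypothetical sketch, not the official LongReasonArena generator (the real tasks live in the repository linked above; `make_traversal_task` and all numbers are invented for illustration). It builds a pointer-chasing task in the spirit of the retrieval-centric description, where the expected number of reasoning steps grows with the input size, and then fits the reported accuracy-versus-log-steps trend on made-up data points.

```python
# Hypothetical sketch only; names and numbers are invented for illustration.
import random
import numpy as np

def make_traversal_task(n_nodes: int, seed: int = 0):
    """Build a pointer-chasing task: answering requires roughly n_nodes
    retrieval steps, so expected reasoning length scales with input size."""
    rng = random.Random(seed)
    nodes = list(range(n_nodes))
    rng.shuffle(nodes)                      # hidden chain order
    edges = list(zip(nodes, nodes[1:]))
    rng.shuffle(edges)                      # shuffled edges force one retrieval per hop
    prompt = "\n".join(f"node {a} points to node {b}" for a, b in edges)
    question = (f"Starting from node {nodes[0]}, "
                f"which node is reached after {n_nodes - 1} hops?")
    return prompt, question, nodes[-1]      # prompt, question, gold answer

# Toy check of the reported trend: accuracy falls roughly linearly in
# log(expected reasoning steps), i.e. acc ~ a - b * log(steps).
steps = np.array([1e2, 1e3, 1e4, 1e5])                 # made-up step counts
acc = np.array([0.80, 0.55, 0.30, 0.08])               # made-up accuracies
slope, intercept = np.polyfit(np.log(steps), acc, 1)   # slope expected < 0
```

Scaling `n_nodes` scales the required reasoning length, mirroring how the benchmark controls difficulty purely through its inputs.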
Related papers
- LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards [57.993003392037174]
LongR is a framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism. LongR consistently improves performance across diverse RL algorithms.
arXiv Detail & Related papers (2026-02-05T15:26:47Z)
- R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth? [63.51955244144878]
R-HORIZON is a method designed to stimulate long-horizon reasoning behaviors in Large Reasoning Models (LRMs). Based on R-HORIZON, we construct a long-horizon reasoning benchmark comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately.
arXiv Detail & Related papers (2025-10-09T13:16:22Z)
- Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking [50.97239453902612]
Large Reasoning Models (LRMs) have achieved impressive performance on challenging tasks, yet their deep reasoning often incurs substantial computational costs. Inspired by Evidence Accumulation Models, we find that LRMs accumulate sufficient information early in reasoning, making further reasoning steps redundant. We propose Just-Enough Thinking (JET), which trains models to proactively terminate unnecessary reasoning.
arXiv Detail & Related papers (2025-09-27T16:25:06Z)
- AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models [56.063571989395946]
Reasoning-capable large language models (LLMs) demonstrate strong performance on complex reasoning tasks, but often at high inference cost. Recent approaches attempt to address this challenge by manually deciding when to apply long or short reasoning. We propose Auto Long-Short Reasoning (AutoL2S), a dynamic and model-agnostic framework that enables LLMs to compress their generated reasoning paths.
arXiv Detail & Related papers (2025-05-28T17:59:53Z)
- 100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability? [28.694112253150983]
Real-task-based long-context evaluation benchmarks have two major shortcomings. Benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability. We introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities.
arXiv Detail & Related papers (2025-05-25T19:58:31Z)
- Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs [52.405085773954596]
We find that large language models (LLMs) tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones. This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately. Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy.
arXiv Detail & Related papers (2025-04-30T18:48:06Z)
- SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios [33.72114830484246]
We introduce SCoRE (Scenario-based Commonsense Reasoning Evaluation), a benchmark that synthesizes multi-hop questions from scenario schemas of entities, relations, and logical rules. SCoRE contains 100k bilingual (Chinese-English) multiple-choice questions whose reasoning chains span 2-11 hops and are grouped into various difficulty levels.
arXiv Detail & Related papers (2025-03-08T13:40:10Z)
- LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion [20.293369733522983]
LongReason is a synthetic benchmark for evaluating the long-context reasoning capabilities of large language models. LongReason consists of 794 multiple-choice reasoning questions with diverse reasoning patterns across three task categories. We evaluate 21 LLMs on LongReason, revealing that most models experience significant performance drops as context length increases.
arXiv Detail & Related papers (2025-01-25T05:32:14Z)
- O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [98.3430004984531]
We propose Length-Harmonizing Fine-Tuning (O1-Pruner) to minimize reasoning overhead while maintaining accuracy. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner.
arXiv Detail & Related papers (2025-01-22T01:35:11Z)
- Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models [48.35385912526338]
This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs). We isolate the effect of input length using multiple versions of the same sample, each extended with padding of different lengths, types, and locations (see the illustrative sketch after this list). We show that the degradation trend appears in every version of our dataset, although at different intensities.
arXiv Detail & Related papers (2024-02-19T16:04:53Z)
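As a rough illustration of the padding methodology in "Same Task, More Tokens" above, here is a hypothetical sketch (not the paper's released code; `pad_sample`, the filler text, and the length values are invented). It produces several versions of one sample that differ only in padding length and location, isolating input length as the variable under test.

```python
# Hypothetical illustration of length-controlled padding; all names and
# values are invented. Padding *type* could also be varied, e.g., by
# reusing the sample's own text instead of unrelated filler.
import random

def pad_sample(question: str, filler: str, n_words: int, location: str) -> str:
    """Return the question embedded in n_words of filler text."""
    base = filler.split()
    words = (base * (n_words // max(len(base), 1) + 1))[:n_words]
    if location == "before":
        return " ".join(words) + "\n\n" + question
    if location == "after":
        return question + "\n\n" + " ".join(words)
    # "around": split the padding on both sides of the question
    half = len(words) // 2
    return " ".join(words[:half]) + "\n\n" + question + "\n\n" + " ".join(words[half:])

# Nine variants of the same sample: 3 padding lengths x 3 locations.
variants = [
    pad_sample("If x + 3 = 7, what is x?",
               "The sky was clear that morning.",
               n_words=n, location=loc)
    for n in (500, 2000, 8000)          # word counts as a length proxy
    for loc in ("before", "after", "around")
]
```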
This list is automatically generated from the titles and abstracts of the papers on this site.