Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
- URL: http://arxiv.org/abs/2512.10739v1
- Date: Thu, 11 Dec 2025 15:26:28 GMT
- Title: Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
- Authors: Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang, Zhongrui Cai, Fan Zheng, Tianyou Ma, Junhao Shen, Haiteng Zhao, Duanyang Zhang, Huilun Zhang, Kuikun Liu, Chengqi Lyu, Yanhui Duan, Chiyu Chen, Ningsheng Ma, Jianfei Gao, Han Lyu, Dahua Lin, Kai Chen,
- Abstract summary: We propose textbfOutcome-based textbfProcess textbfVerifier (OPV)<n>OPV verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification.<n>It achieves new state-of-the-art results on our held-out textscthisbench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3.
- Score: 65.02106674311908
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the \textbf{O}utcome-based \textbf{P}rocess \textbf{V}erifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out \textsc{\thisbench}, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic dataset, closely align with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2\% to 73.3\% on AIME2025 as the compute budget scales.
Related papers
- OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification [91.15649744496834]
We propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long chains of thought.<n>OPV achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3.
arXiv Detail & Related papers (2025-12-11T15:47:38Z) - A Preliminary Study on the Promises and Challenges of Native Top-$k$ Sparse Attention [33.03212783462742]
This report conducts a preliminary investigation into the effectiveness and theoretical mechanisms of the Top-$k$ Attention mechanism.<n>Experiments demonstrate that exact Top-$k$ Decoding achieves performance comparable to, or even surpassing, full attention on downstream tasks.<n>Considering the high computational complexity of exact Top-$k$ Attention, we investigate the impact of approximate Top-$k$ algorithm precision on downstream tasks.
arXiv Detail & Related papers (2025-12-03T06:44:02Z) - Efficiency vs. Fidelity: A Comparative Analysis of Diffusion Probabilistic Models and Flow Matching on Low-Resource Hardware [0.0]
Denoising Diffusion Probabilistic Models (DDPMs) have established a new state-of-the-art in generative image synthesis.<n>This study presents a comparative analysis of DDPMs against the emerging Flow Matching paradigm.
arXiv Detail & Related papers (2025-11-24T18:19:42Z) - From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization [62.07990937720985]
Dimension-level Reward Model (DRM) is a new supervision framework for Large Language Models.<n>DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions.<n> Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs and enhances their reasoning ability.
arXiv Detail & Related papers (2025-10-13T14:29:15Z) - MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems.<n>We build a benchmark consisting 1,260 samples of 42 challenging synthetic tasks.<n>We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z) - R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth? [63.51955244144878]
R-HORIZON is a method designed to stimulate long-horizon reasoning behaviors in Large Reasoning Models (LRMs)<n>Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons.<n>Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately.
arXiv Detail & Related papers (2025-10-09T13:16:22Z) - Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization [53.82400605816587]
Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation.<n>A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios.<n>We introduce Continual AQA (CAQA), which equips with Continual Learning capabilities to handle evolving distributions.
arXiv Detail & Related papers (2025-10-08T10:09:47Z) - Hierarchical Reasoning Model [16.223136644998203]
HRM executes sequential reasoning tasks in a single forward pass without explicit supervision of the intermediate process.<n>With only 27 million parameters, HRM achieves exceptional performance on complex reasoning tasks using only 1000 training samples.
arXiv Detail & Related papers (2025-06-26T19:39:54Z) - Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models [4.078176555898098]
We introduce and evaluate Token Constraint Decoding (TCD)<n>This simple yet effective inference-time algorithm enforces alignment between token-level predictions to enhance robustness in noisy settings.<n>Our findings establish TCD as a practical, model-agnostic approach for improving reasoning stability under real-world imperfections.
arXiv Detail & Related papers (2025-06-11T05:33:56Z) - The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity [16.266145641151375]
Large Reasoning Models generate detailed thinking processes before providing answers.<n>We show that LRMs face a complete accuracy collapse beyond certain complexities.<n>We also investigate the reasoning traces in more depth, studying the patterns of explored solutions.
arXiv Detail & Related papers (2025-06-07T22:42:29Z) - OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation [32.89904706698024]
OThink-R1 is a method that prunes redundant reasoning steps while preserving logical validity.<n> Experiments across mathematical and question-answering tasks demonstrate that OThink-R1 reduces reasoning redundancy by almost 23% on average.
arXiv Detail & Related papers (2025-06-03T03:31:30Z) - A Novel Generative Model with Causality Constraint for Mitigating Biases in Recommender Systems [20.672668625179526]
Latent confounding bias can obscure the true causal relationship between user feedback and item exposure.<n>We propose a novel generative framework called Latent Causality Constraints for Debiasing representation learning in Recommender Systems.
arXiv Detail & Related papers (2025-05-22T14:09:39Z) - RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning [60.84707424369494]
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models (LLMs) on complex tasks.<n>We introduce the Reasoning Boundary Framework++ (RBF++), a framework for evaluating and optimizing measurable boundaries of CoT capability.
arXiv Detail & Related papers (2025-05-19T16:25:55Z) - Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges [4.668749313973097]
This paper systematically evaluate Large Language Models (LLMs) and Large Reasoning Models (LRMs) across three levels of reasoning complexity.<n>We curate 26 challenges where models answer directly or by Python Code Interpreter.<n>LRMs show robust performance across tasks with various levels of difficulty, often competing or surpassing traditional first-principle-based methods.
arXiv Detail & Related papers (2025-05-16T18:32:35Z) - Is PRM Necessary? Problem-Solving RL Implicitly Induces PRM Capability in LLMs [26.49278448640309]
We conduct a systematic investigation of the relationship between RL training and PRM capabilities.<n>Our findings demonstrate that problem-solving proficiency and process supervision capabilities represent complementary dimensions of reasoning.<n>We propose Self-PRM, an introspective framework in which models autonomously evaluate and rerank their generated solutions.
arXiv Detail & Related papers (2025-05-16T13:23:26Z) - Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning.<n>We show that the widely used beam search method suffers from unacceptable over-optimism.<n>We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - SAIS: Supervising and Augmenting Intermediate Steps for Document-Level
Relation Extraction [51.27558374091491]
We propose to explicitly teach the model to capture relevant contexts and entity types by supervising and augmenting intermediate steps (SAIS) for relation extraction.
Based on a broad spectrum of carefully designed tasks, our proposed SAIS method not only extracts relations of better quality due to more effective supervision, but also retrieves the corresponding supporting evidence more accurately.
arXiv Detail & Related papers (2021-09-24T17:37:35Z) - Global Optimization of Objective Functions Represented by ReLU Networks [77.55969359556032]
Neural networks can learn complex, non- adversarial functions, and it is challenging to guarantee their correct behavior in safety-critical contexts.
Many approaches exist to find failures in networks (e.g., adversarial examples), but these cannot guarantee the absence of failures.
We propose an approach that integrates the optimization process into the verification procedure, achieving better performance than the naive approach.
arXiv Detail & Related papers (2020-10-07T08:19:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.