Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
- URL: http://arxiv.org/abs/2508.07629v2
- Date: Tue, 12 Aug 2025 07:59:00 GMT
- Title: Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
- Authors: Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
- Abstract summary: We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving. This report provides an in-depth analysis of the reasoning model, covering the entire post-training workflow.
- Score: 17.149024413701014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although there are already many excellent works related to inference models in the current community, there are still many problems with reproducing high-performance inference models due to incomplete disclosure of training details. This report provides an in-depth analysis of the reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: Clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO) that gently backpropagates gradients from clipped tokens. GPPO not only enhances the model's exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.
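The abstract describes GPPO only at a high level. As a rough illustration of what "gently backpropagating gradients from clipped tokens" could mean, the sketch below compares the per-token gradient of a standard PPO-style clipped surrogate with a stop-gradient variant in which a clipped token keeps its clipped forward value but still passes a bounded gradient. The function names, the symmetric clip range `eps`, and the specific stop-gradient form are assumptions made for illustration; the abstract does not state the authors' exact formulation.

```python
# Illustrative sketch only: per-token gradients w.r.t. the log-probability
# under (a) a PPO-style clipped surrogate and (b) an assumed
# gradient-preserving variant. Not the authors' implementation.

def is_clipped(ratio: float, advantage: float, eps: float) -> bool:
    """True when the clipped branch of min(r*A, clip(r)*A) is the active one."""
    too_high = ratio > 1.0 + eps and advantage > 0.0
    too_low = ratio < 1.0 - eps and advantage < 0.0
    return too_high or too_low


def ppo_clip_grad(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """d/d(logp) of min(r*A, clip(r, 1-eps, 1+eps)*A), using dr/d(logp) = r."""
    if is_clipped(ratio, advantage, eps):
        return 0.0  # the clipped term is constant in logp, so no gradient
    return ratio * advantage


def gradient_preserving_grad(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Assumed variant: replace clip(r) by bound * r / stop_gradient(r).

    The forward value still equals the bound, but the derivative w.r.t.
    logp becomes bound * A instead of zero.
    """
    if is_clipped(ratio, advantage, eps):
        bound = 1.0 + eps if advantage > 0.0 else 1.0 - eps
        return bound * advantage
    return ratio * advantage
```

Under these assumptions, a token with a ratio of 1.5 and a positive advantage contributes no gradient in the first function, while the second still passes a gradient whose size is capped by the clip bound. Inside the clip range the two functions agree.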
Related papers
- Filtering with Confidence: When Data Augmentation Meets Conformal Prediction [10.431187957711206]
We propose a principled data filtering framework for synthetic data augmentation. We show consistent performance improvements of up to 40% in F1 score over unaugmented baselines.
arXiv Detail & Related papers (2025-09-25T19:42:02Z)
- Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward [54.708851958671794]
We propose a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty. During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential.
arXiv Detail & Related papers (2025-09-01T10:04:20Z)
- InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities [27.09178257629886]
InfiAlign is a scalable and sample-efficient post-training framework for large language models (LLMs). At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning. Our results highlight the effectiveness of combining principled data selection with full-stage post-training.
arXiv Detail & Related papers (2025-08-07T15:34:06Z)
- LearnAlign: Reasoning Data Selection for Reinforcement Learning in Large Language Models Based on Improved Gradient Alignment [14.655048266761783]
Reinforcement learning (RL) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. We present LearnAlign, which intelligently selects the learnable and representative training reasoning data for RL post-training. Experiments across three mathematical reasoning benchmarks demonstrate that our method significantly reduces training data requirements.
arXiv Detail & Related papers (2025-06-13T06:05:58Z)
- Dissecting Long-Chain-of-Thought Reasoning Models: An Empirical Study [91.78803511141975]
This work focuses on the roles of positive and negative samples in scaling reinforcement learning. We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. We investigate unstable performance across various reasoning models and benchmarks, attributing instability to uncertain problems with ambiguous outcomes.
arXiv Detail & Related papers (2025-06-05T11:47:10Z)
- Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z)
- OpenCodeReasoning: Advancing Data Distillation for Competitive Coding [61.15402517835137]
We build a supervised fine-tuning (SFT) dataset to achieve state-of-the-art coding capability results in models of various sizes. Our models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning.
arXiv Detail & Related papers (2025-04-02T17:50:31Z)
- Crafting Efficient Fine-Tuning Strategies for Large Language Models [2.633490094119608]
Fine-tuning large language models (LLMs) with as few as 200 samples can improve model accuracy from 70% to 88% in a product attribute extraction task.
A Bayesian hyperparameter optimization method, which evaluates models at 20% of total training time, correlates strongly with final model performance.
This approach led to a 2% improvement in accuracy over baseline models when evaluated on an independent test set.
arXiv Detail & Related papers (2024-07-18T21:36:00Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performance of a supervised LLM.
We find that rejection samples from multiple models push LLaMA-7B to an accuracy of 49.3% on GSM8K, which significantly outperforms the supervised fine-tuning (SFT) accuracy of 35.9%.
arXiv Detail & Related papers (2023-08-03T15:34:01Z)
- Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.