RPO: Reinforcement Fine-Tuning with Partial Reasoning Optimization
- URL: http://arxiv.org/abs/2601.19404v2
- Date: Fri, 30 Jan 2026 08:18:54 GMT
- Title: RPO: Reinforcement Fine-Tuning with Partial Reasoning Optimization
- Authors: Hongzhu Yi, Xinming Wang, Zhenghao Zhang, Tianyu Zong, Yuanxiang Wang, Jun Xie, Tao Yu, Haopeng Jin, Kaixin Xu, Feng Chen, Jiahuan Chen, Yujia Yang, Zhenyu Guan, Bingkang Shi, Jungang Xu
- Abstract summary: We propose Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO), a plug-and-play reinforcement fine-tuning algorithm. RPO trains the model by generating suffixes of the reasoning path using an experience cache. Compared with full-path reinforcement fine-tuning algorithms, RPO reduces the training time of the 1.5B model by 90% and of the 7B model by 72%.
- Score: 28.66426135031355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Within the domain of large language models, reinforcement fine-tuning algorithms necessitate the generation of a complete reasoning trajectory beginning from the input query, which incurs significant computational overhead during the rollout phase of training. To address this issue, we analyze the impact of different segments of the reasoning path on the correctness of the final result and, based on these insights, propose Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO), a plug-and-play reinforcement fine-tuning algorithm. Unlike traditional reinforcement fine-tuning algorithms that generate full reasoning paths, RPO trains the model by generating suffixes of the reasoning path using an experience cache. RPO reduces token generation during the rollout phase by approximately 95%, greatly lowering the theoretical time overhead. Compared with full-path reinforcement fine-tuning algorithms, RPO reduces the training time of the 1.5B model by 90% and of the 7B model by 72%. At the same time, it can be integrated with typical algorithms such as GRPO and DAPO, enabling them to achieve training acceleration while maintaining performance comparable to the original algorithms. Our code is open-sourced at https://github.com/yhz5613813/RPO.
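As a rough illustration of the partial-rollout idea described in the abstract, a suffix-only rollout from a cached reasoning path might look like the sketch below. The cache interface, the `model_generate` stand-in, and the random truncation scheme are assumptions made for the example, not the authors' implementation.

```python
# Minimal sketch of RPO-style partial rollouts (illustrative assumptions only).
import random

def model_generate(query, prefix=()):
    # Placeholder for a real LLM decoding call; returns a fake token suffix.
    return [f"tok{i}" for i in range(len(prefix), len(prefix) + 4)]

def rpo_rollouts(query, experience_cache, num_samples=8):
    """Generate reasoning-path suffixes from a cached prefix instead of full paths."""
    cached = experience_cache.get(query)
    if cached is None:
        # Cold start: one full rollout seeds the cache.
        cached = model_generate(query)
        experience_cache[query] = cached

    completions = []
    for _ in range(num_samples):
        # Truncate the cached path at a random point and regenerate only the
        # suffix, so most tokens are reused rather than re-decoded.
        cut = random.randint(1, max(1, len(cached) - 1))
        prefix = cached[:cut]
        suffix = model_generate(query, prefix=prefix)
        completions.append(prefix + suffix)
    return completions

if __name__ == "__main__":
    cache = {}
    print(rpo_rollouts("What is 2 + 2?", cache, num_samples=2))
```

Because only the suffix tokens are re-decoded, per-rollout token generation shrinks roughly in proportion to how much of the cached prefix is reused, which is where the claimed rollout savings come from.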
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
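A hedged sketch of a window-based greedy packing step in the spirit of the Dense Prompt Packing idea above; the function name, the token-budget interface, and the packing rule are assumptions, not taken from the paper.

```python
# Greedily pack prompts that survive pruning into fixed-size token windows
# so each training batch stays dense (assumed, simplified interface).
def pack_prompts(prompt_lengths, window_size):
    """Pack prompt lengths into windows holding at most `window_size` tokens."""
    windows, current, used = [], [], 0
    for idx, length in enumerate(prompt_lengths):
        if used + length > window_size and current:
            # Current window is full; start a new one.
            windows.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        windows.append(current)
    return windows

# Example: five prompts packed into 512-token windows -> [[0, 1, 2], [3, 4]].
print(pack_prompts([120, 300, 80, 450, 60], window_size=512))
```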
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - Learning to Reason as Action Abstractions with Scalable Mid-Training RL [55.24192942739207]
An effective mid-training phase should identify a compact set of useful actions and enable fast selection. We propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm.
arXiv Detail & Related papers (2025-09-30T05:34:20Z) - Reshaping the Forward-Forward Algorithm with a Similarity-Based Objective [1.0064374190752632]
The Forward-Forward algorithm is proposed as a more biologically plausible method that replaces the backward pass with an additional forward pass. In this work, the Forward-Forward algorithm is reshaped through its integration with similarity learning frameworks, eliminating the need for multiple forward passes during inference. Empirical evaluations on MNIST, Fashion-MNIST, and CIFAR-10 datasets indicate that FAUST substantially improves accuracy, narrowing the gap with backpropagation.
arXiv Detail & Related papers (2025-08-29T10:23:03Z) - TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling [65.46347858249295]
TreePO is a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity.
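A simplified, assumed sketch of tree-structured rollouts in the spirit of TreePO; the `generate_fn` interface and the fixed depth/branching schedule are illustrative choices, not the paper's algorithm. The point is that branches sharing a prefix reuse it instead of re-decoding it.

```python
# Expand a prefix tree of completions; each leaf is one full rollout.
def tree_rollout(generate_fn, query, depth=2, branching=2):
    paths = [[]]  # start from the empty prefix
    for _ in range(depth):
        new_paths = []
        for prefix in paths:
            # The shared prefix is extended `branching` ways; in a real system
            # the KV-cache for `prefix` would be reused across branches.
            for b in range(branching):
                segment = generate_fn(query, prefix, b)
                new_paths.append(prefix + segment)
        paths = new_paths
    return paths

# Toy generator: appends a labeled segment so the tree structure is visible.
toy_gen = lambda q, prefix, b: [f"seg{len(prefix)}-{b}"]
print(tree_rollout(toy_gen, "query", depth=2, branching=2))
```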
arXiv Detail & Related papers (2025-08-24T16:52:37Z) - Proximal Algorithm Unrolling: Flexible and Efficient Reconstruction Networks for Single-Pixel Imaging [45.39911367007956]
Deep-unrolling and plug-and-play approaches have become the de facto solvers for the single-pixel imaging (SPI) inverse problem. In this paper, we address the challenge of integrating the strengths of both classes of solvers.
arXiv Detail & Related papers (2025-05-29T07:16:57Z) - CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models [77.16976971950785]
This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models. CPPO prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Experiments show that CPPO achieves up to $7.98\times$ speedup on GSM8K and $3.48\times$ on Math while preserving or even enhancing the accuracy compared to the original GRPO.
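A minimal sketch of the completion-pruning step CPPO describes; the function signature and the `keep_ratio` parameter are assumptions for illustration. Completions with small absolute advantages are dropped before the gradient update.

```python
# Keep only the completions whose |advantage| is largest, so fewer sequences
# enter the policy-gradient computation (assumed interface, not CPPO's code).
def prune_completions(completions, advantages, keep_ratio=0.5):
    ranked = sorted(zip(completions, advantages),
                    key=lambda pair: abs(pair[1]), reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    kept = ranked[:keep]
    return [c for c, _ in kept], [a for _, a in kept]

# Example: out of 8 GRPO completions, only the 4 with the largest |A| are kept.
comps = [f"completion_{i}" for i in range(8)]
advs = [0.9, -0.8, 0.05, -0.02, 0.7, 0.01, -0.6, 0.03]
print(prune_completions(comps, advs, keep_ratio=0.5))
```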
arXiv Detail & Related papers (2025-03-28T11:30:05Z) - Streaming Looking Ahead with Token-level Self-reward [50.699168440048716]
We propose a policy model with token-level self-reward modeling (TRM) capability to eliminate the need for external models and extra communication. In addition, we propose a streaming-looking-ahead (SLA) algorithm to further boost search efficiency with better parallelization. If we combine SLA with reinforcement fine-tuning techniques such as DPO, SLA achieves an overall win rate of 89.4%.
arXiv Detail & Related papers (2025-02-24T22:35:53Z) - Reveal the Mystery of DPO: The Connection between DPO and RL Algorithms [23.399177886166882]
Reinforcement Learning from Human Feedback (RLHF) algorithms have been introduced to improve model safety and alignment with human preferences. These algorithms can be divided into two main frameworks based on whether they require an explicit reward (or value) function for training. We focus on three key aspects related to DPO, RL, and other RLHF algorithms.
arXiv Detail & Related papers (2025-02-05T11:41:43Z) - Surpassing legacy approaches to PWR core reload optimization with single-objective Reinforcement learning [0.0]
We have developed methods based on Deep Reinforcement Learning (DRL) for both single- and multi-objective optimization.
In this paper, we demonstrate the advantage of our RL-based approach, specifically using Proximal Policy Optimization (PPO).
PPO adapts its search capability via a policy with learnable weights, allowing it to function as both a global and local search method.
arXiv Detail & Related papers (2024-02-16T19:35:58Z) - Path Planning using Reinforcement Learning: A Policy Iteration Approach [0.0]
This research aims to shed light on the design space exploration associated with reinforcement learning parameters.
We propose an auto-tuner-based ordinal regression approach to accelerate the process of exploring these parameters.
Our approach provides 1.82x peak speedup with an average of 1.48x speedup over the previous state-of-the-art.
arXiv Detail & Related papers (2023-03-13T23:44:40Z) - Phase Retrieval using Expectation Consistent Signal Recovery Algorithm based on Hypernetwork [73.94896986868146]
Phase retrieval is an important component in modern computational imaging systems.
Recent advances in deep learning have opened up a new possibility for robust and fast PR.
We develop a novel framework for deep unfolding to overcome the existing limitations.
arXiv Detail & Related papers (2021-01-12T08:36:23Z) - Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations": algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm.
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z)