Related papers: KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning

KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning

URL: http://arxiv.org/abs/2602.00400v1
Date: Fri, 30 Jan 2026 23:28:37 GMT
Title: KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning
Authors: Fan Yang, Rui Meng, Trudi Di Qi, Ali Ezzati, Yuxin Wen,
Abstract summary: Reinforcement learning has emerged as a promising paradigm for inducing explicit reasoning behaviors in large language and vision-language models.<n>However, reasoning-oriented RL post-training remains fundamentally challenging due to sparse trajectory-level rewards.<n>Recent on-policy distillation methods introduce dense teacher supervision to stabilize optimization, but apply it uniformly across all generated trajectories.
Score: 24.072603982041798
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) has emerged as a promising paradigm for inducing explicit reasoning behaviors in large language and vision-language models. However, reasoning-oriented RL post-training remains fundamentally challenging due to sparse trajectory-level rewards, leading to ambiguous credit assignment and severe exploration failures that can trap the policy in a ``learning cliff.'' Recent on-policy distillation methods introduce dense teacher supervision to stabilize optimization, but apply it uniformly across all generated trajectories. We argue that such uniform distillation is ill-suited for reasoning-intensive tasks, as low-quality on-policy trajectories often originate from early logical errors, and distillation under flawed contexts injects noisy and misaligned gradients. To address these challenges, we propose Knowledge-Enhanced Preference Optimization (KEPO), a unified post-training framework that integrates: (i) a quality-gated on-policy distillation objective that selectively applies dense teacher guidance only to high-quality trajectories, and (ii) a knowledge-enhanced exploration strategy that leverages hints learned from a teacher model to rejectively sample reward-positive on-policy trajectories for RL, thereby mitigating exploration collapse. Evaluated on a challenging medical visual question answering benchmark under single-source generalization, KEPO demonstrates improved training stability, more coherent reasoning behaviors, and superior out-of-distribution performance over reinforcement learning and on-policy distillation baselines.

Related papers

Reinforcement-aware Knowledge Distillation for LLM Reasoning [63.53679456364683]
Reinforcement learning (RL) post-training has recently driven gains in long chain-of-thought reasoning large language models (LLMs)<n>Most existing knowledge distillation methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization.<n>We propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update.
arXiv Detail & Related papers (2026-02-26T00:20:39Z)
Native Reasoning Models: Training Language Models to Reason on Unverifiable Data [16.065264121785294]
We introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning.<n>NRT reframes the training problem by treating the reasoning process as a latent variable.<n>NRT achieves state-of-the-art performance among verifier-free methods.
arXiv Detail & Related papers (2026-02-12T04:15:46Z)
RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization [40.41228010377401]
We propose Rephrasing Policy Optimization (RePO) to reconcile off-policy knowledge with the stability of on-policy RL.<n>RePO rephrases off-policy knowledge into trajectories that conform to its own stylistic and parametric distribution.<n> Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines.
arXiv Detail & Related papers (2026-02-11T13:02:40Z)
Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration [49.9937230730202]
We propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention.<n>Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories.<n>We show that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales.
arXiv Detail & Related papers (2026-02-03T15:32:09Z)
IIB-LPO: Latent Policy Optimization via Iterative Information Bottleneck [20.113524065146674]
Iterative Information Bottleneck (IIB-LPO) is a novel approach that shifts exploration from statistical perturbation of token to topological branching of reasoning trajectories.<n>IIB-LPO achieves state-of-the-art performance, surpassing prior methods by margins of up to 5.3% in accuracy and 7.4% in diversity metrics.
arXiv Detail & Related papers (2026-01-09T15:46:40Z)
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE.<n>This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
arXiv Detail & Related papers (2025-12-01T07:45:39Z)
Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs)<n>We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference.<n>We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z)
Pinpointing crucial steps: Attribution-based Credit Assignment for Verifiable Reinforcement Learning [5.880405013005892]
ACPO is a phased framework that incorporates a difficulty-aware curriculum.<n>ACPO improves exploration by using trajectory semantic segmentation and an attribution-based representation.<n>It enhances exploitation with a factorized reward system that precisely quantifies the hierarchical contribution of each reasoning step.
arXiv Detail & Related papers (2025-10-10T01:22:55Z)
CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models.<n>We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z)
Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning [87.7836502955847]
We propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning.<n>Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood.<n>We introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy.
arXiv Detail & Related papers (2025-06-10T12:40:39Z)
Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration [39.460202867967006]
We propose an Intrinsic Motivation guidEd exploratioN meThOd foR LLM Reasoning (i-MENTOR) to deliver dense rewards and amplify exploration in the RL-based paradigm.<n>Experiments across 4 public datasets demonstrate i-MENTOR's effectiveness, achieving a 22.23% improvement on AIME 2024.
arXiv Detail & Related papers (2025-05-23T08:30:28Z)
Training Large Language Models to Reason via EM Policy Gradient [0.27195102129094995]
We introduce an off-policy reinforcement learning algorithm, EM Policy Gradient, to enhance LLM reasoning.<n>We evaluate the effectiveness of EM Policy Gradient on the GSM8K and MATH (HARD) datasets.<n>Models fine-tuned with our method exhibit cognitive behaviors, such as sub-problem decomposition, self-verification, and backtracking.
arXiv Detail & Related papers (2025-04-24T01:31:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.