Reinforcement Learning via Self-Distillation
- URL: http://arxiv.org/abs/2601.20802v1
- Date: Wed, 28 Jan 2026 17:45:12 GMT
- Title: Reinforcement Learning via Self-Distillation
- Authors: Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, Andreas Krause
- Abstract summary: Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO). SDPO converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model.
- Score: 37.078107691613155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.
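The self-teacher idea in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's implementation: it assumes the distillation signal is a per-token KL divergence from the feedback-conditioned prediction (teacher) to the unconditioned prediction (student), and the function names, the KL direction, and the toy logits are all hypothetical choices made for illustration. It only shows how comparing the two next-token distributions yields one value per token instead of a single scalar per attempt.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one token position (assumed loss form)."""
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )

def self_distillation_loss(student_logits, teacher_logits):
    """Return (mean KL, per-token KLs) over the positions of one attempt.

    student_logits: the policy's logits given only the question.
    teacher_logits: the same model's logits when feedback is also in context.
    Both are lists of per-position logit lists over the same toy vocabulary.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must cover the same positions")
    per_token = [
        token_kl(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token), per_token

# Toy attempt with 3 token positions and a 3-word vocabulary (made-up numbers).
# With feedback in context, the teacher disagrees with the student only at
# position 1, so only that token receives a non-zero signal.
student_logits = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
teacher_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]

loss, per_token = self_distillation_loss(student_logits, teacher_logits)
print("per-token KL:", per_token)
print("mean loss:", loss)
```

In an actual training setup the teacher distribution would be treated as a fixed target and gradients would flow only through the student, but those details are not stated in the abstract and are left out here.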
Related papers
- IDER: IDempotent Experience Replay for Reliable Continual Learning [36.35747221992653]
Catastrophic forgetting is a major challenge in continual learning (CL). Existing uncertainty-aware CL methods suffer from high computational overhead and incompatibility with mainstream replay methods. We propose idempotent experience replay (IDER), a novel approach based on the idempotent property where repeated function applications yield the same output.
(2026-02-28T12:34:18Z)
- From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
We propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., reward chain). In this way, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts, and style, which evaluates adherence to stylistic properties.
(2026-01-26T14:39:58Z)
- Aletheia: What Makes RLVR For Code Verifiers Tick? [51.371034079170435]
Verifiers trained via Reinforcement Learning from Verifiable Rewards (RLVR) are a prominent fixture of the Large Language Model (LLM) post-training pipeline. Code verifiers remain valuable toward judging model outputs in scenarios where execution feedback is hard to obtain. We examine components of the RLVR-based verifier training recipe widely credited for its success.
(2026-01-17T22:30:45Z)
- Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models [61.78513830395669]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). As models train longer and scale larger, more training prompts become residual prompts, those with zero variance rewards that provide no training signal. We propose the Explore Residual Prompts in Policy Optimization framework, which encourages exploration on residual prompts and reactivates their training signals.
(2025-11-06T20:40:27Z)
- LANPO: Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs [73.27182315028021]
LANPO is a framework that cleanly separates the roles of feedback: language guides exploration, while numerical rewards drive optimization. Our work provides a robust method for integrating historical experiences into the LLM RL loop, creating more effective and data-efficient learning agents.
(2025-10-18T15:51:19Z)
- LaSeR: Reinforcement Learning with Last-Token Self-Rewarding [54.72617309922891]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). Previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. We propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss.
(2025-10-16T17:55:11Z)
- CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment [44.33395106709674]
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. Current RLVR methods typically assign the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure.
(2025-08-04T11:06:08Z)
- Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation [37.36913210031282]
Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering. We propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques.
(2024-05-29T01:49:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.