Related papers: RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following

URL: http://arxiv.org/abs/2510.14200v1
Date: Thu, 16 Oct 2025 01:13:14 GMT
Title: RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following
Authors: Zhichao Wang, Andy Wong, Ruslan Belkin
Abstract summary: We propose replacing SFT with RLSR to leverage the extensive SFT dataset in an RL framework. In RLSR, the base model generates multiple responses for each prompt, and reward scores are computed as the cosine similarity in the semantic embedding space between the generated and human-labeled responses.
Score: 4.6740998081727385
License: http://creativecommons.org/licenses/by/4.0/
Abstract: After the pretraining stage of LLMs, techniques such as SFT, RLHF, RLVR, and RFT are applied to enhance instruction-following ability, mitigate undesired responses, improve reasoning capability, and enable efficient domain adaptation with minimal data. SFT relies on the next-token prediction objective to strengthen instruction following in a base model using a large corpus of human-labeled responses. In contrast, RFT employs an RL-based approach to adapt fine-tuned reasoning models to specific domains with limited supervision. Inspired by RFT, we propose replacing SFT with RLSR to leverage the extensive SFT dataset in an RL framework, thereby improving the base model's instruction-following ability. In RLSR, the base model generates multiple responses for each prompt, and reward scores are computed as the cosine similarity in the semantic embedding space between the generated and human-labeled responses. RLSR can be used in multiple ways. It can directly replace SFT, achieving superior performance on instruction-following benchmarks; for example, RLSR (SB) on Qwen-7B (INFINITY) achieved an AlpacaEval win rate of 26.34%, surpassing SFT's 21.01%. Furthermore, combining SFT and RLSR further enhances downstream task performance; Qwen-7B (INFINITY) achieved a win rate of 30.73% when trained with SFT + RLSR.
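The reward described in the abstract, cosine similarity between the embedding of each generated response and the embedding of the human-labeled response, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `cosine_similarity` and `rlsr_rewards` are hypothetical, the embedding model is not specified here, and short hand-written vectors stand in for real sentence embeddings.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


def rlsr_rewards(
    sampled_embeddings: Sequence[Sequence[float]],
    reference_embedding: Sequence[float],
) -> List[float]:
    """One reward per sampled response for a single prompt.

    sampled_embeddings: embeddings of the responses the policy generated.
    reference_embedding: embedding of the human-labeled response.
    """
    return [
        cosine_similarity(embedding, reference_embedding)
        for embedding in sampled_embeddings
    ]


if __name__ == "__main__":
    # Toy 2-d vectors standing in for sentence embeddings (assumption).
    reference = [1.0, 0.0]
    samples = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    for sample, reward in zip(samples, rlsr_rewards(samples, reference)):
        print(sample, round(reward, 3))
```

In an RL loop these per-response scores would take the place of a learned or rule-based reward; how they are turned into a policy update depends on the RL algorithm, which the abstract does not specify.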

Related papers

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning [8.550698116833123]
Post-training of reasoning LLMs typically consists of an offline SFT stage followed by an online reinforcement learning stage. We show that, after identical RL training, models from stronger SFT checkpoints can significantly underperform those from weaker ones. We propose PEAR, an SFT-stage method that corrects this mismatch and better prepares the model for RL.
arXiv Detail & Related papers (2026-02-01T06:53:45Z)
On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training [10.433802085981046]
Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). We show that RL increases SFT loss under SFT optimality and that SFT lowers the reward achieved by RL. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated in post-training without loss of prior performance.
arXiv Detail & Related papers (2026-01-12T10:14:09Z)
Trust-Region Adaptive Policy Optimization [82.09255251747818]
Post-training methods play an important role in improving large language models' (LLMs) complex reasoning abilities. We introduce TRAPO, a framework that interleaves Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) within each training instance. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines.
arXiv Detail & Related papers (2025-12-19T14:37:07Z)
The Path Not Taken: RLVR Provably Learns Off the Principals [85.41043469428365]
We show that sparsity is a surface artifact of a model-conditioned optimization bias. We mechanistically explain these dynamics with a Three-Gate Theory. We provide a parameter-level characterization of RLVR's learning dynamics.
arXiv Detail & Related papers (2025-11-11T18:49:45Z)
Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners [28.039145840787683]
Supervised fine-tuning (SFT) offers complementary benefits but typically requires large-scale data and risks overfitting. Recent attempts to combine SFT and RL face three main challenges: data inefficiency, algorithm-specific designs, and catastrophic forgetting. We propose a plug-and-play framework that dynamically integrates SFT into RL by selecting challenging examples for SFT.
arXiv Detail & Related papers (2025-10-06T03:01:14Z)
Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead [20.446287312285648]
We study whether high SFT scores translate to improved performance after RL. We find high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness. We study alternative metrics and find that generalization loss on held-out reasoning examples and Pass@large k performance provide strong proxies for the RL outcome.
arXiv Detail & Related papers (2025-10-02T02:57:00Z)
Reinforcement Learning on Pre-Training Data [55.570379963147424]
We introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing large language models (LLMs). RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT.
arXiv Detail & Related papers (2025-09-23T17:10:40Z)
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification [61.607788999847564]
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. We propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of this token.
arXiv Detail & Related papers (2025-08-07T17:59:04Z)
Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) [3.13388270461847]
We draw on a connection between supervised fine-tuning (SFT) and the theory and practice of finding optimal policies via Reinforcement Learning (RL). We show that a small modification to SFT leads to an importance weighted variant that behaves closer to training with RL. We refer to this variant as importance weighted supervised fine-tuning (iw-SFT).
arXiv Detail & Related papers (2025-07-17T07:26:54Z)
Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective [98.45690529036848]
Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear.
arXiv Detail & Related papers (2025-06-30T04:15:01Z)
AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy [48.30596996677882]
We investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. Scaling strategies yield notable improvements in reasoning performance. Our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models.
arXiv Detail & Related papers (2025-06-16T09:27:48Z)
Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z)
Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process [19.986235452236272]
Supervised Fine-Tuning (SFT) and Preference Optimization (PO) are key processes for aligning Language Models (LMs) with human preferences after pre-training. We introduce Intuitive Fine-Tuning (IFT) to integrate SFT and PO into a single process. IFT performs comparably to, or even better than, SFT and some typical PO methods across several tasks.
arXiv Detail & Related papers (2024-05-20T08:23:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.