Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards
- URL: http://arxiv.org/abs/2510.18814v1
- Date: Tue, 21 Oct 2025 17:15:56 GMT
- Title: Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards
- Authors: Mengqi Li, Lei Zhao, Anthony Man-Cho So, Ruoyu Sun, Xiao Li,
- Abstract summary: We present a self-help online supervised finetuning (OSFT) paradigm for LLM reasoning.<n>OSFT is a highly efficient training strategy for LLM reasoning, as it is reward-free and uses just one rollout by default.<n>We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms.
- Score: 24.382221008037188
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning. In this paradigm, the model generates its own responses and is immediately finetuned on this self-generated data. OSFT is a highly efficient training strategy for LLM reasoning, as it is reward-free and uses just one rollout by default. Experiment results show that OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO. Our ablation study further demonstrates the efficiency and robustness of OSFT. The major mechanism of OSFT lies in facilitating the model's own existing preference (latent knowledge) learned from pretraining, which leads to reasoning ability improvement. We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms. Our code is available at https://github.com/ElementQi/OnlineSFT.
Related papers
- S3-CoT: Self-Sampled Succinct Reasoning Enables Efficient Chain-of-Thought LLMs [48.80914119283909]
Large language models equipped with chain-of-thought (CoT) achieve strong performance and offer a window into behavior.<n>Recent evidence suggests that improvements in CoT capabilities often come with redundant reasoning processes.<n>Our study presents a self-sampling framework based on activation steering for efficient CoT learning.
arXiv Detail & Related papers (2026-02-02T11:37:36Z) - Beyond Imitation: Recovering Dense Rewards from Demonstrations [64.05543657441218]
supervised fine-tuning is treated as a simple imitation learning process that only trains a policy to imitate expert behavior on datasets.<n>We prove that the SFT process does not just learn a policy, but also an implicit, dense, token-level reward model that explains the expert demonstrations.<n>Dense-Path REINFORCE consistently outperforms the original SFT models on instruction-following benchmarks.
arXiv Detail & Related papers (2025-10-02T18:58:26Z) - AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance [7.685078284407324]
Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)<n>Recent single-stage methods attempt to unify SFT and RL using principleds, but lack a mechanism for dynamically balancing the two paradigms.<n>We introduce textbf Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward.
arXiv Detail & Related papers (2025-08-09T11:40:54Z) - On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification [61.607788999847564]
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for the Large Language Model (LLM)<n>We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of model.<n>We propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token.
arXiv Detail & Related papers (2025-08-07T17:59:04Z) - Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective [98.45690529036848]
Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks.<n>While effective at task adaptation, their impact on prior knowledge remains unclear.
arXiv Detail & Related papers (2025-06-30T04:15:01Z) - Bridging Supervised Learning and Reinforcement Learning in Math Reasoning [55.889740979706815]
Reinforcement Learning (RL) has played a central role in the recent surge of math abilities by enabling self-improvement through binary verifier signals.<n>In this work, we propose Negative-aware Fine-Tuning (NFT) -- a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers.
arXiv Detail & Related papers (2025-05-23T17:17:40Z) - Refining Salience-Aware Sparse Fine-Tuning Strategies for Language Models [14.68920095399595]
sparsity-based PEFT (SPEFT) introduces trainable sparse adaptations to the weight matrices in the model.<n>We conduct the first systematic evaluation of salience metrics for SPEFT, inspired by zero-cost NAS proxies.<n>We compare static and dynamic masking strategies, finding that static masking, which predetermines non-zero entries before training, delivers efficiency without sacrificing performance.
arXiv Detail & Related papers (2024-12-18T04:14:35Z) - PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning [17.73193523921637]
Large language models (LLMs) have shown remarkable abilities in diverse natural language processing (NLP) tasks.
LLMs generally undergo supervised fine-tuning (SFT) followed by preference alignment to be usable in downstream applications.
This paper introduces PAFT, a new PArallel training paradigm for effective LLM Fine-Tuning.
arXiv Detail & Related papers (2024-06-25T20:11:37Z) - Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build an reward model and a policy model.
We show that the proposed algorithms converge to the stationary solutions of the IRL problem.
Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
arXiv Detail & Related papers (2024-05-28T07:11:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.