Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training
- URL: http://arxiv.org/abs/2602.10815v1
- Date: Wed, 11 Feb 2026 12:55:15 GMT
- Title: Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training
- Authors: Aojun Lu, Tao Feng, Hangjie Yuan, Wei Li, Yanan Sun,
- Abstract summary: Large-scale Vision-Language Models (VLMs) consistently achieve superior out-of-distribution (OOD) performance compared to those trained withSupervised Fine-Tuning (SFT)<n>This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples.<n> Experiments show that Difficulty-Curated SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency.
- Score: 18.926351241813425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.
Related papers
- Efficient RLVR Training via Weighted Mutual Information Data Selection [30.408074538619626]
Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models.<n>We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective.<n>We show that expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection.
arXiv Detail & Related papers (2026-03-02T14:25:07Z) - Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models [56.12341509545198]
Large Language Models (LLMs) are pretrained on massive datasets and later instruction-tuned via supervised fine-tuning (SFT) or reinforcement learning (RL)<n>Best practices emphasize large, diverse pretraining data, whereas post-training operates differently.<n>We theoretically analyze transformers trained on an in-context weight prediction task for linear regression.
arXiv Detail & Related papers (2026-03-01T21:58:09Z) - Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training [61.1421888242439]
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL)<n>We propose a framework to bridge this chasm by enabling On-Policy SFT.
arXiv Detail & Related papers (2026-02-12T17:59:58Z) - Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration [56.074760766965085]
PRISM achieves a dynamics-aware framework that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge.<n>Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.
arXiv Detail & Related papers (2026-01-12T05:43:20Z) - Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning [30.751908700207185]
SFT plays a crucial role across several scenarios.<n>SFT with only 2K achieves comparable or better reasoning performance to RL with 20K.<n>We identify a pervasive issue of deceptive rewards, where higher rewards fail to correlate with better reasoning accuracy in RL.
arXiv Detail & Related papers (2025-12-14T13:46:42Z) - Debunk the Myth of SFT Generalization [13.700645417996412]
A prevailing view holds that supervised fine-tuning (SFT) fails to generalize, whereas reinforcement learning (RL) attains broader robustness.<n>We show that much of SFT's perceived failure stems from frozen-prompt artifacts.<n>We ask whether SFT can generalize strictly harder tasks.
arXiv Detail & Related papers (2025-09-30T20:01:09Z) - On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification [61.607788999847564]
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for the Large Language Model (LLM)<n>We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of model.<n>We propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token.
arXiv Detail & Related papers (2025-08-07T17:59:04Z) - Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) [3.13388270461847]
We draw on a connection between supervised fine-tuning (SFT) and the theory and practice of finding optimal policies via Reinforcement Learning (RL)<n>We show that a small modification to SFT leads to an importance weighted variant that behaves closer to training with RL as it.<n>We refer to this variant as importance weighted supervised fine-tuning (iw-SFT)
arXiv Detail & Related papers (2025-07-17T07:26:54Z) - Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective [98.45690529036848]
Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks.<n>While effective at task adaptation, their impact on prior knowledge remains unclear.
arXiv Detail & Related papers (2025-06-30T04:15:01Z) - Angles Don't Lie: Unlocking Training-Efficient RL Through the Model's Own Signals [49.17123504516502]
CurrentReinforcement Fine-tuning (RFT) paradigms for Large Language Models (LLMs) suffer from inefficiency due to redundant exposure of identical queries under uniform data sampling.<n>We propose a Gradient-driven Angle-Informed Navigated RL framework.<n>By leveraging the model's intrinsic angle concentration signal, GAIN-RL dynamically selects training data in each epoch, ensuring consistently impactful gradient updates.
arXiv Detail & Related papers (2025-06-02T21:40:38Z) - Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization [67.8738082040299]
Self-Sampling Preference Optimization (SSPO) is a new alignment method for post-training reinforcement learning.<n>SSPO eliminates the need for paired data and reward models while retaining the training stability of SFT.<n>SSPO surpasses all previous approaches on the text-to-image benchmarks and demonstrates outstanding performance on the text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.