Related papers: Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

URL: http://arxiv.org/abs/2602.10815v1
Date: Wed, 11 Feb 2026 12:55:15 GMT
Title: Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training
Authors: Aojun Lu, Tao Feng, Hangjie Yuan, Wei Li, Yanan Sun
Abstract summary: Vision-Language Models (VLMs) fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. Experiments show that Difficulty-Curated SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency.
Score: 18.926351241813425
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.
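
The abstract describes DC-SFT only as filtering the training set by sample difficulty. The sketch below is one possible reading of that idea, not the authors' implementation (see the linked repository for that). It assumes difficulty is estimated as the failure rate of a reference model over k sampled responses and that only samples inside a difficulty band are kept. The names Sample, estimate_difficulty, curate_by_difficulty, and generate, and the thresholds low and high, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Sample:
    prompt: str
    answer: str


def estimate_difficulty(sample: Sample,
                        generate: Callable[[str], str],
                        k: int = 8) -> float:
    """Fraction of k sampled responses that miss the reference answer.

    `generate` is a hypothetical stand-in for sampling one response from a
    reference model; exact-match scoring is an assumption of this sketch.
    """
    wrong = sum(generate(sample.prompt) != sample.answer for _ in range(k))
    return wrong / k


def curate_by_difficulty(samples: Sequence[Sample],
                         generate: Callable[[str], str],
                         k: int = 8,
                         low: float = 0.2,
                         high: float = 0.8) -> List[Sample]:
    """Keep samples whose estimated difficulty lies in [low, high].

    The band limits are illustrative values, not taken from the paper.
    """
    kept = []
    for sample in samples:
        difficulty = estimate_difficulty(sample, generate, k)
        if low <= difficulty <= high:
            kept.append(sample)
    return kept
```

The curated list would then be passed to an ordinary SFT loop. Setting low to 0.0 keeps easy samples and drops only the hard ones, which is the minimal change suggested by the abstract's finding that hard samples degrade OOD performance.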

Related papers

Efficient RLVR Training via Weighted Mutual Information Data Selection [30.408074538619626]
Reinforcement learning (RL) plays a central role in improving the reasoning and alignment of large language models. We introduce InSight, an INformation-guided data SamplInG metHod for RL Training, grounded in a weighted mutual information objective. We show that expected uncertainty reduction decomposes into complementary difficulty- and evidence-dependent components, revealing a fundamental limitation of difficulty-only selection.
arXiv Detail & Related papers (2026-03-02T14:25:07Z)
Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models [56.12341509545198]
Large Language Models (LLMs) are pretrained on massive datasets and later instruction-tuned via supervised fine-tuning (SFT) or reinforcement learning (RL). Best practices emphasize large, diverse pretraining data, whereas post-training operates differently. We theoretically analyze transformers trained on an in-context weight prediction task for linear regression.
arXiv Detail & Related papers (2026-03-01T21:58:09Z)
Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training [61.1421888242439]
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). We propose a framework to bridge this chasm by enabling On-Policy SFT.
arXiv Detail & Related papers (2026-02-12T17:59:58Z)
Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration [56.074760766965085]
PRISM is a dynamics-aware framework that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.
arXiv Detail & Related papers (2026-01-12T05:43:20Z)
Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning [30.751908700207185]
SFT plays a crucial role across several scenarios. SFT with only 2K achieves comparable or better reasoning performance to RL with 20K. We identify a pervasive issue of deceptive rewards, where higher rewards fail to correlate with better reasoning accuracy in RL.
arXiv Detail & Related papers (2025-12-14T13:46:42Z)
Debunk the Myth of SFT Generalization [13.700645417996412]
A prevailing view holds that supervised fine-tuning (SFT) fails to generalize, whereas reinforcement learning (RL) attains broader robustness. We show that much of SFT's perceived failure stems from frozen-prompt artifacts. We ask whether SFT can generalize to strictly harder tasks.
arXiv Detail & Related papers (2025-09-30T20:01:09Z)
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification [61.607788999847564]
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. We propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token.
arXiv Detail & Related papers (2025-08-07T17:59:04Z)
Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) [3.13388270461847]
We draw on a connection between supervised fine-tuning (SFT) and the theory and practice of finding optimal policies via Reinforcement Learning (RL). We show that a small modification to SFT leads to an importance weighted variant that behaves closer to training with RL as it. We refer to this variant as importance weighted supervised fine-tuning (iw-SFT).
arXiv Detail & Related papers (2025-07-17T07:26:54Z)
Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective [98.45690529036848]
Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear.
arXiv Detail & Related papers (2025-06-30T04:15:01Z)
Angles Don't Lie: Unlocking Training-Efficient RL Through the Model's Own Signals [49.17123504516502]
Current Reinforcement Fine-tuning (RFT) paradigms for Large Language Models (LLMs) suffer from inefficiency due to redundant exposure of identical queries under uniform data sampling. We propose a Gradient-driven Angle-Informed Navigated RL framework. By leveraging the model's intrinsic angle concentration signal, GAIN-RL dynamically selects training data in each epoch, ensuring consistently impactful gradient updates.
arXiv Detail & Related papers (2025-06-02T21:40:38Z)
Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization [67.8738082040299]
Self-Sampling Preference Optimization (SSPO) is a new alignment method for post-training reinforcement learning. SSPO eliminates the need for paired data and reward models while retaining the training stability of SFT. SSPO surpasses all previous approaches on the text-to-image benchmarks and demonstrates outstanding performance on the text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.