SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning
- URL: http://arxiv.org/abs/2602.07464v1
- Date: Sat, 07 Feb 2026 09:39:21 GMT
- Title: SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning
- Authors: Yijie Chen, Yijin Liu, Fandong Meng,
- Abstract summary: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). We propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective.
- Score: 54.393763477932474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). However, the conventional SFT process, driven by Cross-Entropy (CE) loss, often induces mode collapse, where models over-concentrate on specific response patterns. This lack of distributional diversity severely restricts the exploration efficiency required for subsequent RL. While recent studies have attempted to improve SFT by replacing the CE loss, aiming to preserve diversity or refine the update policy, they fail to adequately balance diversity and accuracy, thereby yielding suboptimal performance after RL. To address the mode collapse problem, we propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective. Extensive experiments across eight mathematical benchmarks demonstrate that SED-SFT significantly enhances generation diversity with negligible computational overhead relative to CE loss, yielding average improvements of 2.06 and 1.20 points in subsequent RL performance over standard CE-based baselines on Llama-3.2-3B-Instruct and Qwen2.5-Math-7B-Instruct, respectively. The code is publicly available at https://github.com/pppa2019/SED-SFT
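The objective described in the abstract can be pictured as standard cross-entropy minus a masked entropy bonus. Below is a minimal PyTorch sketch; the masking criterion (top-1 probability below a threshold, as a proxy for a "large exploration space"), the coefficient `lam`, and the threshold are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sed_sft_loss(logits, targets, lam=0.1, top1_threshold=0.5):
    """Cross-entropy plus a selectively masked entropy bonus (sketch).

    logits:  (batch, seq, vocab) model outputs
    targets: (batch, seq) gold token ids
    The selection rule below (mask tokens whose top-1 probability is
    still low) is an assumed proxy for the paper's "token exploration
    space"; SED-SFT's actual criterion may differ.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Token-level cross-entropy, kept per token so it can be combined
    # with the per-token entropy term.
    ce = F.nll_loss(log_probs.flatten(0, 1), targets.flatten(),
                    reduction="none").view(targets.shape)

    # Entropy of the predictive distribution at each position.
    entropy = -(probs * log_probs).sum(dim=-1)

    # Selective mask: only encourage diversity where the distribution
    # is not already committed to a single token.
    mask = (probs.max(dim=-1).values < top1_threshold).float()

    # Subtracting entropy maximizes it on the masked positions.
    return (ce - lam * mask * entropy).mean()
```

Because the mask is computed from the current predictive distribution, the regularizer fades out on tokens the model has already committed to, which is one plausible way to trade diversity against accuracy.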
Related papers
- BinaryPPO: Efficient Policy Optimization for Binary Classification [10.249166265785686]
Supervised fine-tuning (SFT) is the standard approach for binary classification tasks. We introduce BinaryPPO, a framework that reformulates binary classification as a reward learning problem (a rough sketch follows this entry). BinaryPPO improves accuracy by 40-60 percentage points, reaching up to 99%, substantially outperforming supervised baselines.
arXiv Detail & Related papers (2026-02-02T19:22:45Z)
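As a rough illustration of the reward-learning reformulation, a correct label can be treated as reward 1 and an incorrect one as reward 0, optimized with a standard PPO clipped surrogate. All names and the reward design below are assumptions based only on the abstract.

```python
import torch

def label_reward(pred_labels, gold_labels):
    """Binary reward: 1.0 for a correct label, 0.0 otherwise (assumed design)."""
    return (pred_labels == gold_labels).float()

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate; advantages would be the label
    rewards minus a baseline."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```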
- Trust-Region Adaptive Policy Optimization [82.09255251747818]
Post-training methods play an important role in improving the complex reasoning abilities of large language models (LLMs). We introduce TRAPO, a framework that interleaves Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) within each training instance (sketched after this entry). Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines.
arXiv Detail & Related papers (2025-12-19T14:37:07Z)
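A rough sketch of interleaving the two signals on a single instance, using a plain REINFORCE surrogate in place of the paper's trust-region update; the mixing weight and the overall form are assumptions, since the abstract gives only the high-level idea.

```python
def interleaved_loss(sft_logp, rl_logp, advantage, alpha=0.5):
    """Combine an SFT term and an RL term for one training instance.

    sft_logp:  per-token log-probs of the reference answer (teacher forcing)
    rl_logp:   per-token log-probs of a sampled rollout under the policy
    advantage: scalar reward-minus-baseline for that rollout
    alpha:     assumed mixing weight between the two signals
    """
    sft_loss = -sft_logp.mean()
    rl_loss = -(advantage * rl_logp).mean()  # REINFORCE-style surrogate
    return alpha * sft_loss + (1.0 - alpha) * rl_loss
```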
- Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling [90.87033586963828]
Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). We propose Self-Consistency Sampling (SCS) to correct this issue. Based on Qwen2.5-VL-7B-Instruct, SCS improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation.
arXiv Detail & Related papers (2025-11-13T18:59:57Z)
- Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning [18.934789236342244]
Large language models (LLMs) primarily rely on supervised fine-tuning (SFT) to adapt pre-trained models to domain-specific tasks such as mathematical reasoning. Standard SFT penalizes all tokens uniformly, neglecting that only a small subset of critical tokens determines reasoning correctness. We propose Critical Token Fine-tuning (CFT), a simple yet effective approach that updates only tokens identified as functionally indispensable via counterfactual perturbations (see the sketch after this entry).
arXiv Detail & Related papers (2025-10-13T03:25:36Z)
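The core update reduces to cross-entropy restricted to the critical tokens. A minimal sketch, assuming the counterfactual-perturbation procedure has already produced a 0/1 mask:

```python
import torch
import torch.nn.functional as F

def critical_token_loss(logits, targets, critical_mask):
    """Cross-entropy averaged over critical tokens only.

    critical_mask: (batch, seq) with 1.0 where a token was judged
    functionally indispensable; building this mask via counterfactual
    perturbations is the paper's contribution and is assumed given here.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_probs.flatten(0, 1), targets.flatten(),
                     reduction="none").view(targets.shape)
    denom = critical_mask.sum().clamp(min=1.0)  # avoid division by zero
    return (nll * critical_mask).sum() / denom
```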
- MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources [113.33902847941941]
Variance-Aware Sampling (VAS) is a data selection strategy guided by the Variance Promotion Score (VPS); a selection sketch follows this entry. We release large-scale, carefully curated resources containing 1.6M long CoT cold-start examples and 15k RL QA pairs. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS.
arXiv Detail & Related papers (2025-09-25T14:58:29Z)
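How the selection step might look, assuming VPS values are already computed per sample; top-fraction filtering is an illustrative simplification (the paper may sample in proportion to the score rather than filter), and `keep_frac` is an assumed hyperparameter.

```python
def variance_aware_select(samples, vps_scores, keep_frac=0.3):
    """Keep the highest-VPS fraction of a candidate pool (sketch).

    vps_scores: one precomputed Variance Promotion Score per sample;
    the scoring function itself is assumed available.
    """
    ranked = sorted(range(len(samples)),
                    key=lambda i: vps_scores[i], reverse=True)
    k = max(1, int(len(samples) * keep_frac))
    return [samples[i] for i in ranked[:k]]
```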
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification [61.607788999847564]
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for large language models (LLMs). We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the model's generalization capabilities. We propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates by dynamically rescaling each token's objective with that token's probability (sketched below).
arXiv Detail & Related papers (2025-08-07T17:59:04Z)
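The rescaling described above has a compact form: weight each token's cross-entropy by the model's probability of that token. A minimal sketch; treating the weight as stop-gradient is an assumption about the exact objective.

```python
import torch.nn.functional as F

def dft_loss(logits, targets):
    """Cross-entropy rescaled per token by the target-token probability.

    logits: (batch, seq, vocab); targets: (batch, seq) gold token ids.
    The .detach() keeps the rescaling weight out of the gradient
    (an assumption about the paper's exact formulation).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    weight = token_logp.exp().detach()  # p(target token), stop-gradient
    return -(weight * token_logp).mean()
```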
- Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning-rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z)
- Preserving Diversity in Supervised Fine-Tuning of Large Language Models [29.02934952075354]
This paper introduces a new game-theoretic formulation for Supervised Fine-Tuning (SFT). In this framework, an auxiliary variable is introduced to regulate the learning process. We prove that the proposed game-theoretic approach connects to the problem of reverse KL minimization with entropy regularization (a sketch of that objective follows this entry).
arXiv Detail & Related papers (2024-08-29T16:21:00Z)
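The objective the proof connects to can be written down directly. A sketch, assuming a given reference distribution over the vocabulary and an illustrative entropy weight `beta`:

```python
import torch.nn.functional as F

def reverse_kl_entropy_objective(model_logits, target_probs, beta=0.1):
    """Reverse KL(model || target) minus an entropy bonus (sketch).

    target_probs: reference distribution over the vocabulary at each
    position; its construction and beta are illustrative assumptions.
    """
    log_q = F.log_softmax(model_logits, dim=-1)
    q = log_q.exp()
    log_p = target_probs.clamp_min(1e-12).log()
    reverse_kl = (q * (log_q - log_p)).sum(dim=-1)
    entropy = -(q * log_q).sum(dim=-1)
    return (reverse_kl - beta * entropy).mean()
```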