ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection
- URL: http://arxiv.org/abs/2601.09195v1
- Date: Wed, 14 Jan 2026 05:50:40 GMT
- Title: ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection
- Authors: Tao Liu, Taiqiang Wu, Runming Yang, Shaoning Sun, Junjie Wang, Yujiu Yang
- Abstract summary: Supervised fine-tuning is a strategy to align Large Language Models with human intent. Traditional SFT often ignores the one-to-many nature of language by forcing alignment with a single reference answer. We propose ProFit, which selectively masks low-probability tokens to prevent surface-level overfitting.
- Score: 47.413985185291864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised fine-tuning (SFT) is a fundamental post-training strategy to align Large Language Models (LLMs) with human intent. However, traditional SFT often ignores the one-to-many nature of language by forcing alignment with a single reference answer, leading to the model overfitting to non-core expressions. Although our empirical analysis suggests that introducing multiple reference answers can mitigate this issue, the prohibitive data and computational costs necessitate a strategic shift: prioritizing the mitigation of single-reference overfitting over the costly pursuit of answer diversity. To achieve this, we reveal the intrinsic connection between token probability and semantic importance: high-probability tokens carry the core logical framework, while low-probability tokens are mostly replaceable expressions. Based on this insight, we propose ProFit, which selectively masks low-probability tokens to prevent surface-level overfitting. Extensive experiments confirm that ProFit consistently outperforms traditional SFT baselines on general reasoning and mathematical benchmarks.
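The core mechanism lends itself to a compact sketch. Below is a minimal, hedged PyTorch implementation of probability-guided loss masking: tokens whose model probability falls below a threshold are dropped from the cross-entropy objective. The threshold `tau` and the use of the model's own current probabilities as the importance signal are assumptions; the paper's exact selection criterion may differ.

```python
# A minimal sketch of probability-guided token masking for SFT, in the
# spirit of ProFit. `tau` and the use of the model's current probabilities
# as the importance signal are assumptions, not the paper's exact rule.
import torch.nn.functional as F

def profit_loss(logits, labels, tau=0.1, ignore_index=-100):
    """Cross-entropy over the reference answer, with tokens whose model
    probability falls below `tau` masked out of the loss, so the model is
    not forced to match low-probability, replaceable surface forms."""
    # Shift so that position t predicts token t+1.
    logits, labels = logits[:, :-1, :], labels[:, 1:]
    log_probs = F.log_softmax(logits, dim=-1)

    valid = labels != ignore_index
    safe_labels = labels.masked_fill(~valid, 0)
    token_logp = log_probs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)

    # Keep only high-probability tokens (assumed to carry the core logic);
    # the mask gates the loss but receives no gradient itself.
    keep = ((token_logp.exp() >= tau) & valid).detach()
    return -(token_logp * keep).sum() / keep.sum().clamp(min=1)
```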
Related papers
- Do It for HER: First-Order Temporal Logic Reward Specification in Reinforcement Learning (Extended Version) [49.462399222747024]
We propose a novel framework for the logical specification of non-Markovian rewards in Markov Decision Processes (MDPs) with large state spaces. Our approach leverages Linear Temporal Logic Modulo Theories over finite traces (LTLfMT). We introduce a method based on reward machines and Hindsight Experience Replay (HER) to translate first-order logic specifications and address reward sparsity.
arXiv Detail & Related papers (2026-02-05T22:11:28Z)
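For intuition, here is a generic sketch of the hindsight-relabeling step that HER-style methods use against a reward-machine acceptance check. The `RewardMachine` interface (`accepts`, `goal_from_state`) is invented for illustration; the paper's LTLfMT-based construction is considerably richer.

```python
# Hypothetical sketch of HER relabeling against a reward machine; the
# RewardMachine interface here is an assumption for illustration only.
import random

def her_relabel(episode, reward_machine, k=4):
    """episode: list of (state, action, next_state, goal) tuples.
    Returns transitions augmented with hindsight goals actually achieved
    later in the episode, turning sparse failures into dense successes."""
    out = []
    for t, (s, a, s_next, goal) in enumerate(episode):
        out.append((s, a, s_next, goal,
                    float(reward_machine.accepts(s_next, goal))))
        # Relabel with up to k goals drawn from the episode's own future.
        future = episode[t:]
        for _ in range(min(k, len(future))):
            _, _, achieved, _ = random.choice(future)
            g = reward_machine.goal_from_state(achieved)
            out.append((s, a, s_next, g,
                        float(reward_machine.accepts(s_next, g))))
    return out
```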
- CORE: Context-Robust Remasking for Diffusion Language Models [51.59514489363897]
We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.
arXiv Detail & Related papers (2026-02-04T00:12:30Z)
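A hedged sketch of the sensitivity probe this describes: a token counts as context-brittle if its probability swings when surrounding context is masked. The `model` call signature (returning `[B, L, V]` logits), `mask_id`, and the uniform random masking are stand-ins; CORE's targeted perturbations and remasking schedule may differ.

```python
# Sketch: score each position by how much its token's log-probability moves
# under random masked-context perturbations. Model interface and the random
# (rather than targeted) perturbation scheme are illustrative assumptions.
import torch

@torch.no_grad()
def brittleness_scores(model, tokens, mask_id, num_probes=8, mask_frac=0.15):
    base = model(tokens).log_softmax(-1)
    base_lp = base.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # [B, L]

    deltas = torch.zeros_like(base_lp)
    for _ in range(num_probes):
        perturbed = tokens.clone()
        hit = torch.rand(tokens.shape, device=tokens.device) < mask_frac
        perturbed[hit] = mask_id
        lp = model(perturbed).log_softmax(-1).gather(
            -1, tokens.unsqueeze(-1)).squeeze(-1)
        deltas += (base_lp - lp).abs()
    # Highest-scoring positions are the context-brittle ones to remask.
    return deltas / num_probes
```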
- Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge [87.51901436392427]
Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT). Humans, by contrast, often reason softly by maintaining a tractable probability distribution over plausible next steps. We propose Multiplex Thinking, a soft reasoning mechanism that samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT.
arXiv Detail & Related papers (2026-01-13T18:48:00Z)
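One branch-and-merge step can be sketched as follows: sample K candidate next tokens and merge their embeddings, weighted by renormalized probabilities. Sampling without replacement and the plain probability weighting are our assumptions about details the summary leaves open.

```python
# Sketch of one multiplex-token step; sampling scheme and weighting are
# assumptions about details the summary does not specify.
import torch

def multiplex_token(logits, embedding, k=4):
    """logits: [vocab_size]; embedding: nn.Embedding. Returns one continuous
    'multiplex token' of shape [d_model]. When the model is confident, one
    weight dominates and the result is nearly the discrete CoT embedding."""
    probs = torch.softmax(logits, dim=-1)
    idx = torch.multinomial(probs, k)            # K distinct candidates
    weights = probs[idx] / probs[idx].sum()      # renormalize over the branch
    return (weights.unsqueeze(-1) * embedding(idx)).sum(dim=0)
```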
- Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning [18.934789236342244]
Large language models (LLMs) primarily rely on supervised fine-tuning (SFT) to adapt pre-trained models to domain-specific tasks such as mathematical reasoning. Standard SFT uniformly penalizes all tokens, neglecting that only a small subset of critical tokens determines reasoning correctness. We propose Critical Token Fine-tuning (CFT), a simple yet effective approach that updates only tokens identified as functionally indispensable via counterfactual perturbations.
arXiv Detail & Related papers (2025-10-13T03:25:36Z)
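A minimal sketch of counterfactual token scoring in this spirit: perturb one reference token at a time and mark it critical if the sequence likelihood drops sharply. The substitution token `sub_id`, the threshold, and whole-sequence scoring are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of counterfactual critical-token identification; substitution
# token, threshold, and scoring granularity are illustrative assumptions.
import torch

@torch.no_grad()
def critical_token_mask(model, input_ids, answer_positions, sub_id,
                        threshold=1.0):
    def seq_logp(ids):
        logp = model(ids).log_softmax(-1)[:, :-1]
        return logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1).sum()

    base = seq_logp(input_ids)
    mask = torch.zeros(input_ids.shape[1], dtype=torch.bool)
    for pos in answer_positions:
        perturbed = input_ids.clone()
        perturbed[:, pos] = sub_id          # counterfactual substitution
        mask[pos] = (base - seq_logp(perturbed)) > threshold
    return mask  # True = functionally indispensable; SFT updates only these
```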
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification [61.607788999847564]
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. We propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token.
arXiv Detail & Related papers (2025-08-07T17:59:04Z)
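Read literally, the rescaling admits a one-line change to the SFT loss: weight each token's cross-entropy by its gradient-detached probability. Treating the weight as stop-gradient p(token) is our reading of the summary, not a verified reproduction.

```python
# Sketch of the described rescaling; sg(p) as the per-token weight is our
# reading of the summary, not a verified reproduction of DFT.
import torch.nn.functional as F

def dft_loss(logits, labels, ignore_index=-100):
    logits, labels = logits[:, :-1, :], labels[:, 1:]
    logp = F.log_softmax(logits, dim=-1)
    valid = labels != ignore_index
    safe = labels.masked_fill(~valid, 0)
    token_logp = logp.gather(-1, safe.unsqueeze(-1)).squeeze(-1)
    weight = token_logp.exp().detach() * valid   # sg(p) rescales each token
    return -(weight * token_logp).sum() / valid.sum().clamp(min=1)
```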
- Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models [0.0]
We show that errors are not uniformly distributed but are concentrated at sparse "key tokens" representing critical decision junctions. We propose a framework for next-generation systems centered on selective preservation of semantically vital tokens.
arXiv Detail & Related papers (2025-05-30T03:57:31Z)
- Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack [44.205352310633174]
Large Language Models (LLMs) are increasingly evaluated on multiple-choice question answering (MCQA) tasks. We propose a solution: the *prefilling attack*, a structured natural-language prefix (e.g., "*The correct option is:*") prepended to the model output. Our findings suggest that prefilling is a simple, robust, and low-cost method to enhance the reliability of first-token-probability (FTP) based evaluation in multiple-choice settings.
arXiv Detail & Related papers (2025-05-21T09:58:38Z)
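The attack itself is a few lines, assuming a Hugging-Face-style causal LM and tokenizer: append the prefix so the very next token should be the option letter, then compare first-token probabilities over the option letters. The leading-space tokenization of option letters is tokenizer-dependent and an assumption; the prefix string is the one quoted above.

```python
# Sketch of the prefilling attack with an assumed Hugging-Face-style
# interface; option-letter tokenization details vary by tokenizer.
import torch

@torch.no_grad()
def mcqa_first_token(model, tokenizer, question, options=("A", "B", "C", "D")):
    """Prepend the structured prefix so the very next token should be the
    option letter, then compare first-token probabilities of the options."""
    prompt = question + "\nThe correct option is:"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    next_logits = model(ids).logits[0, -1]       # next-token distribution
    option_ids = [tokenizer(" " + o, add_special_tokens=False).input_ids[0]
                  for o in options]
    probs = torch.softmax(next_logits[option_ids], dim=-1)
    return options[int(probs.argmax())]
```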
- Language Model Uncertainty Quantification with Attention Chain [9.093726246465117]
A large language model's (LLM) predictive uncertainty is crucial for judging the reliability of its answers. We propose UQAC, an efficient method that narrows the reasoning space to a tractable size for marginalization. We validate UQAC on multiple reasoning benchmarks with advanced open-source LLMs.
arXiv Detail & Related papers (2025-03-24T21:43:47Z)
- Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detecting text generated by large language models (LLMs). We propose the Perplexity Attention Weighted Network (PAWN), which uses the LLM's last hidden states and token positions to weight a sum of features derived from next-token-distribution metrics across the sequence. PAWN shows competitive, and sometimes better, in-distribution performance than the strongest baselines with a fraction of their trainable parameters.
arXiv Detail & Related papers (2025-01-07T17:00:49Z)
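A hedged sketch of the idea: derive per-position features from the next-token distribution (token log-probability and entropy here, as assumed examples) and pool them with attention weights computed from the LLM's last hidden states. The tiny scoring head is illustrative, not PAWN's actual architecture.

```python
# Illustrative head combining next-token-distribution features with learned
# attention over positions; feature choice and head are assumptions.
import torch
import torch.nn as nn

class PerplexityAttentionHead(nn.Module):
    def __init__(self, d_hidden, n_features=2):
        super().__init__()
        self.attn = nn.Linear(d_hidden, 1)     # position weights from hidden states
        self.score = nn.Linear(n_features, 1)  # human-vs-AI detection logit

    def forward(self, hidden, logits, tokens):
        """hidden: [B, L, d_hidden]; logits: [B, L, V]; tokens: [B, L]."""
        logp = logits[:, :-1].log_softmax(-1)
        tok_lp = logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
        entropy = -(logp.exp() * logp).sum(-1)
        feats = torch.stack([tok_lp, entropy], dim=-1)   # [B, L-1, 2]
        weights = self.attn(hidden[:, :-1]).softmax(dim=1)
        return self.score((weights * feats).sum(dim=1)).squeeze(-1)
```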
- Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process [19.986235452236272]
Supervised Fine-Tuning (SFT) and Preference Optimization (PO) are key processes for aligning Language Models (LMs) with human preferences after pre-training. We introduce Intuitive Fine-Tuning (IFT) to integrate SFT and PO into a single process. IFT performs comparably or even superiorly to SFT and some typical PO methods across several tasks.
arXiv Detail & Related papers (2024-05-20T08:23:28Z)