Related papers: Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy

Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy

URL: http://arxiv.org/abs/2512.21017v1
Date: Wed, 24 Dec 2025 07:24:31 GMT
Title: Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy
Authors: Xiaofeng Shi, Qian Kou, Yuduo Li, Hua Zhou,
Abstract summary: Chain-of-Thought (CoT) component has become significant for complex reasoning tasks.<n>In conventional Supervised Fine-Tuning (SFT), the model could allocate disproportionately more attention to CoT sequences with excessive length.<n>We propose SFTKey, a two-stage training scheme. In the first stage, conventional SFT is applied to ensure proper output format, while in the second stage, only the Key portion is fine-tuned to improve accuracy.
Score: 3.7208575749294392
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the rapid advancement of Large Language Models (LLMs), the Chain-of-Thought (CoT) component has become significant for complex reasoning tasks. However, in conventional Supervised Fine-Tuning (SFT), the model could allocate disproportionately more attention to CoT sequences with excessive length. This reduces focus on the much shorter but essential Key portion-the final answer, whose correctness directly determines task success and evaluation quality. To address this limitation, we propose SFTKey, a two-stage training scheme. In the first stage, conventional SFT is applied to ensure proper output format, while in the second stage, only the Key portion is fine-tuned to improve accuracy. Extensive experiments across multiple benchmarks and model families demonstrate that SFTKey achieves an average accuracy improvement exceeding 5\% over conventional SFT, while preserving the ability to generate correct formats. Overall, this study advances LLM fine-tuning by explicitly balancing CoT learning with additional optimization on answer-relevant tokens.

Related papers

On-Policy Supervised Fine-Tuning for Efficient Reasoning [27.67711115864118]
Large reasoning models (LRMs) are commonly trained with reinforcement learning (RL) to explore long chain-of-thought reasoning.<n>Recent methods add multi-reward objectives to jointly optimize correctness and brevity, but these complex extensions often destabilize training and yield suboptimal trade-offs.<n>We propose a simplified training strategy on-policy SFT, which reduces CoT length by up to 80 while maintaining original accuracy.
arXiv Detail & Related papers (2026-02-13T19:16:39Z)
CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks [96.64597365827046]
We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks.<n>We introduce a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity.<n>We show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks.
arXiv Detail & Related papers (2025-11-01T04:37:01Z)
Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning [18.934789236342244]
Large language models (LLMs) primarily rely on supervised fine-tuning (SFT) to adapt pre-trained models to domain-specific tasks such as mathematical reasoning.<n>Standard SFT uniformly penalizes all tokens, neglecting that only a small subset of critical tokens determines reasoning correctness.<n>We propose Critical Token Fine-tuning (CFT), a simple yet effective approach that updates only tokens identified as functionally indispensable via counterfactual perturbations.
arXiv Detail & Related papers (2025-10-13T03:25:36Z)
Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm [8.405729585427226]
Chain-of-Thought (CoT) and its variants have markedly advanced the reasoning abilities of Large Language Models (LLMs)<n>We propose the Explore-Execute Chain ($E2C$), a structured reasoning framework that decouples reasoning into two distinct phases.
arXiv Detail & Related papers (2025-09-28T15:48:40Z)
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification [61.607788999847564]
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for the Large Language Model (LLM)<n>We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of model.<n>We propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token.
arXiv Detail & Related papers (2025-08-07T17:59:04Z)
R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances problem-solving ability of large language models.<n>CoT incurs substantial inference cost due to long autoregressive trajectories.<n>We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z)
Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridgingSupervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training.<n>We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z)
Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling.<n>We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z)
Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement [22.801244105119025]
We propose new algorithms to improve token-efficient reasoning with small-scale models by effectively trading off accuracy and computation.<n>We first show that the post-SFT model fails to determine the optimal stopping point of the reasoning process, resulting in verbose and repetitive outputs.<n>Experiments on four reasoning benchmarks, MATH500, AMC, AIME24 and OlympiadBench, demonstrate that TS is highly effective compared to s1's budget forcing approach.
arXiv Detail & Related papers (2025-05-12T18:04:39Z)
The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency.<n>UPFT removes the need for labeled data or exhaustive sampling.<n> Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
Quantize What Counts: More for Keys, Less for Values [63.51476878610841]
Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key-Value ( KV) cache.<n>This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models.
arXiv Detail & Related papers (2025-02-20T22:24:27Z)
Functional-level Uncertainty Quantification for Calibrated Fine-tuning on LLMs [38.33520071583312]
Calibrated Fine-Tuning (UQ4CT) captures and calibrates uncertainty over the space of functions that map input prompts to outputs.<n>We implement UQ4CT during the fine-tuning stage via a mixture-of-experts framework that hierarchically decomposes the functional space.<n>Even under distribution shift, UQ4CT maintains superior ECE performance with high accuracy, showcasing improved generalizability.
arXiv Detail & Related papers (2024-10-09T00:09:15Z)
Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process [19.986235452236272]
Supervised Fine-Tuning (SFT) and Preference Optimization (PO) are key processes for aligning Language Models (LMs) with human preferences post pre-training.<n>We introduce Intuitive Fine-Tuning (IFT) to integrate SFT and PO into a single process.<n>IFT performs comparably or even superiorly to SFT and some typical PO methods across several tasks.
arXiv Detail & Related papers (2024-05-20T08:23:28Z)
Prefix Text as a Yarn: Eliciting Non-English Alignment in Foundation Language Model [50.339632513018934]
supervised fine-tuning (SFT) has been a straightforward approach for tailoring the output of foundation large language model (LLM) to specific preferences. We critically examine this hypothesis within the scope of cross-lingual generation tasks. We introduce a novel training-free alignment method named PreTTY, which employs minimal task-related prior tokens.
arXiv Detail & Related papers (2024-04-25T17:19:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.