When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models
- URL: http://arxiv.org/abs/2509.18762v3
- Date: Fri, 03 Oct 2025 01:46:09 GMT
- Title: When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models
- Authors: Yingming Zheng, Hanqi Li, Kai Yu, Lu Chen,
- Abstract summary: Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks.<n>As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data has become a common approach.<n>We systematically investigate how SFT data length influences LLM behavior on short-context tasks.<n>We find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining.
- Score: 16.12256921806929
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks. As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data has become a common approach. While the effects of data length in continued pretraining have been extensively studied, their implications for SFT remain unclear. In this work, we systematically investigate how SFT data length influences LLM behavior on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining. To uncover the underlying mechanisms of this phenomenon, we first decouple and analyze two key components, Multi-Head Attention (MHA) and Feed-Forward Network (FFN), and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge preference bias: long-context SFT promotes contextual knowledge, while short-context SFT favors parametric knowledge, making exclusive reliance on long-context SFT suboptimal. Finally, we demonstrate that hybrid training mitigates this bias, offering explainable guidance for fine-tuning LLMs.
Related papers
- Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably [80.36077974826865]
Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks.<n>We study the behavior of SFT and RFT on an open-source multimodal model, Qwen2.5-VL.<n>Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly on novel tasks but maintains prior knowledge.
arXiv Detail & Related papers (2025-06-30T04:15:01Z) - Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridgingSupervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training.<n>We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z) - LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning [45.30182393918228]
Long Input Fine-Tuning (LIFT) is a novel framework for long-context modeling.<n>LIFT dynamically adapts model parameters based on the long input.<n>Gated Memory is a specialized attention adapter that automatically balances long input memorization and ICL.
arXiv Detail & Related papers (2025-02-20T15:32:24Z) - LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data [19.79929012055293]
LongFaith is a novel pipeline for synthesizing faithful long-context reasoning instruction datasets.<n>By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains.
arXiv Detail & Related papers (2025-02-18T06:40:23Z) - LIFT: Improving Long Context Understanding Through Long Input Fine-Tuning [35.31849814789343]
This paper introduces Long Input Fine-Tuning (LIFT) for long context modeling.<n>LIFT enables efficient processing of lengthy inputs without the computational burden of offline long-context adaptation.<n>The framework is further enhanced by integrating in-context learning and pre-LIFT supervised fine-tuning.
arXiv Detail & Related papers (2024-12-18T09:04:55Z) - Reducing Distraction in Long-Context Language Models by Focused Learning [6.803882766744194]
We propose a novel training method that enhances Large Language Models' ability to discern relevant information.
During fine-tuning with long contexts, we employ a retriever to extract the most relevant segments.
We then introduce an auxiliary contrastive learning objective to explicitly ensure that outputs from the original context and the retrieved sub-context are closely aligned.
arXiv Detail & Related papers (2024-11-08T19:27:42Z) - LongReward: Improving Long-context Large Language Models with AI Feedback [54.3321542678909]
LongReward is a novel method that provides rewards for long-context model responses from four human-valued dimensions.
Our experiments indicate that LongReward not only significantly improves models' long-context performance but also enhances their ability to follow short instructions.
arXiv Detail & Related papers (2024-10-28T17:50:42Z) - Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models [62.698520962933195]
Large Vision-Language Models (LVLMs) excel in cross-model tasks but experience performance declines in long-context reasoning.<n>We propose a novel training-free context pruning method that selectively removes less critical textual information.
arXiv Detail & Related papers (2024-10-25T17:59:09Z) - LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models [61.12177317970258]
LongSkywork is a long-context Large Language Model capable of processing up to 200,000 tokens.
We develop two novel methods for creating synthetic data.
LongSkywork achieves outstanding performance on a variety of long-context benchmarks.
arXiv Detail & Related papers (2024-06-02T03:34:41Z) - Instruction Tuning for Large Language Models: A Survey [52.86322823501338]
We make a systematic review of the literature, including the general methodology of supervised fine-tuning (SFT)<n>We also review the potential pitfalls of SFT along with criticism against it, along with efforts pointing out current deficiencies of existing strategies.
arXiv Detail & Related papers (2023-08-21T15:35:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.