Related papers: Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning

Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning

URL: http://arxiv.org/abs/2502.11284v1
Date: Sun, 16 Feb 2025 21:57:35 GMT
Title: Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning
Authors: Mohit Raghavendra, Junmo Kang, Alan Ritter,
Abstract summary: Post-training of Large Language Models often involves a pipeline of Supervised Finetuning (SFT) followed by Preference Finetuning (PFT)<n>We study how to optimally allocate a fixed training data budget between the two stages.
Score: 18.381178799923514
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Post-training of Large Language Models often involves a pipeline of Supervised Finetuning (SFT) followed by Preference Finetuning (PFT) using methods like Direct Preference Optimization. Both stages require annotated data that are very different in structure and costs. We study how to optimally allocate a fixed training data budget between the two stages, through extensive experiments spanning four diverse tasks, multiple model sizes and various data annotation costs. Our findings reveal that just SFT on the base model dominates performance in low-data regimes ($<1,000$ annotated examples). With larger data-budgets, we observe that a combination of SFT and PFT, often with increasing portions allocated towards preference data yields optimal performance. However, completely eliminating SFT and running PFT directly on the base model yields suboptimal performance, described as the cold start problem on tasks like mathematics. We observe that this is due to the distribution shift arising from using DPO directly on the base model to elicit step-by-step reasoning. This limitation can be effectively addressed by allocating even a small portion ($<10$%) of the budget to SFT first, resulting in performance improvements of $15-20$% on analytical benchmarks like GSM8k. These results provide actionable insights for researchers and practitioners optimizing model development under budget constraints, where high-quality data curation often represents a significant portion of the total costs of model development.

Related papers

InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities [27.09178257629886]
InfiAlign is a scalable and sample-efficient post-training framework for large language models (LLMs)<n>At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning.<n>Our results highlight the effectiveness of combining principled data selection with full-stage post-training.
arXiv Detail & Related papers (2025-08-07T15:34:06Z)
Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting [1.5595148909011116]
Supervised Fine-Tuning (SFT) is a critical step for enhancing the instruction-following capabilities of Large Language Models (LLMs)<n>SFT often leads to a degradation of the model's general abilities, a phenomenon known as catastrophic forgetting.<n>We propose a novel and cost-effective SFT method that effectively mitigates catastrophic forgetting without requiring access to the original SFT data.
arXiv Detail & Related papers (2025-06-11T06:23:50Z)
Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework [10.317740844867913]
We build a simulator based on 472 language model pre-training runs with varying data compositions from the SlimPajama dataset. We observe that even simple acquisition functions can enable principled training decisions across training models from 20M to 1B kernels.
arXiv Detail & Related papers (2025-03-26T22:19:47Z)
The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency. UPFT removes the need for labeled data or exhaustive sampling. Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
Discriminative Finetuning of Generative Large Language Models without Reward Models and Preference Data [61.463946150106054]
Supervised fine-tuning (SFT) followed by preference optimization (PO) has become the standard for improving pretrained large language models (LLMs) We introduce Discriminative Fine-Tuning (DFT), a novel approach that eliminates the need for preference data. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT's effectiveness, achieving performance better than SFT and comparable to if not
arXiv Detail & Related papers (2025-02-25T22:38:55Z)
Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models [12.500777267361102]
We introduce a novel textbfpreference-textbforiented supervised textbffine-textbftuning approach, namely PoFT.<n>The intuition is to boost SFT by imposing a particular preference: textitfavoring the target model over aligned LLMs on the same SFT data.<n>PoFT achieves stable and consistent improvements over the SFT baselines across different training datasets and base models.
arXiv Detail & Related papers (2024-12-17T12:49:14Z)
$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization [54.94545757220999]
$f$-PO is a novel framework that generalizes and extends existing approaches. We conduct experiments on state-of-the-art language models using benchmark datasets.
arXiv Detail & Related papers (2024-10-29T02:11:45Z)
Self-Steering Optimization: Autonomous Preference Optimization for Large Language Models [79.84205827056907]
We present Self-Steering Optimization ($SSO$), an algorithm that autonomously generates high-quality preference data.<n>$SSO$ employs a specialized optimization objective to build a data generator from the policy model itself, which is used to produce accurate and on-policy data.<n>Our evaluation shows that $SSO$ consistently outperforms baselines in human preference alignment and reward optimization.
arXiv Detail & Related papers (2024-10-22T16:04:03Z)
Compute-Constrained Data Selection [77.06528009072967]
We find that many powerful data selection methods are almost never compute-optimal.<n>For compute-optimal training, we find that perplexity and gradient data selection require training-to-selection model size ratios of 5x and 10x, respectively.
arXiv Detail & Related papers (2024-10-21T17:11:21Z)
ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood [14.512464277772194]
Aligned Supervised Fine-Tuning (ASFT) is an effective approach that better aligns Large Language Models with pair-wise datasets. ASFT mitigates the issue where the DPO loss function decreases the probability of generating human-dispreferred data. Extensive experiments demonstrate that ASFT is an effective alignment approach, consistently outperforming existing methods.
arXiv Detail & Related papers (2024-09-14T11:39:13Z)
Improving Large Models with Small models: Lower Costs and Better Performance [81.55672406002715]
We propose Data Shunt$+$ (DS$+$), a general paradigm for collaboration of small and large models. For instance, ChatGPT achieves an accuracy of $94.43%$ on Amazon Product sentiment analysis, and DS$+$ achieves an accuracy of $95.64%$, while the cost has been reduced to only $31.18%$.
arXiv Detail & Related papers (2024-06-15T14:44:43Z)
Triple Preference Optimization: Achieving Better Alignment with Less Data in a Single Step Optimization [35.36615140853107]
Triple Preference Optimization (TPO) is designed to align large language models with three preferences without requiring a separate Supervised Fine-Tuned (SFT) model. We show that TPO achieves superior results compared to models aligned through other methods such as SFT, DPO, KTO, IPO, CPO, and ORPO.
arXiv Detail & Related papers (2024-05-26T20:18:11Z)
Comparative Analysis of Different Efficient Fine Tuning Methods of Large Language Models (LLMs) in Low-Resource Setting [0.0]
We try to push the understanding of different fine-tuning strategies for large language models (LLMs) We compare state-of-the-art methods like vanilla fine-tuning and Pattern-Based Fine-Tuning (PBFT) on pre-trained models across two datasets, COLA and MNLI. Our findings suggest that these alternative strategies can exhibit out-of-domain generalization comparable to that of vanilla FT and PBFT.
arXiv Detail & Related papers (2024-05-21T20:08:52Z)
LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Our method goes beyond surface form cues to identify data that the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
AutoFT: Learning an Objective for Robust Fine-Tuning [60.641186718253735]
Foundation models encode rich representations that can be adapted to downstream tasks by fine-tuning. Current approaches to robust fine-tuning use hand-crafted regularization techniques. We propose AutoFT, a data-driven approach for robust fine-tuning.
arXiv Detail & Related papers (2024-01-18T18:58:49Z)
Functional Graphical Models: Structure Enables Offline Data-Driven Optimization [111.28605744661638]
We show how structure can enable sample-efficient data-driven optimization. We also present a data-driven optimization algorithm that infers the FGM structure itself.
arXiv Detail & Related papers (2024-01-08T22:33:14Z)
DavIR: Data Selection via Implicit Reward for Large Language Models [62.59514469369608]
DavIR is a model-based data selection method for post-training Large Language Models.<n>We show that 6% of Alpaca dataset selected with DavIR can steer both the LLaMA and Gemma model family to produce superior performance compared to the same models trained on the full 52K dataset.
arXiv Detail & Related papers (2023-10-16T07:26:24Z)
Scaled Prompt-Tuning for Few-Shot Natural Language Generation [9.399840807973545]
Large Language Models (LLMs) demonstrate stronger language understanding and generation capabilities. Memory demand and computation cost of fine-tuning LLMs on downstream tasks are non-negligible. We propose a Scaled Prompt-Tuning (SPT) method which surpasses conventional PT with better performance and generalization ability.
arXiv Detail & Related papers (2023-09-13T07:12:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.