Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process
- URL: http://arxiv.org/abs/2405.11870v2
- Date: Tue, 28 May 2024 16:14:58 GMT
- Title: Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process
- Authors: Ermo Hua, Biqing Qi, Kaiyan Zhang, Yue Yu, Ning Ding, Xingtai Lv, Kai Tian, Bowen Zhou
- Abstract summary: We introduce Intuitive Fine-Tuning (IFT) to integrate SFT and Preference Optimization into a single process.
IFT performs comparably to, or even better than, sequential recipes of SFT and some typical Preference Optimization methods.
An explainable Frozen Lake game further validates the effectiveness of IFT for obtaining a competitive policy.
- Score: 26.196705232699884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised Fine-Tuning (SFT) and Preference Optimization (PO) are two fundamental processes for enhancing the capabilities of Language Models (LMs) after pre-training, aligning them better with human preferences. Although SFT excels in training efficiency, PO delivers better alignment, so they are often combined. However, common practice simply applies them sequentially without integrating their optimization objectives, missing the opportunity to bridge their paradigm gap and take the strengths of both. To obtain a unified understanding, we interpret SFT and PO with two sub-processes -- Preference Estimation and Transition Optimization -- defined at token level within the Markov Decision Process (MDP) framework. This modeling shows that SFT is only a specialized case of PO with inferior estimation and optimization. PO evaluates the quality of the model's entire generated answer, whereas SFT only scores predicted tokens based on preceding tokens from target answers. Therefore, SFT overestimates the ability of the model, leading to inferior optimization. Building on this view, we introduce Intuitive Fine-Tuning (IFT) to integrate SFT and Preference Optimization into a single process. IFT captures LMs' intuitive sense of entire answers through a temporal residual connection, while relying solely on a single policy and the same volume of non-preference-labeled data as SFT. Our experiments show that IFT performs comparably to, or even better than, sequential recipes of SFT and some typical Preference Optimization methods across several tasks, particularly those requiring generation, reasoning, and fact-following abilities. An explainable Frozen Lake game further validates the effectiveness of IFT for obtaining a competitive policy.
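The abstract's claim that SFT scores each predicted token given preceding tokens from the target answer, and can therefore overestimate the model, can be illustrated with a toy sketch. This is not the paper's IFT method or its experimental setup: the bigram table `POLICY`, the token names, and all helper functions below are hypothetical and chosen only to make the contrast visible. The sketch compares per-token accuracy under teacher forcing (gold prefix) with the accuracy of the model's own greedy rollout (self-generated prefix).

```python
# Hypothetical toy "policy": P(next token | previous token).
# All probabilities are made up for illustration.
POLICY = {
    "<s>": {"a": 0.4, "x": 0.6},  # the model's first step is wrong
    "a": {"b": 0.9, "x": 0.1},
    "b": {"c": 0.9, "x": 0.1},
    "c": {"d": 0.9, "x": 0.1},
    "d": {"d": 1.0},
    "x": {"x": 1.0},              # once off-track, it stays off-track
}

TARGET = ["a", "b", "c", "d"]


def greedy(prev):
    """Return the most probable next token after `prev`."""
    dist = POLICY[prev]
    return max(dist, key=dist.get)


def teacher_forced_accuracy(target):
    """Score each step with the gold prefix, as in SFT-style scoring."""
    gold_prefix = ["<s>"] + target[:-1]
    hits = sum(greedy(p) == t for p, t in zip(gold_prefix, target))
    return hits / len(target)


def rollout(length):
    """Generate `length` tokens, feeding the model its own outputs."""
    out, prev = [], "<s>"
    for _ in range(length):
        prev = greedy(prev)
        out.append(prev)
    return out


def rollout_accuracy(target):
    """Score the model's entire self-generated answer against the target."""
    generated = rollout(len(target))
    return sum(g == t for g, t in zip(generated, target)) / len(target)


if __name__ == "__main__":
    print("teacher-forced accuracy:", teacher_forced_accuracy(TARGET))
    print("rollout:", rollout(len(TARGET)))
    print("rollout accuracy:", rollout_accuracy(TARGET))
```

In this toy setting the teacher-forced score looks favorable because the gold prefix hides the model's early mistake, while the free-running rollout never recovers from it. The real gap in an LM depends on the model and data; the sketch only shows why the two ways of scoring can disagree.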
Related papers
- PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning [17.73193523921637]
Large language models (LLMs) have shown remarkable abilities in diverse natural language processing (NLP) tasks.
LLMs generally undergo supervised fine-tuning (SFT) followed by preference alignment to be usable in downstream applications.
This paper introduces PAFT, a new PArallel training paradigm for effective LLM Fine-Tuning.
arXiv Detail & Related papers (2024-06-25T20:11:37Z) - Discovering Preference Optimization Algorithms with and for Large Language Models [50.843710797024805]
Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs.
We perform objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention.
Experiments demonstrate the state-of-the-art performance of DiscoPOP, a novel algorithm that adaptively blends logistic and exponential losses.
arXiv Detail & Related papers (2024-06-12T16:58:41Z) - Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z) - SpaFL: Communication-Efficient Federated Learning with Sparse Models and Low computational Overhead [75.87007729801304]
SpaFL, a communication-efficient FL framework, is proposed to optimize sparse model structures with low computational overhead.
Experiments show that SpaFL improves accuracy while requiring much less communication and computing resources compared to sparse baselines.
arXiv Detail & Related papers (2024-06-01T13:10:35Z) - Triple Preference Optimization: Achieving Better Alignment with Less Data in a Single Step Optimization [35.36615140853107]
Triple Preference Optimization (TPO) is designed to align large language models with three preferences without requiring a separate Supervised Fine-Tuned (SFT) model.
We show that TPO achieves superior results compared to models aligned through other methods such as SFT, DPO, KTO, IPO, CPO, and ORPO.
arXiv Detail & Related papers (2024-05-26T20:18:11Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - HFT: Half Fine-Tuning for Large Language Models [42.60438623804577]
Fine-tuning large language models (LLMs) in one or more phases has become a necessary step to unlock various capabilities.
In this paper, we find that by regularly resetting partial parameters, LLMs can restore some of the original knowledge.
We introduce Half Fine-Tuning (HFT) for LLMs, as a substitute for full fine-tuning (FFT), to mitigate the forgetting issues.
arXiv Detail & Related papers (2024-04-29T07:07:58Z) - Prefix Text as a Yarn: Eliciting Non-English Alignment in Foundation Language Model [50.339632513018934]
Supervised fine-tuning (SFT) has been a straightforward approach for tailoring the output of a foundation large language model (LLM) to specific preferences.
We critically examine this hypothesis within the scope of cross-lingual generation tasks.
We introduce a novel training-free alignment method named PreTTY, which employs minimal task-related prior tokens.
arXiv Detail & Related papers (2024-04-25T17:19:36Z) - Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z) - Optimization-Free Test-Time Adaptation for Cross-Person Activity Recognition [30.350005654271868]
Test-Time Adaptation aims to utilize the test stream to adjust predictions in real-time inference.
High computational cost makes it intractable to run on resource-constrained edge devices.
We propose an Optimization-Free Test-Time Adaptation framework for sensor-based HAR.
arXiv Detail & Related papers (2023-10-28T02:20:33Z) - Federated Bayesian Optimization via Thompson Sampling [33.087439644066876]
This paper presents federated Thompson sampling (FTS) which overcomes a number of key challenges of FBO and FL in a principled way.
We empirically demonstrate the effectiveness of FTS in terms of communication efficiency, computational efficiency, and practical performance.
arXiv Detail & Related papers (2020-10-20T09:42:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.