Federated Fine-Tuning of Large Language Models: Kahneman-Tversky vs. Direct Preference Optimization
- URL: http://arxiv.org/abs/2502.14187v1
- Date: Thu, 20 Feb 2025 01:44:21 GMT
- Title: Federated Fine-Tuning of Large Language Models: Kahneman-Tversky vs. Direct Preference Optimization
- Authors: Fernando Spadea, Oshani Seneviratne,
- Abstract summary: We evaluate Kahneman-Tversky Optimization (KTO) as a fine-tuning method for large language models (LLMs) in federated learning (FL) settings.
In both its original (KTOO) and redistributed (KTOR) configurations, KTO consistently outperforms DPO across all benchmarks.
These findings establish KTO as a robust and scalable fine-tuning method for FL, motivating its adoption for privacy-preserving, decentralized, and heterogeneous environments.
- Score: 49.88778604259453
- License:
- Abstract: We evaluate Kahneman-Tversky Optimization (KTO) as a fine-tuning method for large language models (LLMs) in federated learning (FL) settings, comparing it against Direct Preference Optimization (DPO). Using Alpaca-7B as the base model, we fine-tune on a realistic dataset under both methods and evaluate performance using MT-Bench-1, Vicuna, and AdvBench benchmarks. Additionally, we introduce a redistributed dataset setup, where only KTO is applicable due to its ability to handle single-response feedback, unlike DPO's reliance on paired responses. Our results demonstrate that KTO, in both its original (KTOO) and redistributed (KTOR) configurations, consistently outperforms DPO across all benchmarks. In the redistributed setup, KTO further validates its flexibility and resilience by maintaining superior performance in scenarios where DPO cannot be applied. These findings establish KTO as a robust and scalable fine-tuning method for FL, motivating its adoption for privacy-preserving, decentralized, and heterogeneous environments.
Related papers
- Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator [32.05337749590184]
We develop a novel PO framework that provides theoretical guidance to effectively sample dispreferred completions.
We then select contrastive divergence (CD) as sampling strategy, and propose a novel MC-PO algorithm.
OnMC-PO outperforms existing SOTA baselines, and OnMC-PO leads to further improvement.
arXiv Detail & Related papers (2025-02-06T23:45:08Z) - Calibrated Multi-Preference Optimization for Aligning Diffusion Models [92.90660301195396]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models.
CaPO incorporates the general preference from multiple reward models without human annotated data.
Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z) - TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights [73.9088920210495]
We propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward.
TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks.
arXiv Detail & Related papers (2024-10-06T04:03:00Z) - ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood [14.512464277772194]
Aligned Supervised Fine-Tuning (ASFT) is an effective approach that better aligns Large Language Models with pair-wise datasets.
ASFT mitigates the issue where the DPO loss function decreases the probability of generating human-dispreferred data.
Extensive experiments demonstrate that ASFT is an effective alignment approach, consistently outperforming existing methods.
arXiv Detail & Related papers (2024-09-14T11:39:13Z) - Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - DPO Meets PPO: Reinforced Token Optimization for RLHF [35.638723885233475]
We introduce an algorithm that learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal.
Experiments demonstrate that textttRTO performs better than PPO and other direct preference learning algorithms.
arXiv Detail & Related papers (2024-04-29T17:58:30Z) - Token-level Direct Preference Optimization [8.249403373337024]
Fine-tuning pre-trained Large Language Models is essential to align them with human values and intentions.
We introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level.
arXiv Detail & Related papers (2024-04-18T08:49:38Z) - Fine-Tuning Language Models with Advantage-Induced Policy Alignment [80.96507425217472]
We propose a novel algorithm for aligning large language models to human preferences.
We show that it consistently outperforms PPO in language tasks by a large margin.
We also provide a theoretical justification supporting the design of our loss function.
arXiv Detail & Related papers (2023-06-04T01:59:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.