FuseRL: Dense Preference Optimization for Heterogeneous Model Fusion
- URL: http://arxiv.org/abs/2504.06562v2
- Date: Thu, 17 Apr 2025 09:49:43 GMT
- Title: FuseRL: Dense Preference Optimization for Heterogeneous Model Fusion
- Authors: Longguang Zhong, Fanqi Wan, Ziyi Yang, Guosheng Liang, Tianyuan Shi, Xiaojun Quan
- Abstract summary: We propose a two-stage framework comprising FuseSFT and FusePO to maximize the utilization of source LLMs. FusePO optimizes weighted preferences based on the outputs of multiple source models to enable superior alignment performance. Our approach achieves state-of-the-art performance among 8B LLMs on the AlpacaEval-2 and Arena-Hard benchmarks.
- Score: 33.5714726499406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Heterogeneous model fusion enhances the performance of LLMs by integrating the knowledge and capabilities of multiple structurally diverse models. However, existing approaches often rely solely on selecting the best output for each prompt from source models, which underutilizes their full potential due to limited source knowledge and results in sparse optimization signals. To address this limitation, we propose FuseRL, a novel two-stage framework comprising FuseSFT and FusePO to maximize the utilization of source LLMs. FuseSFT establishes a robust initialization by integrating the strengths of heterogeneous source models through weighted supervised fine-tuning (SFT) on diverse outputs for each prompt. FusePO optimizes weighted preferences based on the outputs of multiple source models to enable superior alignment performance. Extensive experiments demonstrate the effectiveness of our framework across various preference alignment methods, including RLOO, DPO, and SimPO. Using Llama-3.1-8B-Instruct as the target model, our approach achieves state-of-the-art performance among 8B LLMs on the AlpacaEval-2 and Arena-Hard benchmarks. Further analysis suggests that FuseSFT regularizes the training process to reduce overfitting, while FusePO introduces dense and diverse signals for preference optimization.
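Below is a minimal, illustrative sketch of the two stages described in the abstract, assuming sequence-level log-probabilities and per-output quality weights are already available; the function names, the weighting scheme, and the DPO-style form used for FusePO are assumptions for illustration, not the paper's exact objectives.

```python
import torch
import torch.nn.functional as F

def fusesft_loss(target_logps, weights):
    """Stage 1 (FuseSFT, sketched): weighted SFT over the outputs that several
    heterogeneous source models produced for the same prompt.

    target_logps: (n_sources,) target-model log-probs of each source output
    weights:      (n_sources,) non-negative quality weights for those outputs
    """
    weights = weights / weights.sum()          # normalize per prompt
    return -(weights * target_logps).sum()     # weighted negative log-likelihood

def fusepo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, pair_weights, beta=0.1):
    """Stage 2 (FusePO, sketched): weighted DPO-style preference loss over
    chosen/rejected pairs drawn from multiple source-model outputs.

    logp_* / ref_logp_*: (n_pairs,) policy / reference log-probs of the chosen
    (w) and rejected (l) responses; pair_weights: (n_pairs,) per-pair weights.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    pair_weights = pair_weights / pair_weights.sum()
    return (pair_weights * -F.logsigmoid(margin)).sum()

# Toy usage with fake log-probabilities: 3 source outputs, 2 preference pairs.
print(fusesft_loss(torch.tensor([-12.0, -15.0, -20.0]), torch.tensor([0.5, 0.3, 0.2])))
print(fusepo_loss(torch.tensor([-10.0, -11.0]), torch.tensor([-14.0, -13.0]),
                  torch.tensor([-11.0, -11.5]), torch.tensor([-13.0, -12.5]),
                  torch.tensor([0.6, 0.4])))
```

The point mirrored here is the abstract's claim about signal density: every source output contributes a weighted gradient signal, rather than only the single best output per prompt.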
Related papers
- Divergence Minimization Preference Optimization for Diffusion Model Alignment [58.651951388346525]
Divergence Minimization Preference Optimization (DMPO) is a principled method for aligning diffusion models by minimizing the reverse KL divergence.
Our results show that diffusion models fine-tuned with DMPO can consistently outperform or match existing techniques.
DMPO unlocks a robust and elegant pathway for preference alignment, bridging principled theory with practical performance in diffusion models.
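For orientation, aligning a model by minimizing a reverse KL divergence is commonly written as the KL-regularized objective below; this is the generic textbook form, stated as context rather than DMPO's exact derivation for diffusion models.

$\min_{\theta}\ \mathrm{KL}\big(p_{\theta}\,\|\,p^{*}\big), \qquad p^{*}(x) \propto p_{\mathrm{ref}}(x)\,\exp\big(r(x)/\beta\big),$

where $p_{\theta}$ is the model being fine-tuned, $p_{\mathrm{ref}}$ the reference model, $r$ a preference-derived reward, and $\beta$ the regularization strength.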
arXiv Detail & Related papers (2025-07-10T07:57:30Z) - Composable Cross-prompt Essay Scoring by Merging Models [7.5702468122067685]
Cross-prompt automated essay scoring typically trains models jointly on all source prompts.
We propose a source-free adaptation approach that selectively merges individually trained source models' parameters instead of datasets.
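A rough sketch of what merging individually trained source models' parameters (rather than their datasets) can look like; the weighted-average rule and the helper name below are assumptions, since the summary does not spell out the selection or merging criterion.

```python
import torch

def merge_state_dicts(source_state_dicts, weights):
    """Merge several models' parameters by a weighted average, instead of
    retraining on the source datasets (source-free adaptation, sketched)."""
    assert len(source_state_dicts) == len(weights)
    merged = {}
    for name in source_state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, source_state_dicts))
    return merged

# Toy usage: two "source models" that are just single linear layers.
a = torch.nn.Linear(4, 2).state_dict()
b = torch.nn.Linear(4, 2).state_dict()
merged = merge_state_dicts([a, b], weights=[0.7, 0.3])
print({k: v.shape for k, v in merged.items()})
```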
arXiv Detail & Related papers (2025-05-24T06:28:21Z) - InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models [36.27704594180795]
InfiFPO is a preference optimization method for implicit model fusion.
It enables the pivot model to align with human preferences while effectively distilling knowledge from source models.
It significantly improves the pivot model's capabilities in mathematics, coding, and reasoning tasks.
arXiv Detail & Related papers (2025-05-20T03:32:37Z) - Preference Optimization for Combinatorial Optimization Problems [54.87466279363487]
Reinforcement Learning (RL) has emerged as a powerful tool for neural optimization, enabling models to learn to solve complex problems without requiring expert knowledge.
Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast action spaces.
We propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling.
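As a hedged illustration of turning quantitative reward signals into qualitative preference signals, the sketch below samples pairs of candidate solutions and labels a winner with a Bradley-Terry-style probability; the sampling scheme and temperature are placeholders, not the paper's specific statistical comparison model.

```python
import math
import random

def rewards_to_preferences(solutions, rewards, temperature=1.0, n_pairs=100):
    """Convert scalar rewards for candidate solutions into pairwise
    (winner, loser) preference labels via probabilistic comparison."""
    prefs = []
    for _ in range(n_pairs):
        i, j = random.sample(range(len(solutions)), 2)
        # Bradley-Terry-style probability that solution i is preferred over j.
        p_i = 1.0 / (1.0 + math.exp(-(rewards[i] - rewards[j]) / temperature))
        winner, loser = (i, j) if random.random() < p_i else (j, i)
        prefs.append((solutions[winner], solutions[loser]))
    return prefs

# Toy usage: four candidate tours for a routing problem with made-up rewards.
pairs = rewards_to_preferences(["tour_a", "tour_b", "tour_c", "tour_d"],
                               rewards=[1.2, 0.4, 0.9, 2.0])
print(pairs[:3])
```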
arXiv Detail & Related papers (2025-05-13T16:47:00Z) - Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes [54.93980123979578]
We introduce Latent Preference Coding (LPC), a novel framework that models the implicit factors, as well as their combinations, behind holistic preferences.
LPC seamlessly integrates with various offline alignment algorithms, automatically inferring the underlying factors and their importance from data.
arXiv Detail & Related papers (2025-05-08T06:59:06Z) - Preference-Guided Diffusion for Multi-Objective Offline Optimization [64.08326521234228]
We propose a preference-guided diffusion model for offline multi-objective optimization.
Our guidance is a preference model trained to predict the probability that one design dominates another.
Our results highlight the effectiveness of classifier-guided diffusion models in generating diverse and high-quality solutions.
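A small sketch of the guidance component described above: a preference model that predicts the probability that one design dominates another, whose gradient could then steer a diffusion sampler toward preferred designs. The network architecture, design dimensionality, and guidance step are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DominanceModel(nn.Module):
    """Predicts P(design_a dominates design_b) from the concatenated pair."""
    def __init__(self, design_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * design_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, a, b):
        return torch.sigmoid(self.net(torch.cat([a, b], dim=-1)))

# Toy usage: gradient that pushes design `a` toward dominating reference `b`.
model = DominanceModel(design_dim=8)
a = torch.randn(1, 8, requires_grad=True)
b = torch.randn(1, 8)
model(a, b).log().sum().backward()   # log-preference gradient w.r.t. the design
print(a.grad.shape)                  # this direction would guide the sampler
```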
arXiv Detail & Related papers (2025-03-21T16:49:38Z) - Make Optimization Once and for All with Fine-grained Guidance [78.14885351827232]
Learning to Optimize (L2O) enhances optimization efficiency with integrated neural networks.
L2O paradigms achieve great outcomes, e.g., refitting and generating unseen solutions iteratively or directly.
Our analyses explore a general framework for learning to optimize, called Diff-L2O, focusing on augmenting solutions from a wider view.
arXiv Detail & Related papers (2025-03-14T14:48:12Z) - DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models [50.32663816994459]
Diffusion-styled Preference Optimization (DiffPO) provides an efficient and policy-agnostic solution for aligning LLMs with humans.
DiffPO avoids the time latency associated with token-level generation.
Experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that DiffPO achieves superior alignment performance across various settings.
arXiv Detail & Related papers (2025-03-06T09:21:54Z) - Calibrated Multi-Preference Optimization for Aligning Diffusion Models [92.90660301195396]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models.
CaPO incorporates the general preference from multiple reward models without human annotated data.
Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z) - InfiFusion: A Unified Framework for Enhanced Cross-Model Reasoning via LLM Fusion [35.98702433016698]
InfiFusion is an efficient training pipeline designed to integrate domain-specialized Large Language Models (LLMs) into a single pivot model.
We propose two fusion strategies: Pairwise Fusion (InfiFusion$_p$) and Unified Fusion (InfiFusion$_u$).
InfiFusion outperforms state-of-the-art models, such as Qwen-2.5-14B-Instruct and Phi-4, across 11 widely applied benchmarks.
arXiv Detail & Related papers (2025-01-06T06:29:55Z) - Weighted-Reward Preference Optimization for Implicit Model Fusion [35.57286356489511]
We propose Weighted-Reward Preference Optimization (WRPO), an implicit fusion method that leverages preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively.
WRPO eliminates the need for vocabulary alignment and matrix fusion and can be efficiently scaled to accommodate various LLMs.
Experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks demonstrate that WRPO consistently outperforms existing knowledge fusion methods.
arXiv Detail & Related papers (2024-12-04T10:15:12Z) - Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs.
Specifically, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset.
We explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance.
arXiv Detail & Related papers (2024-11-15T18:59:27Z) - Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better across various preference data, regardless of data scarcity or abundance.
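A hedged sketch of a DPO-style loss whose reference term aggregates several reference models; the weighted log-average combination below is an assumption made for illustration and may differ from MRPO's closed-form formulation.

```python
import torch
import torch.nn.functional as F

def multi_reference_dpo_loss(logp_w, logp_l, ref_logps_w, ref_logps_l,
                             ref_weights, beta=0.1):
    """DPO-style loss with the reference log-probs pooled over several models.

    logp_w, logp_l:           (n,) policy log-probs of chosen / rejected responses
    ref_logps_w, ref_logps_l: (n_refs, n) log-probs under each reference model
    ref_weights:              (n_refs,) convex weights over the reference models
    """
    ref_w = (ref_weights[:, None] * ref_logps_w).sum(dim=0)  # weighted log-average
    ref_l = (ref_weights[:, None] * ref_logps_l).sum(dim=0)
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -F.logsigmoid(margin).mean()

# Toy usage: two reference models, three preference pairs, fake log-probs.
print(multi_reference_dpo_loss(
    torch.tensor([-10.0, -9.0, -12.0]), torch.tensor([-13.0, -11.0, -14.0]),
    torch.tensor([[-11.0, -9.5, -12.5], [-10.5, -10.0, -13.0]]),
    torch.tensor([[-12.5, -11.5, -13.5], [-13.0, -11.0, -14.5]]),
    ref_weights=torch.tensor([0.5, 0.5])))
```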
arXiv Detail & Related papers (2024-05-26T00:29:04Z) - Expensive Multi-Objective Bayesian Optimization Based on Diffusion Models [17.19004913553654]
Multi-objective Bayesian optimization (MOBO) has shown promising performance on various expensive multi-objective optimization problems (EMOPs).
We propose a novel Composite Diffusion Model based Pareto Set Learning algorithm, namely CDM-PSL, for expensive MOBO.
Our proposed algorithm attains superior performance compared with various state-of-the-art MOBO algorithms.
arXiv Detail & Related papers (2024-05-14T14:55:57Z) - End-to-End Stochastic Optimization with Energy-Based Model [18.60842637575249]
Decision-focused learning (DFL) was recently proposed for stochastic optimization problems that involve unknown parameters.
We propose SO-EBM, a general and efficient DFL method for stochastic optimization using energy-based models.
arXiv Detail & Related papers (2022-11-25T00:14:12Z) - Leveraging Trust for Joint Multi-Objective and Multi-Fidelity Optimization [0.0]
This paper investigates a novel approach to Bayesian multi-objective and multi-fidelity (MOMF) optimization.
We suggest the innovative use of a trust metric to support simultaneous optimization of multiple objectives and data sources.
Our methods offer broad applicability in solving simulation problems in fields such as plasma physics and fluid dynamics.
arXiv Detail & Related papers (2021-12-27T20:55:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.