Related papers: $H^3$Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs

$H^3$Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs

URL: http://arxiv.org/abs/2411.17792v1
Date: Tue, 26 Nov 2024 17:42:38 GMT
Title: $H^3$Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs
Authors: Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Zachary Yahn, Ling Liu,
Abstract summary: Alignment of pretrained LLMs using instruction-based datasets is critical for creating fine-tuned models that reflect human preference.<n>This paper develops an alignment fusion approach, coined as $H3$Fusion, with three unique characteristics.<n>It outperforms each individually aligned model by $11.37%$, and it provides stronger robustness compared to the state-of-the-art LLM ensemble approaches by $13.77%$.
Score: 7.498844064516196
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Alignment of pretrained LLMs using instruction-based datasets is critical for creating fine-tuned models that reflect human preference. A growing number of alignment-based fine-tuning algorithms and benchmarks emerged recently, fueling the efforts on effective alignments of pre-trained LLMs to ensure helpful, harmless, and honest answers from both open-source and closed-source LLMs. This paper tackles this problem by developing an alignment fusion approach, coined as $H^3$Fusion, with three unique characteristics. First, $H^3$Fusion ensembles multiple individually aligned LLMs to create a final fine-tuned alignment model with enhanced capabilities beyond those of individual models, delivering robust alignment through promoting helpful, harmless, honest fusion. Second, $H^3$Fusion leverages the mixture-of-experts (MoE) methodology in two steps. We first freeze the multi-head attention weights of each individual model while tuning the FFN layer during alignment fusion. Then we merge the aligned model weights with an expert router according to the type of input instruction and dynamically select a subset of experts that are best suited for producing the output response. Finally, we boost the performance of the resulting $H^3$3Fusion model by introducing gating loss and regularization terms. The former penalizes the selection errors of the expert-router, and the latter mediates the expert weights drifting during fine-tuning and dynamically adjusts the fusion behavior of the resulting model by canalizing the activations on the experts. Extensive evaluations on three benchmark datasets show that $H^3$3Fusion is more helpful, less harmful, and more honest from two aspects: it outperforms each individually aligned model by $11.37\%$, and it provides stronger robustness compared to the state-of-the-art LLM ensemble approaches by $13.77\%$. Code is available at github.com/sftekin/h3fusion.

Related papers

SRMIR: Shadow Reward Models Based on Introspective Reasoning for LLM Alignment [0.0]
SRMIR (Shadow Reward Models Based on Introspective Reasoning) is inspired by shadow models in membership inference attacks. We apply two strategies, linear combination and categorized approach, to integrate shadow reward models for policy optimization.
arXiv Detail & Related papers (2025-03-23T16:40:29Z)
Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation [15.47711837051754]
We propose Mixup Model Merge, an innovative approach inspired by the Mixup data augmentation technique. M$3$ is a simple yet effective model merging method that significantly enhances the performance of the merged model.
arXiv Detail & Related papers (2025-02-21T13:01:26Z)
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing $mathbf120k$ fine-grained, human-annotated preference comparison pairs. We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms. Our approach is rigorously evaluated across $mathbf10$ distinct dimensions and $mathbf27$ benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z)
InfiFusion: A Unified Framework for Enhanced Cross-Model Reasoning via LLM Fusion [35.98702433016698]
InfiFusion is an efficient training pipeline designed to integrate domain-specialized Large Language Models (LLMs) into a single pivot model. We propose two fusion strategies: Pairwise Fusion (InfiFusion$_p$) and Unified Fusion (InfiFusion$_u$) InfiFusion outperforms the state-of-the-art models, such as Qwen-2.5-14B-Instruct and Phi-4, across 11 widely applied benchmarks.
arXiv Detail & Related papers (2025-01-06T06:29:55Z)
S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity [39.679861450783605]
We propose a family of Structured Sparse Fine-Tuning (S$2$FT) methods for LLMs. S$2$FT accomplishes this by "selecting sparsely and computing densely" We show that S$2$FT saves training memory up to 3$times$ and improves latency by 1.5-2.7$times$ compared to full FT.
arXiv Detail & Related papers (2024-12-09T08:24:11Z)
Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement [72.97553348776425]
We make a pioneering effort to broaden the applicability of merging techniques from FT to PT LLMs. We introduce an approach based on WeIght DisENtanglement (WIDEN) to effectively extend the merging scope. We merge Qwen1.5-Chat (an FT LLM with instruction-following skills) with Sailor (a PT LLM with multilingual abilities) across 7B and 14B model scales.
arXiv Detail & Related papers (2024-08-06T10:46:46Z)
Cool-Fusion: Fuse Large Language Models without Training [73.17551121242602]
emphCool-Fusion is a method that does not require any type of training like the ensemble approaches. emphCool-Fusion increases accuracy from three strong source LLMs by a significant 8%-17.8%.
arXiv Detail & Related papers (2024-07-29T09:02:19Z)
Decoding-Time Language Model Alignment with Multiple Objectives [116.42095026960598]
Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives. Here, we propose $textbfmulti-objective decoding (MOD)$, a decoding-time algorithm that outputs the next token from a linear combination of predictions. We show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method.
arXiv Detail & Related papers (2024-06-27T02:46:30Z)
Model Merging and Safety Alignment: One Bad Model Spoils the Bunch [70.614652904151]
Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model. Current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. We evaluate several popular model merging techniques, demonstrating that existing methods do not only transfer domain expertise but also propagate misalignment.
arXiv Detail & Related papers (2024-06-20T17:59:58Z)
CURATRON: Complete and Robust Preference Data for Rigorous Alignment of Large Language Models [1.6339731044538859]
This paper addresses the challenges of aligning large language models with human values via preference learning. We propose a novel method for robustly and maliciously manipulated AI pipeline datasets to enhance LLMs' resilience.
arXiv Detail & Related papers (2024-03-05T07:58:12Z)
Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $tilde O(|S||B|(gamma)-3epsilon-2)$ for finding the Nash equilibrium (NE) value up to some $epsilon$ error. We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.