$H^3$Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs
- URL: http://arxiv.org/abs/2411.17792v1
- Date: Tue, 26 Nov 2024 17:42:38 GMT
- Title: $H^3$Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs
- Authors: Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Zachary Yahn, Ling Liu
- Abstract summary: Alignment of pretrained LLMs using instruction-based datasets is critical for creating fine-tuned models that reflect human preference. This paper develops an alignment fusion approach, coined as $H^3$Fusion, with three unique characteristics. It outperforms each individually aligned model by $11.37\%$, and it provides stronger robustness compared to the state-of-the-art LLM ensemble approaches by $13.77\%$.
- Score: 7.498844064516196
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Alignment of pretrained LLMs using instruction-based datasets is critical for creating fine-tuned models that reflect human preference. A growing number of alignment-based fine-tuning algorithms and benchmarks emerged recently, fueling the efforts on effective alignments of pre-trained LLMs to ensure helpful, harmless, and honest answers from both open-source and closed-source LLMs. This paper tackles this problem by developing an alignment fusion approach, coined as $H^3$Fusion, with three unique characteristics. First, $H^3$Fusion ensembles multiple individually aligned LLMs to create a final fine-tuned alignment model with enhanced capabilities beyond those of individual models, delivering robust alignment through promoting helpful, harmless, honest fusion. Second, $H^3$Fusion leverages the mixture-of-experts (MoE) methodology in two steps. We first freeze the multi-head attention weights of each individual model while tuning the FFN layer during alignment fusion. Then we merge the aligned model weights with an expert router according to the type of input instruction and dynamically select a subset of experts that are best suited for producing the output response. Finally, we boost the performance of the resulting $H^3$Fusion model by introducing gating loss and regularization terms. The former penalizes the selection errors of the expert-router, and the latter mediates the expert weights drifting during fine-tuning and dynamically adjusts the fusion behavior of the resulting model by canalizing the activations on the experts. Extensive evaluations on three benchmark datasets show that $H^3$Fusion is more helpful, less harmful, and more honest from two aspects: it outperforms each individually aligned model by $11.37\%$, and it provides stronger robustness compared to the state-of-the-art LLM ensemble approaches by $13.77\%$. Code is available at github.com/sftekin/h3fusion.
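The expert routing and gating loss described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`route_top_k`, `fuse`, `gating_loss`), the top-k renormalization, and the cross-entropy form of the gating loss are all hypothetical, since the abstract does not give the exact formulation. The three experts stand in for the helpful, harmless, and honest aligned models.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return the full router distribution and renormalized top-k weights.

    Assumption: the router scores every expert, then only the k most
    probable experts are kept and their weights are rescaled to sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    weights = {i: probs[i] / norm for i in top}
    return probs, weights


def fuse(expert_outputs, weights):
    """Weighted sum of the selected experts' output vectors."""
    dim = len(expert_outputs[0])
    fused = [0.0] * dim
    for i, w in weights.items():
        for d in range(dim):
            fused[d] += w * expert_outputs[i][d]
    return fused


def gating_loss(probs, target_expert):
    """Assumed gating loss: cross-entropy between the router distribution
    and the expert that matches the instruction type."""
    return -math.log(probs[target_expert])


# Hypothetical example: experts 0/1/2 = helpful/harmless/honest.
router_logits = [2.0, 0.5, -1.0]
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

probs, weights = route_top_k(router_logits, k=2)
fused = fuse(expert_outputs, weights)
loss = gating_loss(probs, target_expert=0)
```

With these logits the router keeps experts 0 and 1 and ignores expert 2, and the loss is smallest when the labeled instruction type matches the expert the router already prefers.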
Related papers
- Rethinking LoRA for Privacy-Preserving Federated Learning in Large Models [14.755143405057929]
Fine-tuning large vision models (LVMs) and large language models (LLMs) under differentially private federated learning (DPFL) is hindered by a fundamental privacy-utility trade-off. Low-Rank Adaptation (LoRA), a promising parameter-efficient fine-tuning (PEFT) method, reduces computational and communication costs by introducing two trainable low-rank matrices while freezing pre-trained weights. We propose LA-LoRA, a novel approach that decouples gradient interactions and aligns update directions across clients to enhance robustness under stringent privacy constraints.
arXiv Detail & Related papers (2026-02-23T15:05:28Z) - MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning [16.012761588513026]
Reinforcement Learning with Verifiable Rewards (RLVR) algorithms rely on rigid, uniform, and symmetric trust region mechanisms. We propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence.
arXiv Detail & Related papers (2026-02-19T17:05:20Z) - Dual-Branch Center-Surrounding Contrast: Rethinking Contrastive Learning for 3D Point Clouds [55.5576033344795]
We propose a novel Dual-Branch Center-Surrounding Contrast (CSCon) framework for 3D point clouds. Under the FULL and ALL protocols, CSCon achieves performance comparable to generative methods. Our method attains state-of-the-art results, even surpassing cross-modal approaches.
arXiv Detail & Related papers (2025-12-09T14:56:35Z) - DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage. We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps. SLIM is the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z) - Too Helpful, Too Harmless, Too Honest or Just Right? [19.134202394422285]
Large Language Models (LLMs) exhibit strong performance across a wide range of NLP tasks. Aligning their outputs with the principles of Helpfulness, Harmlessness, and Honesty (HHH) remains a persistent challenge. We propose TrinityX, a modular alignment framework that incorporates a Mixture of Calibrated Experts (MoCaE) within the Transformer architecture.
arXiv Detail & Related papers (2025-09-10T10:51:47Z) - SRMIR: Shadow Reward Models Based on Introspective Reasoning for LLM Alignment [0.0]
SRMIR (Shadow Reward Models Based on Introspective Reasoning) is inspired by shadow models in membership inference attacks.
We apply two strategies, linear combination and categorized approach, to integrate shadow reward models for policy optimization.
arXiv Detail & Related papers (2025-03-23T16:40:29Z) - Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. We propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models.
arXiv Detail & Related papers (2025-03-22T14:42:27Z) - Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation [15.47711837051754]
We propose Mixup Model Merge, an innovative approach inspired by the Mixup data augmentation technique.
M$^3$ is a simple yet effective model merging method that significantly enhances the performance of the merged model.
arXiv Detail & Related papers (2025-02-21T13:01:26Z) - MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing $\mathbf{120k}$ fine-grained, human-annotated preference comparison pairs.
We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms.
Our approach is rigorously evaluated across $\mathbf{10}$ distinct dimensions and $\mathbf{27}$ benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z) - Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging [36.00016254809852]
This paper systematically compares the effectiveness of model merging and data mixture methods in constructing 3H-aligned LLMs. We propose a novel Reweighting Enhanced task Singular Merging method, RESM, through outlier weighting and sparsity-aware rank selection strategies.
arXiv Detail & Related papers (2025-02-08T11:56:58Z) - The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility. This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance. We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z) - InfiFusion: A Unified Framework for Enhanced Cross-Model Reasoning via LLM Fusion [35.98702433016698]
InfiFusion is an efficient training pipeline designed to integrate domain-specialized Large Language Models (LLMs) into a single pivot model.
We propose two fusion strategies: Pairwise Fusion (InfiFusion$_p$) and Unified Fusion (InfiFusion$_u$).
InfiFusion outperforms the state-of-the-art models, such as Qwen-2.5-14B-Instruct and Phi-4, across 11 widely applied benchmarks.
arXiv Detail & Related papers (2025-01-06T06:29:55Z) - GMoE: Empowering LLMs Fine-Tuning via MoE Graph Collaboration [39.302800055216764]
We introduce a novel MoE graph-based framework, GMoE, aimed at enhancing the collaboration among multiple experts. In GMoE, a graph router function is designed to capture the collaboration signals among experts. We put forward two coordination strategies in GMoE: the Poisson distribution-based distinction strategy and the Normal distribution-based balance strategy.
arXiv Detail & Related papers (2024-12-18T02:18:57Z) - S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity [39.679861450783605]
We propose a family of Structured Sparse Fine-Tuning (S$^{2}$FT) methods for LLMs.
S$^{2}$FT accomplishes this by "selecting sparsely and computing densely".
We show that S$^{2}$FT saves training memory up to 3$\times$ and improves latency by 1.5-2.7$\times$ compared to full FT.
arXiv Detail & Related papers (2024-12-09T08:24:11Z) - Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement [72.97553348776425]
We make a pioneering effort to broaden the applicability of merging techniques from FT to PT LLMs.
We introduce an approach based on WeIght DisENtanglement (WIDEN) to effectively extend the merging scope.
We merge Qwen1.5-Chat (an FT LLM with instruction-following skills) with Sailor (a PT LLM with multilingual abilities) across 7B and 14B model scales.
arXiv Detail & Related papers (2024-08-06T10:46:46Z) - Cool-Fusion: Fuse Large Language Models without Training [73.17551121242602]
Cool-Fusion is a method that does not require any type of training like the ensemble approaches.
Cool-Fusion increases accuracy from three strong source LLMs by a significant 8%-17.8%.
arXiv Detail & Related papers (2024-07-29T09:02:19Z) - Decoding-Time Language Model Alignment with Multiple Objectives [116.42095026960598]
Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives.
Here, we propose multi-objective decoding (MOD), a decoding-time algorithm that outputs the next token from a linear combination of predictions.
We show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method.
arXiv Detail & Related papers (2024-06-27T02:46:30Z) - Model Merging and Safety Alignment: One Bad Model Spoils the Bunch [70.614652904151]
Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model.
Current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models.
We evaluate several popular model merging techniques, demonstrating that existing methods do not only transfer domain expertise but also propagate misalignment.
arXiv Detail & Related papers (2024-06-20T17:59:58Z) - CURATRON: Complete and Robust Preference Data for Rigorous Alignment of Large Language Models [1.6339731044538859]
This paper addresses the challenges of aligning large language models with human values via preference learning.
We propose a novel method for robustly and maliciously manipulated AI pipeline datasets to enhance LLMs' resilience.
arXiv Detail & Related papers (2024-03-05T07:58:12Z) - Mitigating the Alignment Tax of RLHF [76.4300447532456]
Aligning LLMs under Reinforcement Learning with Human Feedback can lead to forgetting pretrained abilities, also known as the alignment tax.
We propose model averaging to maximize alignment performance while incurring minimal alignment tax.
We validate the performance of HMA (Heterogeneous Model Averaging) across a range of RLHF algorithms over OpenLLaMA-3B and further extend our findings to Mistral-7B.
arXiv Detail & Related papers (2023-09-12T14:16:54Z) - Sparse Conditional Hidden Markov Model for Weakly Supervised Named Entity Recognition [68.68300358332156]
We propose the sparse conditional hidden Markov model (Sparse-CHMM) to evaluate noisy labeling functions.
Sparse-CHMM is optimized through unsupervised learning with a three-stage training pipeline.
It achieves a 3.01 average F1 score improvement on five comprehensive datasets.
arXiv Detail & Related papers (2022-05-27T20:47:30Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.