MidPO: Dual Preference Optimization for Safety and Helpfulness in Large Language Models via a Mixture of Experts Framework
- URL: http://arxiv.org/abs/2506.02460v1
- Date: Tue, 03 Jun 2025 05:23:09 GMT
- Title: MidPO: Dual Preference Optimization for Safety and Helpfulness in Large Language Models via a Mixture of Experts Framework
- Authors: Yupeng Qi, Ziyu Lyu, Min Yang, Yanlin Wang, Lu Bai, Lixin Cui,
- Abstract summary: We propose MidPO, a Mixture of Experts (MoE) framework for safety-helpfulness dual preference optimization. We conduct quantitative and qualitative experiments on three popular datasets to demonstrate that the proposed MidPO significantly outperforms state-of-the-art approaches in both safety and helpfulness.
- Score: 20.141606392837478
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) are increasingly applied across various domains, enhancing safety while maintaining the helpfulness of LLMs has become a critical challenge. Recent studies address this problem through safety-constrained online preference optimization or safety-constrained offline preference optimization. However, the safety-constrained online methods often suffer from excessive safety, which might reduce helpfulness, while the safety-constrained offline methods perform poorly in adaptively balancing safety and helpfulness. To address these limitations, we propose MidPO, a Mixture of Experts (MoE) framework for safety-helpfulness dual Preference Optimization. Firstly, MidPO devises a single-preference enhanced direct preference optimization approach to transform the base model into two independent experts, termed the safety and helpfulness experts, and fine-tunes each expert for optimal safety or helpfulness performance. Secondly, to achieve an effective balance between safety and helpfulness, MidPO incorporates the two experts into the MoE framework and designs a dynamic routing mechanism to allocate contributions from each expert adaptively. We conduct quantitative and qualitative experiments on three popular datasets to demonstrate that the proposed MidPO significantly outperforms state-of-the-art approaches in both safety and helpfulness. The code and models will be released.
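To make the dual-expert design concrete, here is a minimal PyTorch sketch of a layer that adds a safety expert and a helpfulness expert on top of a frozen base projection and mixes them with a learned router. The adapter structure, module names, and dimensions are illustrative assumptions for exposition; they are not taken from the authors' (to-be-released) implementation.

```python
# Minimal sketch of a two-expert MoE layer with a dynamic router, in the
# spirit of MidPO's safety/helpfulness experts. All names and dimensions are
# illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn


class LoRAExpert(nn.Module):
    """Low-rank adapter standing in for one fine-tuned expert."""

    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class DualExpertMoELayer(nn.Module):
    """A frozen base projection plus a safety expert and a helpfulness expert,
    mixed by a router that adaptively weights each expert's contribution."""

    def __init__(self, d_model: int):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        for p in self.base.parameters():        # keep the base weights frozen
            p.requires_grad_(False)
        self.safety_expert = LoRAExpert(d_model)
        self.helpfulness_expert = LoRAExpert(d_model)
        self.router = nn.Linear(d_model, 2)     # per-token routing logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.router(x), dim=-1)   # (..., 2), sums to 1
        mixed = (weights[..., 0:1] * self.safety_expert(x)
                 + weights[..., 1:2] * self.helpfulness_expert(x))
        return self.base(x) + mixed


# Usage: route a batch of hidden states through the dual-expert layer.
layer = DualExpertMoELayer(d_model=16)
hidden = torch.randn(2, 5, 16)                  # (batch, seq_len, d_model)
print(layer(hidden).shape)                      # torch.Size([2, 5, 16])
```

The router's softmax weights play the role of the dynamic routing mechanism: the same layer can lean on the safety expert for risky prompts and on the helpfulness expert elsewhere, which is the adaptive balance the abstract describes.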
Related papers
- SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety [57.14003339251827]
We introduce a new algorithm called SafeDPO, which is designed to directly optimize the safety alignment objective in a single stage of policy learning.
As a result, it eliminates the need to fit separate reward and cost models or to sample from the language model during fine-tuning.
We demonstrate that SafeDPO achieves competitive performance compared to state-of-the-art safety alignment algorithms.
arXiv Detail & Related papers (2025-05-26T14:50:01Z)
- ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization [36.609297811592185]
Ex-Ante Reasoning Preference Optimization (ERPO) is a novel safety alignment framework for large language models.
Our approach consists of three stages: first, equipping the model with Ex-Ante reasoning through supervised fine-tuning (SFT); second, enhancing safety, usefulness, and efficiency via Direct Preference Optimization (DPO); and third, mitigating inference latency with a length-controlled iterative preference optimization strategy.
arXiv Detail & Related papers (2025-04-03T16:07:38Z)
- Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback [34.01716144973483]
Multimodal large language models (MLLMs) are essential for building general-purpose AI assistants.
How can we ensure the safety alignment of MLLMs to prevent undesired behaviors?
In this work, we present Safe RLHF-V, the first multimodal safety alignment framework.
arXiv Detail & Related papers (2025-03-22T07:40:20Z)
- Efficient and Robust Regularized Federated Recommendation [52.24782464815489]
The federated recommender system addresses both user preference and privacy concerns.
We propose a novel method that incorporates non-uniform gradient descent to improve communication efficiency.
Experiments demonstrate RFRecF's superior robustness compared to diverse baselines.
arXiv Detail & Related papers (2024-11-03T12:10:20Z)
- Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization [16.35399722653875]
We propose Rectified Policy Optimization (RePO) to balance helpfulness and safety (harmlessness) in large language models (LLMs).
At the core of RePO is a policy update mechanism driven by rectified policy gradients, which penalizes the strict safety violation of every prompt, thereby enhancing safety across nearly all prompts.
arXiv Detail & Related papers (2024-10-25T19:08:23Z)
- Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models [94.39278422567955]
Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities.
However, ensuring the safety of LLMs during fine-tuning remains a critical concern.
We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
arXiv Detail & Related papers (2024-08-27T17:31:21Z)
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based settings.
arXiv Detail & Related papers (2024-05-29T22:12:52Z)
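To spell out the dualization idea in the entry above, the following is a generic constrained-alignment objective, its Lagrangian, and the dual function that would be pre-optimized. The notation is a standard textbook formulation assumed for illustration; it is not necessarily that paper's exact objective or closed form.

```latex
% Generic constrained alignment: maximize expected reward subject to a
% safety-cost budget b; its Lagrangian; and the dual pre-optimized once.
\begin{aligned}
&\max_{\theta}\ \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_{\theta}(\cdot\mid x)}\!\big[r(x,y)\big]
\quad \text{s.t.} \quad
\mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_{\theta}(\cdot\mid x)}\!\big[c(x,y)\big] \le b, \\[4pt]
&\mathcal{L}(\theta,\lambda)
= \mathbb{E}\big[r(x,y)\big] - \lambda\Big(\mathbb{E}\big[c(x,y)\big] - b\Big),
\qquad \lambda \ge 0, \\[4pt]
&g(\lambda) = \max_{\theta}\ \mathcal{L}(\theta,\lambda),
\qquad
\lambda^{\star} = \arg\min_{\lambda \ge 0}\ g(\lambda).
\end{aligned}
```

Because g(λ) is a pointwise maximum of functions affine in λ, it is convex; once λ* is found, a single unconstrained alignment run on the shifted reward r − λ*·c addresses the constrained problem, which is the sense in which such a strategy can be one-shot.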
- Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching [74.62818936088065]
SafePatching is a novel framework for comprehensive post safety alignment (PSA).
It achieves a more comprehensive PSA than baseline methods and demonstrates its superiority in continual PSA scenarios.
arXiv Detail & Related papers (2024-05-22T16:51:07Z)
- Enhancing LLM Safety via Constrained Direct Preference Optimization [8.22888921018027]
We introduce Constrained DPO (C-DPO), a novel extension of the recently proposed Direct Preference Optimization (DPO) approach for fine-tuning AI systems.
By integrating dual gradient descent and DPO, our method identifies a nearly optimal trade-off between helpfulness and harmlessness without using reinforcement learning.
Empirically, our approach provides a safety guarantee to LLMs that is missing in DPO while achieving significantly higher rewards under the same safety constraint.
arXiv Detail & Related papers (2024-03-04T20:39:24Z)
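As a rough sketch of how dual gradient descent can be paired with DPO-style losses, as in the entry above, the code below alternates a primal update on a λ-weighted sum of helpfulness and safety preference losses with a projected dual update on λ. The batch layout, the safety-cost estimate, and the step sizes are generic assumptions for illustration, not C-DPO's exact algorithm.

```python
# Schematic primal-dual step pairing a DPO-style preference loss with a dual
# update on a safety-cost multiplier. Generic assumptions for illustration
# only; not C-DPO's exact algorithm.
import torch
import torch.nn.functional as F


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on policy/reference log-probabilities of a
    (chosen, rejected) response pair."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()


def primal_dual_step(batch, lam, optimizer, cost_limit=0.0, lam_lr=0.05):
    """Minimize helpfulness loss + lam * safety loss, then update lam by
    projected dual ascent on the estimated constraint violation.

    `batch["helpfulness_pairs"]` and `batch["safety_pairs"]` are assumed to hold
    (policy chosen, policy rejected, ref chosen, ref rejected) log-probabilities
    computed upstream with the current policy, so gradients reach its parameters.
    """
    help_loss = dpo_loss(*batch["helpfulness_pairs"])  # prefer the more helpful response
    safe_loss = dpo_loss(*batch["safety_pairs"])       # prefer the safer response

    optimizer.zero_grad()
    (help_loss + lam * safe_loss).backward()
    optimizer.step()

    # Dual ascent: increase lam while the estimated safety cost exceeds the
    # budget, otherwise let it decay toward zero.
    violation = float(batch["estimated_cost"]) - cost_limit
    return max(0.0, lam + lam_lr * violation)
```

In practice the estimated cost would come from a cost model or safety classifier evaluated on the policy's generations, so that λ grows only while the safety constraint is violated.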
- Constrained Policy Optimization via Bayesian World Models [79.0077602277004]
LAMBDA is a model-based approach for policy optimization in safety-critical tasks modeled via constrained Markov decision processes.
We demonstrate LAMBDA's state-of-the-art performance on the Safety-Gym benchmark suite in terms of sample efficiency and constraint violation.
arXiv Detail & Related papers (2022-01-24T17:02:22Z)