Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization
- URL: http://arxiv.org/abs/2510.08256v1
- Date: Thu, 09 Oct 2025 14:15:14 GMT
- Title: Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization
- Authors: Jason Bohne, Pawel Polak, David Rosenberg, Brian Bloniarz, Gary Kazantsev
- Abstract summary: We propose Mix- and MoE-DPO, a framework that extends DPO with both soft mixture models and mixture-of-experts. Our framework supports both shared base architectures with expert-specific policy heads and fully independent expert models. We validate our approach on a variety of model sizes and multi-preference datasets.
- Score: 2.1487222438373674
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Direct Preference Optimization (DPO) has recently emerged as a simple and effective alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with user preferences. However, existing DPO formulations rely on a single monolithic model, which limits their expressivity in multi-task settings and their adaptability to heterogeneous or diverse preference distributions. In this work, we propose Mix- and MoE-DPO, a framework that extends DPO with both soft mixture models and mixture-of-experts (MoE) architectures, using a stochastic variational inference approach. Our method introduces a latent-variable model over expert assignments and optimizes a variational evidence lower bound (ELBO), enabling stable and efficient learning of specialized expert policies from preference data. Mix- and MoE-DPO provides three key advantages over standard DPO: (i) generalization via universal function approximation through mixtures; (ii) reward and policy specialization through expert components tailored to distinct preference modes; and (iii) contextual alignment through input-dependent soft gating that enables user-specific mixture policies. Our framework supports both shared base architectures with expert-specific policy heads and fully independent expert models, allowing flexible trade-offs between parameter efficiency and specialization. We validate our approach on a variety of model sizes and multi-preference datasets, demonstrating that Mix- and MoE-DPO offers a powerful and scalable method for preference-based LLM alignment.
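The abstract describes an input-dependent soft gate that mixes expert policies, each trained with a DPO-style preference likelihood under a variational ELBO. The snippet below is a minimal sketch of that idea, not the authors' implementation: names such as mixture_dpo_loss, gate_logits, and beta are illustrative assumptions, and the variational posterior over expert assignments is collapsed into the marginal mixture likelihood, with the gate acting as the prior.

```python
import torch
import torch.nn.functional as F

def mixture_dpo_loss(expert_logps_w, expert_logps_l,  # [B, K] policy log p(y|x), per expert
                     ref_logp_w, ref_logp_l,          # [B]    reference-model log p(y|x)
                     gate_logits,                     # [B, K] input-dependent gating scores
                     beta: float = 0.1):
    # Per-expert DPO margin: beta * (chosen log-ratio minus rejected log-ratio).
    margin = beta * ((expert_logps_w - ref_logp_w.unsqueeze(-1))
                     - (expert_logps_l - ref_logp_l.unsqueeze(-1)))   # [B, K]
    # Per-expert Bradley-Terry log-likelihood that the chosen response wins.
    per_expert_ll = F.logsigmoid(margin)                              # [B, K]
    # Soft, input-dependent mixture weights over experts.
    log_gate = F.log_softmax(gate_logits, dim=-1)                     # [B, K]
    # Marginal log-likelihood of the preference under the mixture policy.
    mixture_ll = torch.logsumexp(log_gate + per_expert_ll, dim=-1)    # [B]
    return -mixture_ll.mean()
```

In the shared-base variant mentioned in the abstract, the per-expert log-probabilities would come from lightweight expert heads on one backbone; with fully independent experts, each would be a separate policy model. Either way, the gate produces prompt- or user-specific mixture policies.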
Related papers
- MixDPO: Modeling Preference Strength for Pluralistic Alignment [24.622787481918863]
We introduce Mixed Logit Direct Preference Optimization (MixDPO), a generalization of Direct Preference Optimization that models variation in preference strength.
We evaluate MixDPO on three preference datasets using two open-weight language models.
arXiv Detail & Related papers (2026-01-07T16:57:43Z)
- Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation [60.33386541343322]
We propose a Multimodal Large Language Models framework that integrates Hardness-aware and Noise-regularized preference optimization for Recommendation (HaNoRec).
Specifically, HaNoRec dynamically adjusts optimization weights based on both the estimated hardness of each training sample and the policy model's real-time responsiveness.
arXiv Detail & Related papers (2025-11-24T04:10:46Z)
- M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following [4.119014132092875]
Large Vision-Language Models (LVLMs) hold immense potential for complex multimodal instruction following.
M3PO is a novel and data-efficient method designed to enhance LVLMs' capabilities in visual instruction following.
M3PO intelligently selects the most "learning-valuable" preference sample pairs from a diverse pool of LVLM-generated candidates.
arXiv Detail & Related papers (2025-08-17T18:07:55Z)
- MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge [35.703451475662995]
We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences.
MaPPO integrates prior reward estimates into a principled Maximum a Posteriori (MaP) objective.
MaPPO can be used as a plugin with consistent improvement on DPO variants.
arXiv Detail & Related papers (2025-07-27T05:26:50Z)
- Multi-Preference Lambda-weighted Listwise DPO for Small-Scale Model Alignment [5.276657230880984]
Large language models (LLMs) demonstrate strong generalization across a wide range of language tasks, but often generate outputs that misalign with human preferences.
Direct Preference Optimization (DPO) simplifies the process by treating alignment as a classification task over binary preference pairs.
We propose Multi-Preference Lambda-weighted Listwise DPO, which allows the model to learn from more detailed human feedback.
Our method consistently outperforms standard DPO on alignment while enabling efficient, controllable, and fine-grained adaptation suitable for real-world deployment.
arXiv Detail & Related papers (2025-06-24T16:47:17Z)
- AMoPO: Adaptive Multi-objective Preference Optimization without Reward Models and Reference Models [18.249363312256722]
AMoPO is a novel framework that achieves dynamic balance across preference dimensions.
We introduce the multi-objective optimization paradigm to use the dimension-aware generation metrics as implicit rewards.
Empirical results demonstrate that AMoPO outperforms state-of-the-art baselines by 28.5%.
arXiv Detail & Related papers (2025-06-08T14:31:06Z)
- Robust Multi-Objective Preference Alignment with Online DPO [6.434799451791957]
Multi-objective preference alignment is critical for developing AI systems that are personalizable, helpful, and safe.
Existing approaches are either computationally expensive to train or do not sufficiently steer model behaviors.
This paper introduces the Multi-Objective Online DPO algorithm, designed to robustly and efficiently align model behaviors with multiple, potentially conflicting human preferences.
arXiv Detail & Related papers (2025-03-01T02:01:49Z)
- Personalized Preference Fine-tuning of Diffusion Models [75.22218338096316]
We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences.
With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way.
Our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
arXiv Detail & Related papers (2025-01-11T22:38:41Z)
- Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications.
Ensuring their alignment with the diverse preferences of individual users has become a critical challenge.
We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Unified Preference Optimization: Language Model Alignment Beyond the Preference Frontier [0.5120567378386615]
We propose a unified approach to aligning large language models (LLMs).
Based on a simple decomposition of preference and auxiliary objectives, we allow for tuning LLMs to optimize user and designer preferences.
arXiv Detail & Related papers (2024-05-28T08:35:48Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization [76.09576643028362]
We present Multi-Objective Direct Preference Optimization (MODPO) for multiple alignment objectives.
MODPO folds language modeling directly into reward modeling, training language models as implicit collective reward models.
It theoretically yields the same optimal solutions as MORLHF but is practically more stable and efficient.
arXiv Detail & Related papers (2023-10-05T17:35:26Z)