Related papers: Margin-aware Preference Optimization for Aligning Diffusion Models without Reference

URL: http://arxiv.org/abs/2406.06424v1
Date: Mon, 10 Jun 2024 16:14:45 GMT
Title: Margin-aware Preference Optimization for Aligning Diffusion Models without Reference
Authors: Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, Jongheon Jeong
Abstract summary: This paper focuses on the alignment of recent text-to-image diffusion models, such as Stable Diffusion XL (SDXL). We propose a novel and memory-friendly preference alignment method for diffusion models that does not depend on any reference model, coined margin-aware preference optimization (MaPO). MaPO jointly maximizes the likelihood margin between the preferred and dispreferred image sets and the likelihood of the preferred sets, simultaneously learning general stylistic features and preferences.
Score: 19.397326645617422
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern alignment techniques based on human preferences, such as RLHF and DPO, typically employ divergence regularization relative to the reference model to ensure training stability. However, this often limits the flexibility of models during alignment, especially when there is a clear distributional discrepancy between the preference data and the reference model. In this paper, we focus on the alignment of recent text-to-image diffusion models, such as Stable Diffusion XL (SDXL), and find that this "reference mismatch" is indeed a significant problem in aligning these models due to the unstructured nature of visual modalities: e.g., a preference for a particular stylistic aspect can easily induce such a discrepancy. Motivated by this observation, we propose a novel and memory-friendly preference alignment method for diffusion models that does not depend on any reference model, coined margin-aware preference optimization (MaPO). MaPO jointly maximizes the likelihood margin between the preferred and dispreferred image sets and the likelihood of the preferred sets, simultaneously learning general stylistic features and preferences. For evaluation, we introduce two new pairwise preference datasets, which comprise self-generated image pairs from SDXL, Pick-Style and Pick-Safety, simulating diverse scenarios of reference mismatch. Our experiments validate that MaPO can significantly improve alignment on Pick-Style and Pick-Safety and general preference alignment when used with Pick-a-Pic v2, surpassing the base SDXL and other existing methods. Our code, models, and datasets are publicly available via https://mapo-t2i.github.io
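The abstract describes the objective only in words: a margin term between preferred and dispreferred samples plus a likelihood term on the preferred samples, with no reference model. The following is a minimal sketch of one objective with that shape, not the paper's actual loss. The function names, the `beta` and `margin_weight` parameters, and the use of plain scalar log-likelihoods are all assumptions made for illustration; for a diffusion model, some tractable proxy for the log-likelihood would have to be substituted.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def margin_aware_loss(
    logp_preferred: float,
    logp_dispreferred: float,
    beta: float = 1.0,           # hypothetical temperature on the margin
    margin_weight: float = 1.0,  # hypothetical weight balancing the two terms
) -> float:
    """Illustrative reference-free loss for a single preference pair.

    Both inputs are (proxy) log-likelihoods under the model being trained.
    No reference-model quantities appear anywhere in the computation.
    """
    # Margin term: decreases as the preferred sample becomes more likely
    # than the dispreferred one.
    margin = logp_preferred - logp_dispreferred
    margin_term = -log_sigmoid(beta * margin)

    # Likelihood term: decreases as the preferred sample becomes more likely
    # in absolute terms.
    likelihood_term = -logp_preferred

    return margin_weight * margin_term + likelihood_term


if __name__ == "__main__":
    # A wider margin gives a lower loss at equal preferred likelihood.
    print(margin_aware_loss(-1.0, -3.0))
    print(margin_aware_loss(-1.0, -1.5))
```

In this sketch, widening the margin and raising the preferred log-likelihood both lower the loss, which mirrors the "jointly maximizes" wording of the abstract under the stated assumptions.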

Related papers

DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models [50.32663816994459]
Diffusion-styled Preference Optimization (DiffPO) provides an efficient and policy-agnostic solution for aligning LLMs with humans. DiffPO avoids the time latency associated with token-level generation. Experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that DiffPO achieves superior alignment performance across various settings.
arXiv Detail & Related papers (2025-03-06T09:21:54Z)
Dual Caption Preference Optimization for Diffusion Models [51.223275938663235]
We propose Dual Caption Preference Optimization (DCPO), a novel approach that utilizes two distinct captions to mitigate irrelevant prompts. Our experiments show that DCPO significantly improves image quality and relevance to prompts, outperforming Stable Diffusion (SD) 2.1, SFT_Chosen, Diffusion-DPO, and MaPO across multiple metrics.
arXiv Detail & Related papers (2025-02-09T20:34:43Z)
Calibrated Multi-Preference Optimization for Aligning Diffusion Models [92.90660301195396]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models. CaPO incorporates the general preference from multiple reward models without human-annotated data. Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z)
Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization [46.888425016169144]
Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically leverage Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. In this work, we demonstrate that diffusion models are inherently well-suited for step-level reward modeling in the latent space. We introduce Latent Preference Optimization (LPO), a method designed for step-level preference optimization directly in the latent space.
arXiv Detail & Related papers (2025-02-03T04:51:28Z)
Personalized Preference Fine-tuning of Diffusion Models [75.22218338096316]
We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way. Our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
arXiv Detail & Related papers (2025-01-11T22:38:41Z)
SePPO: Semi-Policy Preference Optimization for Diffusion Alignment [67.8738082040299]
We propose a preference optimization method that aligns diffusion models (DMs) with preferences without relying on reward models or paired human-annotated data. We validate SePPO across both text-to-image and text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z)
Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback [64.67540769692074]
Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date. We introduce an approach called Margin Matching Preference Optimization (MMPO), which incorporates relative quality margins into optimization, leading to improved LLM policies and reward models. Experiments with both human and AI feedback data demonstrate that MMPO consistently outperforms baseline methods, often by a substantial margin, on popular benchmarks including MT-bench and RewardBench.
arXiv Detail & Related papers (2024-10-04T04:56:11Z)
General Preference Modeling with Preference Representations for Aligning Language Models [51.14207112118503]
We introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently. We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z)
Modulated Intervention Preference Optimization (MIPO): Keep the Easy, Refine the Difficult [0.48951183832371004]
We propose Modulated Intervention Preference Optimization (MIPO) to address this issue. MIPO modulates the degree of intervention from the reference model based on how well the given data is aligned with it. We compare the performance of MIPO and DPO using Mistral-7B and Llama3-8B in Alpaca Eval 2.0 and MT-Bench.
arXiv Detail & Related papers (2024-09-26T05:24:14Z)
Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization [68.69203905664524]
We introduce Diffusion-RPO, a new method designed to align diffusion-based T2I models with human preferences more effectively. We have developed a new evaluation metric, style alignment, aimed at overcoming the challenges of high costs and low interpretability. Our findings demonstrate that Diffusion-RPO outperforms established methods such as Supervised Fine-Tuning and Diffusion-DPO in tuning Stable Diffusion versions 1.5 and XL-1.0.
arXiv Detail & Related papers (2024-06-10T15:42:03Z)
Preference Alignment with Flow Matching [23.042382086241364]
Preference Flow Matching (PFM) is a new framework for preference-based reinforcement learning (PbRL). It streamlines the integration of preferences into an arbitrary class of pre-trained models. We provide theoretical insights that support our method's alignment with standard PbRL objectives.
arXiv Detail & Related papers (2024-05-30T08:16:22Z)
Diffusion Model Alignment Using Direct Preference Optimization [103.2238655827797]
Diffusion-DPO is a method to align diffusion models to human preferences by directly optimizing on human comparison data. We fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 with Diffusion-DPO. We also develop a variant that uses AI feedback and has comparable performance to training on human preferences.
arXiv Detail & Related papers (2023-11-21T15:24:05Z)
Autoregressive Score Matching [113.4502004812927]
We propose autoregressive conditional score models (AR-CSM) where we parameterize the joint distribution in terms of the derivatives of univariable log-conditionals (scores). For AR-CSM models, this divergence between data and model distributions can be computed and optimized efficiently, requiring no expensive sampling or adversarial training. We show with extensive experimental results that it can be applied to density estimation on synthetic data, image generation, image denoising, and training latent variable models with implicit encoders.
arXiv Detail & Related papers (2020-10-24T07:01:24Z)
Rationalizing Text Matching: Learning Sparse Alignments via Optimal Transport [14.86310501896212]
In this work, we extend this selective rationalization approach to text matching. The goal is to jointly select and align text pieces, such as tokens or sentences, as a justification for the downstream prediction. Our approach employs optimal transport (OT) to find a minimal cost alignment between the inputs.
arXiv Detail & Related papers (2020-05-27T01:20:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.