DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization
- URL: http://arxiv.org/abs/2505.20975v1
- Date: Tue, 27 May 2025 10:07:50 GMT
- Title: DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization
- Authors: Shamil Ayupov, Maksim Nakhodnov, Anastasia Yaschenko, Andrey Kuznetsov, Aibek Alanov
- Abstract summary: Balancing concept fidelity with contextual alignment is a challenging open problem. We propose an RL-based approach that leverages the diverse outputs of T2I models to address this issue. Our method eliminates the need for human-annotated scores by generating a synthetic paired dataset for DPO-like training.
- Score: 2.5282283486446757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Personalized diffusion models have shown remarkable success in Text-to-Image (T2I) generation by enabling the injection of user-defined concepts into diverse contexts. However, balancing concept fidelity with contextual alignment remains a challenging open problem. In this work, we propose an RL-based approach that leverages the diverse outputs of T2I models to address this issue. Our method eliminates the need for human-annotated scores by generating a synthetic paired dataset for DPO-like training using external quality metrics. These better-worse pairs are specifically constructed to improve both concept fidelity and prompt adherence. Moreover, our approach supports flexible adjustment of the trade-off between image fidelity and textual alignment. Through multi-step training, our approach outperforms a naive baseline in convergence speed and output quality. We conduct extensive qualitative and quantitative analysis, demonstrating the effectiveness of our method across various architectures and fine-tuning techniques. The source code can be found at https://github.com/ControlGenAI/DreamBoothDPO.
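The abstract's pipeline (score diverse generations with external quality metrics, keep clearly ordered better-worse pairs, and train with a DPO-like objective) can be sketched as follows. This is a minimal illustration, assuming CLIP-I-style concept-fidelity and CLIP-T-style prompt-adherence metrics and the standard Diffusion-DPO loss; the function names, the `alpha` trade-off weight, and the margin threshold are hypothetical, not the repository's API.

```python
import itertools

import torch.nn.functional as F


def build_preference_pairs(images, prompt, concept_refs,
                           clip_t_score, clip_i_score,
                           alpha=0.5, min_margin=0.05):
    """Rank candidate generations by a weighted mix of concept fidelity
    (scored against reference images of the concept) and prompt adherence
    (scored against the text), then emit better-worse pairs for DPO-style
    training. `alpha` exposes the fidelity/alignment trade-off; the metric
    callables are placeholders for whatever external metrics are used.
    """
    scored = sorted(
        ((alpha * clip_i_score(img, concept_refs)
          + (1.0 - alpha) * clip_t_score(img, prompt), img)
         for img in images),
        key=lambda pair: pair[0], reverse=True,
    )
    return [(winner, loser)
            for (s_w, winner), (s_l, loser) in itertools.combinations(scored, 2)
            if s_w - s_l > min_margin]  # keep only clearly separated pairs


def diffusion_dpo_loss(err_w, err_l, err_w_ref, err_l_ref, beta=1000.0):
    """Diffusion-DPO-style objective on a (winner, loser) pair. Each `err_*`
    is the per-sample denoising error ||eps - eps_theta(x_t, t)||^2 under
    the fine-tuned model or the frozen reference model; this is the generic
    formulation, not necessarily the paper's exact loss.
    """
    margin = (err_w_ref - err_w) - (err_l_ref - err_l)
    return -F.logsigmoid(beta * margin).mean()
```

Sweeping `alpha` toward 1 favors concept fidelity and toward 0 favors prompt adherence, which matches the abstract's claim that the trade-off is adjustable at data-construction time.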
Related papers
- Diverse Text-to-Image Generation via Contrastive Noise Optimization [60.48914865049489]
Text-to-image (T2I) diffusion models have demonstrated impressive performance in generating high-fidelity images. Existing approaches typically optimize intermediate latents or text conditions during inference. We introduce Contrastive Noise Optimization, a simple yet effective method that addresses the diversity issue from a distinct perspective.
arXiv Detail & Related papers (2025-10-04T13:51:32Z) - Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z) - Towards a Unified View of Large Language Model Post-Training [27.906878681963263]
Two major sources of training data exist for post-training modern language models. We show that approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) are not in contradiction, but are instances of a single optimization process. We propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals.
arXiv Detail & Related papers (2025-09-04T17:40:33Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding [2.778335169230448]
PP-DocBee2 is an advanced version of PP-DocBee, designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements. These enhancements yield an 11.4% performance boost on internal benchmarks for Chinese business documents and reduce inference latency by 73.0% relative to the vanilla version.
arXiv Detail & Related papers (2025-06-22T13:06:13Z) - How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models [57.42800112251644]
We propose Step AG, a simple, universally applicable adaptive guidance strategy. Our evaluations focus on both image quality and image-text alignment (a generic adaptive-guidance sketch appears at the end of this list).
arXiv Detail & Related papers (2025-06-10T02:09:48Z) - Policy Optimized Text-to-Image Pipeline Design [72.87655664038617]
We introduce a novel reinforcement learning-based framework for text-to-image generation. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations. We then implement a two-phase training strategy: initial vocabulary training followed by GRPO-based optimization.
arXiv Detail & Related papers (2025-05-27T17:50:47Z) - MMaDA: Multimodal Large Diffusion Language Models [47.043301822171195]
We introduce MMaDA, a novel class of multimodal diffusion foundation models. It is designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation.
arXiv Detail & Related papers (2025-05-21T17:59:05Z) - DanceGRPO: Unleashing GRPO on Visual Generation [42.567425922760144]
Reinforcement Learning (RL) has emerged as a promising approach for fine-tuning generative models. Existing methods like DDPO and DPOK face fundamental limitations when scaling to large and diverse prompt sets. This paper presents DanceGRPO, a framework that addresses these limitations through an innovative adaptation of Group Relative Policy Optimization (a minimal GRPO sketch appears at the end of this list).
arXiv Detail & Related papers (2025-05-12T17:59:34Z) - Beyond Fine-Tuning: A Systematic Study of Sampling Techniques in Personalized Image Generation [2.9631016562930546]
Balancing the fidelity of the learned concept with the ability to generate it in various contexts presents a significant challenge. Existing methods often address this through diverse fine-tuning parameterizations and improved sampling strategies. We propose a decision framework evaluating text alignment, computational constraints, and fidelity objectives to guide strategy selection.
arXiv Detail & Related papers (2025-02-09T13:22:32Z) - Weak Supervision Dynamic KL-Weighted Diffusion Models Guided by Large Language Models [0.0]
We present a novel method for improving text-to-image generation by combining Large Language Models with diffusion models. Our approach incorporates semantic understanding from pre-trained LLMs to guide the generation process. Our method significantly improves both the visual quality and alignment of generated images with text descriptions.
arXiv Detail & Related papers (2025-02-02T15:43:13Z) - Customized Generation Reimagined: Fidelity and Editability Harmonized [30.92739649737791]
Customized generation aims to incorporate a novel concept into a pre-trained text-to-image model. However, customized generation suffers from an inherent trade-off between concept fidelity and editability.
arXiv Detail & Related papers (2024-12-06T07:54:34Z) - Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z) - Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; its preference prediction accuracy is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
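For the "How Much To Guide" entry above, a common form of adaptive classifier-free guidance applies the guided combination only for an early fraction of the denoising steps and skips the unconditional pass afterward. A minimal sketch of that generic idea follows; the step-count cutoff `guided_fraction` is an illustrative heuristic, not Step AG's actual schedule.

```python
def adaptive_cfg_eps(model, x_t, t, cond, uncond, step, num_steps,
                     scale=7.5, guided_fraction=0.5):
    """Classifier-free guidance restricted to the first `guided_fraction`
    of the sampling trajectory. Later steps return the conditional
    prediction alone, skipping the unconditional forward pass entirely.
    """
    eps_cond = model(x_t, t, cond)
    if step >= guided_fraction * num_steps:
        return eps_cond  # no guidance, and no second forward pass, late in sampling
    eps_uncond = model(x_t, t, uncond)
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

Because guidance matters most while the global layout is still forming, dropping it late in sampling tends to preserve image-text alignment while cutting compute.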
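For the DanceGRPO entry above, the core of Group Relative Policy Optimization is a critic-free, per-group normalized advantage. A minimal sketch of that standard formulation follows; DanceGRPO's visual-generation specifics (reward models, clipping, rollout handling) are omitted.

```python
import torch


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each rollout generated for the same
    prompt is scored against the group mean reward, normalized by the
    group standard deviation, so no learned value function is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

Each group of samples gets zero-mean advantages, so the policy is pushed toward a sample only insofar as it beats its siblings on the reward.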