DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization
- URL: http://arxiv.org/abs/2505.20975v1
- Date: Tue, 27 May 2025 10:07:50 GMT
- Title: DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization
- Authors: Shamil Ayupov, Maksim Nakhodnov, Anastasia Yaschenko, Andrey Kuznetsov, Aibek Alanov
- Abstract summary: Balancing concept fidelity with contextual alignment is a challenging open problem. We propose an RL-based approach that leverages the diverse outputs of T2I models to address this issue. Our method eliminates the need for human-annotated scores by generating a synthetic paired dataset for DPO-like training.
- Score: 2.5282283486446757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Personalized diffusion models have shown remarkable success in Text-to-Image (T2I) generation by enabling the injection of user-defined concepts into diverse contexts. However, balancing concept fidelity with contextual alignment remains a challenging open problem. In this work, we propose an RL-based approach that leverages the diverse outputs of T2I models to address this issue. Our method eliminates the need for human-annotated scores by generating a synthetic paired dataset for DPO-like training using external quality metrics. These better-worse pairs are specifically constructed to improve both concept fidelity and prompt adherence. Moreover, our approach supports flexible adjustment of the trade-off between image fidelity and textual alignment. Through multi-step training, our approach outperforms a naive baseline in convergence speed and output quality. We conduct extensive qualitative and quantitative analysis, demonstrating the effectiveness of our method across various architectures and fine-tuning techniques. The source code can be found at https://github.com/ControlGenAI/DreamBoothDPO.
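The abstract's pipeline (score diverse generations with external quality metrics, keep clearly ordered better-worse pairs, and train with a DPO-like objective) can be sketched as follows. This is a minimal illustration, assuming CLIP-I-style concept-fidelity and CLIP-T-style prompt-adherence metrics and the standard Diffusion-DPO loss; the function names, the `alpha` trade-off weight, and the margin threshold are hypothetical, not the repository's API.

```python
import itertools

import torch.nn.functional as F


def build_preference_pairs(images, prompt, concept_refs,
                           clip_t_score, clip_i_score,
                           alpha=0.5, min_margin=0.05):
    """Rank candidate generations by a weighted mix of concept fidelity
    (scored against reference images of the concept) and prompt adherence
    (scored against the text), then emit better-worse pairs for DPO-style
    training. `alpha` exposes the fidelity/alignment trade-off; the metric
    callables are placeholders for whatever external metrics are used.
    """
    scored = sorted(
        ((alpha * clip_i_score(img, concept_refs)
          + (1.0 - alpha) * clip_t_score(img, prompt), img)
         for img in images),
        key=lambda pair: pair[0], reverse=True,
    )
    return [(winner, loser)
            for (s_w, winner), (s_l, loser) in itertools.combinations(scored, 2)
            if s_w - s_l > min_margin]  # keep only clearly separated pairs


def diffusion_dpo_loss(err_w, err_l, err_w_ref, err_l_ref, beta=1000.0):
    """Diffusion-DPO-style objective on a (winner, loser) pair. Each `err_*`
    is the per-sample denoising error ||eps - eps_theta(x_t, t)||^2 under
    the fine-tuned model or the frozen reference model; this is the generic
    formulation, not necessarily the paper's exact loss.
    """
    margin = (err_w_ref - err_w) - (err_l_ref - err_l)
    return -F.logsigmoid(beta * margin).mean()
```

Sweeping `alpha` toward 1 favors concept fidelity and toward 0 favors prompt adherence, which matches the abstract's claim that the trade-off is adjustable at data-construction time.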
Related papers
- Diverse Text-to-Image Generation via Contrastive Noise Optimization [60.48914865049489]
Text-to-image (T2I) diffusion models have demonstrated impressive performance in generating high-fidelity images. Existing approaches typically optimize intermediate latents or text conditions during inference. We introduce Contrastive Noise Optimization, a simple yet effective method that addresses the diversity issue from a distinct perspective.
arXiv Detail & Related papers (2025-10-04T13:51:32Z) - Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z) - Towards a Unified View of Large Language Model Post-Training [27.906878681963263]
Two major sources of training data exist for post-training modern language models. We show that approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) are not in contradiction, but are instances of a single optimization process. We propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals.
arXiv Detail & Related papers (2025-09-04T17:40:33Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding [2.778335169230448]
PP-DocBee2 is an advanced version of PP-DocBee, designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements. These enhancements yield an 11.4% performance boost on internal benchmarks for Chinese business documents and reduce inference latency by 73.0% relative to the vanilla version.
arXiv Detail & Related papers (2025-06-22T13:06:13Z) - How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models [57.42800112251644]
We propose Step AG, a simple, universally applicable adaptive guidance strategy. Our evaluations focus on both image quality and image-text alignment (a generic adaptive-guidance sketch appears at the end of this list).
arXiv Detail & Related papers (2025-06-10T02:09:48Z) - Policy Optimized Text-to-Image Pipeline Design [72.87655664038617]
We introduce a novel reinforcement learning-based framework for text-to-image generation. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations. We then implement a two-phase training strategy: initial vocabulary training followed by GRPO-based optimization.
arXiv Detail & Related papers (2025-05-27T17:50:47Z) - MMaDA: Multimodal Large Diffusion Language Models [47.043301822171195]
We introduce MMaDA, a novel class of multimodal diffusion foundation models. It is designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation.
arXiv Detail & Related papers (2025-05-21T17:59:05Z) - DanceGRPO: Unleashing GRPO on Visual Generation [42.567425922760144]
Reinforcement Learning (RL) has emerged as a promising approach for fine-tuning generative models. Existing methods like DDPO and DPOK face fundamental limitations when scaling to large and diverse prompt sets. This paper presents DanceGRPO, a framework that addresses these limitations through an innovative adaptation of Group Relative Policy Optimization (a minimal GRPO sketch appears at the end of this list).
arXiv Detail & Related papers (2025-05-12T17:59:34Z) - Beyond Fine-Tuning: A Systematic Study of Sampling Techniques in Personalized Image Generation [2.9631016562930546]
Balancing the fidelity of the learned concept with the ability to generate it in various contexts presents a significant challenge. Existing methods often address this through diverse fine-tuning parameterizations and improved sampling strategies. We propose a decision framework evaluating text alignment, computational constraints, and fidelity objectives to guide strategy selection.
arXiv Detail & Related papers (2025-02-09T13:22:32Z) - Weak Supervision Dynamic KL-Weighted Diffusion Models Guided by Large Language Models [0.0]
We present a novel method for improving text-to-image generation by combining Large Language Models with diffusion models. Our approach incorporates semantic understanding from pre-trained LLMs to guide the generation process. Our method significantly improves both the visual quality and alignment of generated images with text descriptions.
arXiv Detail & Related papers (2025-02-02T15:43:13Z) - Customized Generation Reimagined: Fidelity and Editability Harmonized [30.92739649737791]
Customized generation aims to incorporate a novel concept into a pre-trained text-to-image model. However, customized generation suffers from an inherent trade-off between concept fidelity and editability.
arXiv Detail & Related papers (2024-12-06T07:54:34Z) - Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z) - Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; its preference prediction accuracy is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
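For the "How Much To Guide" entry above, a common form of adaptive classifier-free guidance applies the guided combination only for an early fraction of the denoising steps and skips the unconditional pass afterward. A minimal sketch of that generic idea follows; the step-count cutoff `guided_fraction` is an illustrative heuristic, not Step AG's actual schedule.

```python
def adaptive_cfg_eps(model, x_t, t, cond, uncond, step, num_steps,
                     scale=7.5, guided_fraction=0.5):
    """Classifier-free guidance restricted to the first `guided_fraction`
    of the sampling trajectory. Later steps return the conditional
    prediction alone, skipping the unconditional forward pass entirely.
    """
    eps_cond = model(x_t, t, cond)
    if step >= guided_fraction * num_steps:
        return eps_cond  # no guidance, and no second forward pass, late in sampling
    eps_uncond = model(x_t, t, uncond)
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

Because guidance matters most while the global layout is still forming, dropping it late in sampling tends to preserve image-text alignment while cutting compute.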
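For the DanceGRPO entry above, the core of Group Relative Policy Optimization is a critic-free, per-group normalized advantage. A minimal sketch of that standard formulation follows; DanceGRPO's visual-generation specifics (reward models, clipping, rollout handling) are omitted.

```python
import torch


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each rollout generated for the same
    prompt is scored against the group mean reward, normalized by the
    group standard deviation, so no learned value function is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

Each group of samples gets zero-mean advantages, so the policy is pushed toward a sample only insofar as it beats its siblings on the reward.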