BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment
- URL: http://arxiv.org/abs/2511.19268v1
- Date: Mon, 24 Nov 2025 16:20:11 GMT
- Title: BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment
- Authors: Dewei Zhou, Mingwei Li, Zongxin Yang, Yu Lu, Yunqiu Xu, Zhizhong Wang, Zeyi Huang, Yi Yang
- Abstract summary: Conditional image generation enhances text-to-image synthesis with structural, spatial, or stylistic priors. Current methods face challenges in handling conflicts between these sources. We propose a bidirectionally decoupled DPO framework (BideDPO).
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conditional image generation enhances text-to-image synthesis with structural, spatial, or stylistic priors, but current methods face challenges in handling conflicts between sources. These include 1) input-level conflicts, where the conditioning image contradicts the text prompt, and 2) model-bias conflicts, where generative biases disrupt alignment even when conditions match the text. Addressing these conflicts requires nuanced solutions, which standard supervised fine-tuning struggles to provide. Preference-based optimization techniques like Direct Preference Optimization (DPO) show promise but are limited by gradient entanglement between text and condition signals and by a lack of disentangled training data for multi-constraint tasks. To overcome this, we propose a bidirectionally decoupled DPO framework (BideDPO). Our method creates two disentangled preference pairs, one for the condition and one for the text, to reduce gradient entanglement. The influence of these pairs is managed using an Adaptive Loss Balancing strategy for balanced optimization. We introduce an automated data pipeline to sample model outputs and generate conflict-aware data. This process is embedded in an iterative optimization strategy that refines both the model and the data. We construct a DualAlign benchmark to evaluate conflict resolution between text and condition. Experiments show BideDPO significantly improves text success rates (e.g., +35%) and condition adherence. We also validate our approach on the COCO dataset. Project Pages: https://limuloo.github.io/BideDPO/.
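The abstract does not give the exact form of the decoupled objective, but the core idea — one DPO term per constraint (text, condition) combined under an adaptive balancing weight — can be sketched as follows. The `dpo_term` helper, the pair layout, and the specific balancing rule here are illustrative assumptions, not the paper's implementation:

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_term(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair, given log-likelihoods of
    the winning/losing sample under the trained policy and a frozen reference."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

def bide_dpo_loss(text_pair, cond_pair, beta=0.1, eps=1e-8):
    """Decoupled DPO: one term per constraint, each pair being a tuple
    (logp_w, logp_l, ref_logp_w, ref_logp_l). The weighting below is one
    plausible adaptive-balancing rule: down-weight whichever term currently
    dominates so neither constraint's gradient swamps the other."""
    l_text = dpo_term(*text_pair, beta=beta)
    l_cond = dpo_term(*cond_pair, beta=beta)
    total = l_text + l_cond + eps
    w_text = l_cond / total  # grows when the condition term lags behind
    w_cond = l_text / total
    return w_text * l_text + w_cond * l_cond
```

Because each constraint gets its own preference pair, the text gradient never has to flow through a sample chosen for condition fidelity, which is the entanglement the abstract describes.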
Related papers
- RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
RAPO++ is a cross-stage prompt optimization framework. It unifies training-data-aligned refinement, test-time iterative scaling, and large language model fine-tuning. RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility.
arXiv Detail & Related papers (2025-10-23T04:45:09Z) - Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs
We introduce Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment of T2I models. TPO works by training the model to prefer matched prompts over mismatched prompts. Our framework is general and compatible with existing preference-based algorithms.
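The "matched vs. mismatched prompt" pairing above needs no preference-labeled image pairs at all; a minimal sketch, assuming a plain list of `(image, caption)` samples, could look like the following (`make_text_preference_pairs` is a hypothetical helper, not the paper's code):

```python
import random

def make_text_preference_pairs(dataset, seed=0):
    """Build (image, preferred_prompt, rejected_prompt) triples from an
    ordinary captioned dataset. The matched caption is preferred; a caption
    drawn from a different sample serves as the mismatched, rejected prompt."""
    rng = random.Random(seed)
    pairs = []
    for i, (image, caption) in enumerate(dataset):
        j = rng.randrange(len(dataset) - 1)
        if j >= i:
            j += 1  # guarantee the mismatched caption comes from another sample
        pairs.append((image, caption, dataset[j][1]))
    return pairs
```

Triples of this shape can then feed any existing preference-based optimizer, which is what makes the approach compatible with DPO-style algorithms.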
arXiv Detail & Related papers (2025-09-30T04:32:34Z) - RankPO: Preference Optimization for Job-Talent Matching
We propose a two-stage training framework for large language models (LLMs). In the first stage, a contrastive learning approach is used to train the model on a dataset constructed from real-world matching rules. In the second stage, we introduce a novel preference-based fine-tuning method inspired by Direct Preference Optimization (DPO) to align the model with AI-curated pairwise preferences.
arXiv Detail & Related papers (2025-03-13T10:14:37Z) - Aligning Text to Image in Diffusion Models is Easier Than You Think
SoftREPA is a lightweight contrastive fine-tuning strategy that leverages soft text tokens for representation alignment. Our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency.
arXiv Detail & Related papers (2025-03-11T10:14:22Z) - C2-DPO: Constrained Controlled Direct Preference Optimization
Direct preference optimization (DPO) has emerged as a promising approach for solving the alignment problem in AI. We show that the DPO loss can be derived by starting from an alternative optimization problem that only defines the KL guardrail on in-sample responses.
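For reference, the standard DPO objective that derivations like the one above start from, where $\pi_\theta$ is the trained policy, $\pi_{\text{ref}}$ the frozen reference model, $\beta$ the KL strength, and $(x, y_w, y_l)$ a prompt with preferred and dispreferred responses:

```latex
\mathcal{L}_{\text{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
      \log \sigma\!\Big(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
        \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
      \Big)
    \right]
```

The abstract's claim is that the same loss falls out of a constrained problem whose KL guardrail covers only in-sample responses rather than the full response distribution.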
arXiv Detail & Related papers (2025-02-22T00:38:44Z) - Dual Caption Preference Optimization for Diffusion Models
We introduce Dual Caption Preference Optimization (DCPO) to improve text-to-image diffusion models. DCPO assigns two distinct captions to each preference pair, which reinforces the learning signal. Experiments show that DCPO significantly improves image quality and relevance to prompts.
arXiv Detail & Related papers (2025-02-09T20:34:43Z) - Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization
Self-Sampling Preference Optimization (SSPO) is a new alignment method for post-training reinforcement learning. SSPO eliminates the need for paired data and reward models while retaining the training stability of SFT. SSPO surpasses all previous approaches on text-to-image benchmarks and demonstrates outstanding performance on text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z) - OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization
Vision-language pre-training models are vulnerable to multi-modal adversarial examples.
Recent works have indicated that leveraging data augmentation and image-text modal interactions can enhance the transferability of adversarial examples.
We propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack.
arXiv Detail & Related papers (2023-12-07T16:16:50Z) - Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models
We present a training-free approach called Text-Anchored Score Composition (TASC) to improve the controllability of existing models.
We propose an attention operation that realigns these independently calculated results via a cross-attention mechanism, avoiding new conflicts when combining them back.
arXiv Detail & Related papers (2023-06-26T03:48:15Z) - Refign: Align and Refine for Adaptation of Semantic Segmentation to Adverse Conditions
Refign is a generic extension to self-training-based UDA methods which leverages cross-domain correspondences.
Refign consists of two steps: (1) aligning the normal-condition image to the corresponding adverse-condition image using an uncertainty-aware dense matching network, and (2) refining the adverse prediction with the normal prediction using an adaptive label correction mechanism.
The approach introduces no extra training parameters and only minimal computational overhead (during training only), and can be used as a drop-in extension to improve any given self-training-based UDA method.
arXiv Detail & Related papers (2022-07-14T11:30:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.