Object-centric Self-improving Preference Optimization for Text-to-Image Generation
- URL: http://arxiv.org/abs/2506.02015v1
- Date: Wed, 28 May 2025 03:45:42 GMT
- Title: Object-centric Self-improving Preference Optimization for Text-to-Image Generation
- Authors: Yoonjin Oh, Yongjin Kim, Hyomin Kim, Donghwan Chi, Sungwoong Kim
- Abstract summary: We propose an Object-centric Self-improving Preference Optimization framework for text-to-image generation by MLLMs. OSPO emphasizes the importance of high-quality preference pair data, which is critical for effective preference optimization. We validate OSPO on three representative compositional text-to-image benchmarks, demonstrating substantial performance gains over baseline models.
- Score: 10.87176643368746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly improved both image understanding and generation capabilities. Despite these improvements, MLLMs still struggle with fine-grained visual comprehension, particularly in text-to-image generation tasks. While preference optimization methods have been explored to address these limitations in image understanding tasks, their application to image generation remains largely underexplored. To address this gap, we propose an Object-centric Self-improving Preference Optimization (OSPO) framework designed for text-to-image generation by MLLMs. OSPO leverages the intrinsic reasoning abilities of MLLMs without requiring any external datasets or models. OSPO emphasizes the importance of high-quality preference pair data, which is critical for effective preference optimization. To achieve this, it introduces a self-improving mechanism that autonomously constructs object-level contrastive preference pairs through object-centric prompt perturbation, densification and VQA scoring. This process eliminates ambiguous or disproportionate variations commonly found in naively generated preference pairs, thereby enhancing the effectiveness of preference optimization. We validate OSPO on three representative compositional text-to-image benchmarks, demonstrating substantial performance gains over baseline models.
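The abstract describes two stages: building contrastive preference pairs filtered by VQA scoring, then running preference optimization on them. The following is a minimal sketch of that general idea, not the authors' implementation. The names `Candidate`, `select_pair`, `min_margin`, and `dpo_loss` are hypothetical, the score range is assumed, and the DPO-style objective is an assumption, since the abstract does not name the loss that OSPO uses.

```python
import math
from typing import NamedTuple, Optional, Sequence, Tuple


class Candidate(NamedTuple):
    """A generated image plus a self-assigned VQA score (assumed to lie in [0, 1])."""
    image_id: str
    vqa_score: float


def select_pair(
    candidates: Sequence[Candidate], min_margin: float = 0.2
) -> Optional[Tuple[Candidate, Candidate]]:
    """Return (chosen, rejected), or None when the pair is too ambiguous.

    Hypothetical filter: the best- and worst-scored candidates form a pair
    only if their scores differ by at least `min_margin`.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: c.vqa_score, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if chosen.vqa_score - rejected.vqa_score < min_margin:
        return None
    return chosen, rejected


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    pool = [Candidate("img_a", 0.9), Candidate("img_b", 0.3), Candidate("img_c", 0.6)]
    pair = select_pair(pool)
    if pair is not None:
        chosen, rejected = pair
        print(chosen.image_id, rejected.image_id)
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

The margin filter stands in for the abstract's point that ambiguous pairs weaken preference optimization: a pair whose two scores are nearly equal carries little signal about which image better matches the prompt.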
Related papers
- Interleaving Reasoning for Better Text-to-Image Generation [83.69082794730664]
We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals. Experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN.
arXiv Detail & Related papers (2025-09-08T17:56:23Z) - Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs [74.74767980885758]
We propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework. CcDPO enhances per-image perception in multi-image settings by zooming into visual clues, from sequential context to local details. Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains.
arXiv Detail & Related papers (2025-05-28T14:24:02Z) - AdaViP: Aligning Multi-modal LLMs via Adaptive Vision-enhanced Preference Optimization [26.03204301595711]
We propose an Adaptive Vision-enhanced Preference optimization (AdaViP) that addresses limitations through two key innovations. Vision-based preference pair construction integrates multiple visual foundation models to strategically remove key visual elements from the image. AdaViP-7B achieves 93.7% and 96.4% reductions in response-level and mentioned-level hallucination respectively on the Object HalBench.
arXiv Detail & Related papers (2025-04-22T06:19:38Z) - Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment [55.74860093731475]
Marmot is a novel framework that employs Multi-Agent Reasoning for Multi-Object Self-Correcting. We construct a multi-agent self-correcting system featuring a decision-execution-verification mechanism. Experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships.
arXiv Detail & Related papers (2025-04-10T16:54:28Z) - Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization [18.855378039713678]
Large Vision Language Models (VLMs) are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies. We introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset. We also introduce rDPO, an extension of the standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning.
arXiv Detail & Related papers (2025-02-18T18:59:57Z) - CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs [107.21334626890713]
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities. We propose a Cross-modal Hierarchical Direct Preference Optimization (CHiP) to address these limitations. We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations.
arXiv Detail & Related papers (2025-01-28T02:05:38Z) - Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [86.69947123512836]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z) - Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution [43.07899102255169]
We propose a novel multimodal self-evolution framework that enables the model to autonomously generate high-quality questions and answers. First, we implement an image-driven self-questioning mechanism, allowing the model to create and evaluate questions based on image content. Second, we introduce an answer self-enhancement technique, starting with image captioning to improve answer quality.
arXiv Detail & Related papers (2024-12-20T08:06:00Z) - MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models [85.30735602813093]
Multi-Image Augmented Direct Preference Optimization (MIA-DPO) is a visual preference alignment approach that effectively handles multi-image inputs.
MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid collages or pic-in-pic formats.
arXiv Detail & Related papers (2024-10-23T07:56:48Z) - Modality-Fair Preference Optimization for Trustworthy MLLM Alignment [11.796170286878056]
Direct Preference Optimization (DPO) is effective for aligning large language models (LLMs).
It often favors text over image information, leading to unreliable outputs and visual hallucinations.
We propose Modality-Fair Preference Optimization (MFPO) to balance text and image preferences.
arXiv Detail & Related papers (2024-10-20T08:56:52Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings. EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z) - A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap [50.079224604394]
We present a novel model-agnostic framework called Context-Enhanced Feature Alignment (CEFA).
CEFA consists of a feature alignment module and a context enhancement module.
Our method can serve as a plug-and-play module to improve the detection performance of HOI models on rare categories.
arXiv Detail & Related papers (2024-07-31T08:42:48Z) - OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation Control [66.03885917320189]
OrientDream is a camera orientation conditioned framework for efficient and multi-view consistent 3D generation from textual prompts.
Our strategy emphasizes the implementation of an explicit camera orientation conditioned feature in the pre-training of a 2D text-to-image diffusion module.
Our experiments reveal that our method not only produces high-quality NeRF models with consistent multi-view properties but also achieves an optimization speed significantly greater than existing methods.
arXiv Detail & Related papers (2024-06-14T13:16:18Z) - Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks. Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results. We propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies.
arXiv Detail & Related papers (2024-05-24T23:09:27Z) - Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model VP-Score over VisionPrefer to guide the training of text-to-image generative models and the preference prediction accuracy of VP-Score is comparable to human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z) - Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding [7.893308498886083]
Our proposed method aims to address the challenges of generalizability and fidelity in an object-driven way.
A prototypical embedding is based on the object's appearance and its class, before fine-tuning the diffusion model.
Our method outperforms several existing works.
arXiv Detail & Related papers (2024-01-28T17:11:42Z) - Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z) - Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric, that is better suited for multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.