DreamO: A Unified Framework for Image Customization
- URL: http://arxiv.org/abs/2504.16915v3
- Date: Tue, 13 May 2025 04:49:35 GMT
- Title: DreamO: A Unified Framework for Image Customization
- Authors: Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, Mengtian Li, Mingcong Liu, Yi Zhang, Shaojin Wu, Songtao Zhao, Jian Zhang, Qian He, Xinglong Wu
- Abstract summary: We present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process inputs of different types. We employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality alignment stage to correct quality biases introduced by low-quality data.
- Score: 23.11440970488944
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, extensive research on image customization (e.g., identity, subject, style, background, etc.) has demonstrated strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, restricting their ability to generalize and combine different types of conditions. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process inputs of different types. During training, we construct a large-scale training dataset that includes various customization tasks, and we introduce a feature routing constraint to facilitate the precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over the placement of conditions in the generated results. Moreover, we employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality alignment stage to correct quality biases introduced by low-quality data. Extensive experiments demonstrate that the proposed DreamO can effectively perform various image customization tasks with high quality and flexibly integrate different types of control conditions.
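The placeholder strategy described above (binding each reference condition to a specific placeholder token in the prompt, so its placement in the output can be controlled) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' actual API: the function name, placeholder format `[ref#k]`, and condition dictionary layout are all assumptions for illustration.

```python
# Hypothetical sketch of a DreamO-style placeholder binding step.
# All names here are illustrative assumptions, not the paper's code:
# each reference condition (identity, subject, style, ...) is assigned
# a placeholder token that is substituted into the text prompt, and the
# returned mapping ties each token back to its condition image so the
# model can place that condition where the token appears.

def bind_placeholders(prompt_template, conditions):
    """Fill {name} slots in the template with placeholder tokens.

    Returns the final prompt string and a token -> condition mapping.
    """
    mapping = {}
    prompt = prompt_template
    for name, cond in conditions.items():
        token = f"[ref#{len(mapping) + 1}]"
        prompt = prompt.replace("{" + name + "}", token)
        mapping[token] = cond  # e.g. {"type": "identity", "image": ...}
    return prompt, mapping

prompt, mapping = bind_placeholders(
    "a photo of {subject} wearing {garment} in a park",
    {
        "subject": {"type": "identity", "image": "face.png"},
        "garment": {"type": "subject", "image": "jacket.png"},
    },
)
# prompt  -> "a photo of [ref#1] wearing [ref#2] in a park"
# mapping -> "[ref#1]" bound to face.png, "[ref#2]" bound to jacket.png
```

In the actual framework, the mapping would feed the DiT's conditioning pathway so that each reference image is queried (subject to the feature routing constraint) only at the spatial positions associated with its placeholder; this sketch only shows the prompt-side bookkeeping.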
Related papers
- Exploring Scalable Unified Modeling for General Low-Level Vision [39.89755374452788]
Low-level vision involves a wide spectrum of tasks, including image restoration, enhancement, stylization, and feature extraction. To address the challenge of unified modeling across such diverse tasks, we propose a Visual task Prompt-based Image Processing framework. We develop a unified low-level vision model, GenLV, and evaluate its performance across multiple representative tasks.
arXiv Detail & Related papers (2025-07-20T03:22:52Z) - IC-Custom: Diverse Image Customization via In-Context Learning [72.92059781700594]
IC-Custom is a unified framework that seamlessly integrates position-aware and position-free image customization. It supports various industrial applications, including try-on, accessory placement, furniture arrangement, and creative IP customization. It achieves 73% higher human preference across identity consistency, harmonicity, and text alignment metrics, while training only 0.4% of the original model parameters.
arXiv Detail & Related papers (2025-07-02T17:36:38Z) - VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning [68.98988753763666]
We propose VisualCloze, a universal image generation framework. VisualCloze supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation. We introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge.
arXiv Detail & Related papers (2025-04-10T17:59:42Z) - AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks [23.041812897803034]
We propose AnySynth, a unified framework capable of generating arbitrary types of synthetic data. We have validated our framework's performance across various tasks, including Few-shot Object Detection, Cross-domain Object Detection, Zero-shot Image Retrieval, and Multi-modal Image Perception and Grounding.
arXiv Detail & Related papers (2024-11-24T04:49:07Z) - JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation [49.997839600988875]
Existing personalization methods rely on finetuning a text-to-image foundation model on a user's custom dataset.
We propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model.
Our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
arXiv Detail & Related papers (2024-07-08T17:59:02Z) - Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond [87.1712108247199]
Our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP).
We develop a generic, personalized generative framework that can handle a wide range of personalized needs.
Our methodology enhances the capabilities of foundational language models for personalized tasks.
arXiv Detail & Related papers (2024-03-15T20:21:31Z) - Unifying Image Processing as Visual Prompting Question Answering [62.84955983910612]
Image processing is a fundamental task in computer vision, which aims at enhancing image quality and extracting essential features for subsequent vision applications.
Traditionally, task-specific models are developed for individual tasks and designing such models requires distinct expertise.
We propose a universal model for general image processing that covers image restoration, image enhancement, and image feature extraction tasks.
arXiv Detail & Related papers (2023-10-16T15:32:57Z) - Cones 2: Customizable Image Synthesis with Multiple Subjects [50.54010141032032]
We study how to efficiently represent a particular subject as well as how to appropriately compose different subjects.
By rectifying the activations in the cross-attention map, the layout specifies and separates the locations of different subjects in the image.
arXiv Detail & Related papers (2023-05-30T18:00:06Z) - TSIT: A Simple and Versatile Framework for Image-to-Image Translation [103.92203013154403]
We introduce a simple and versatile framework for image-to-image translation.
We provide a carefully designed two-stream generative model with newly proposed feature transformations.
This allows multi-scale semantic structure information and style representation to be effectively captured and fused by the network.
A systematic study compares the proposed method with several state-of-the-art task-specific baselines, verifying its effectiveness in both perceptual quality and quantitative evaluations.
arXiv Detail & Related papers (2020-07-23T15:34:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.