DreamOmni: Unified Image Generation and Editing
- URL: http://arxiv.org/abs/2412.17098v1
- Date: Sun, 22 Dec 2024 17:17:28 GMT
- Title: DreamOmni: Unified Image Generation and Editing
- Authors: Bin Xia, Yuechen Zhang, Jingyao Li, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia
- Abstract summary: We introduce DreamOmni, a unified model for image generation and editing.
For training, DreamOmni jointly trains T2I generation and downstream tasks.
This collaboration significantly boosts editing performance.
- Score: 51.45871494724542
- License:
- Abstract: Currently, the success of large language models (LLMs) illustrates that a unified multitasking approach can significantly enhance model usability, streamline deployment, and foster synergistic benefits across different tasks. However, in computer vision, while text-to-image (T2I) models have significantly improved generation quality through scaling up, their framework design did not initially consider how to unify with downstream tasks, such as various types of editing. To address this, we introduce DreamOmni, a unified model for image generation and editing. We begin by analyzing existing frameworks and the requirements of downstream tasks, proposing a unified framework that integrates both T2I models and various editing tasks. Furthermore, another key challenge is the efficient creation of high-quality editing data, particularly for instruction-based and drag-based editing. To this end, we develop a synthetic data pipeline that uses sticker-like elements to synthesize accurate, high-quality datasets efficiently, which enables editing data to be scaled up for unified model training. For training, DreamOmni jointly trains T2I generation and downstream tasks. T2I training enhances the model's understanding of specific concepts and improves generation quality, while editing training helps the model grasp the nuances of the editing task. This collaboration significantly boosts editing performance. Extensive experiments confirm the effectiveness of DreamOmni. The code and model will be released.
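To make the synthetic-data idea concrete, the sketch below shows one way a sticker-style compositing step could produce an exact "add object" editing triple (source image, edited image, instruction). It is a minimal illustration assuming only Pillow; the function name, instruction template, and returned fields are hypothetical and are not taken from DreamOmni's actual pipeline.

```python
# Hypothetical sketch of a sticker-based editing-pair generator, inspired by the
# synthetic-data idea described in the abstract. Names and fields are illustrative
# assumptions, not DreamOmni's released pipeline.
import random
from PIL import Image

def make_add_object_pair(background: Image.Image,
                         sticker: Image.Image,
                         sticker_name: str) -> dict:
    """Composite a cut-out ("sticker-like") element onto a background image.

    The untouched background is the source image, the composite is the
    ground-truth edited image, and a templated instruction describes the change.
    Because the edit is constructed programmatically, the pair is pixel-accurate.
    """
    src = background.convert("RGB")
    tgt = src.copy()

    # Randomly scale the sticker and pick a placement inside the canvas.
    scale = random.uniform(0.2, 0.5)
    w = max(1, int(sticker.width * scale))
    h = max(1, int(sticker.height * scale))
    fg = sticker.convert("RGBA").resize((w, h))
    x = random.randint(0, max(0, src.width - w))
    y = random.randint(0, max(0, src.height - h))
    tgt.paste(fg, (x, y), mask=fg)  # the alpha channel preserves the cut-out shape

    return {
        "source": src,                                 # image before the edit
        "target": tgt,                                 # image after the edit
        "instruction": f"add a {sticker_name} to the image",
        "bbox": (x, y, x + w, y + h),                  # reusable for mask- or drag-based supervision
    }
```

Because the target is derived from the source programmatically, such a record could in principle be reversed for object-removal pairs or shifted for drag-based supervision, though how DreamOmni actually constructs those variants is not specified in this abstract.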
Related papers
- EditAR: Unified Conditional Generation with Autoregressive Models [58.093860528672735]
We propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks.
The model takes both images and instructions as inputs, and predicts the tokens of the edited image in a vanilla next-token paradigm.
We evaluate its effectiveness across diverse tasks on established benchmarks, showing competitive performance to various state-of-the-art task-specific methods.
arXiv Detail & Related papers (2025-01-08T18:59:35Z)
- GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis [10.47359822447001]
We present an alternative paradigm for T2I synthesis, decomposing the task of complex multi-step generation into three steps.
Our approach derives its strength from being modular, training-free, and applicable to any combination of image generation and editing models.
arXiv Detail & Related papers (2024-12-08T22:29:56Z)
- SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing [50.098005973600024]
We propose a novel video generation and editing system powered by our Semantic Planning Agent (SPAgent).
SPAgent bridges the gap between diverse user intents and the effective utilization of existing generative models.
Experimental results demonstrate that SPAgent effectively coordinates models to generate or edit videos.
arXiv Detail & Related papers (2024-11-28T08:07:32Z)
- Image Inpainting Models are Effective Tools for Instruction-guided Image Editing [42.63350374074953]
This technical report describes the winning solution of the CVPR 2024 GenAI Media Generation Challenge Workshop's Instruction-guided Image Editing track.
We use a four-step process, IIIE (Inpainting-based Instruction-guided Image Editing): editing category classification, main editing object identification, editing mask acquisition, and image inpainting.
Results show that through proper combinations of language models and image inpainting models, our pipeline can reach a high success rate with satisfying visual quality.
arXiv Detail & Related papers (2024-07-18T03:55:33Z)
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to modify a given synthetic or real image to meet specific user requirements.
Recent significant advances in this field build on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
- VASE: Object-Centric Appearance and Shape Manipulation of Real Videos [108.60416277357712]
In this work, we introduce an object-centric framework designed to control both the object's appearance and, notably, to execute precise and explicit structural modifications on it.
We build our framework on a pre-trained image-conditioned diffusion model, integrate layers to handle the temporal dimension, and propose training strategies and architectural modifications to enable shape control.
We evaluate our method on the image-driven video editing task showing similar performance to the state-of-the-art, and showcasing novel shape-editing capabilities.
arXiv Detail & Related papers (2024-01-04T18:59:24Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- CoditT5: Pretraining for Source Code and Natural Language Editing [34.77621217370665]
CoditT5 is a large language model for software-related editing tasks that is pretrained on large amounts of source code and natural language comments.
We fine-tune it on various downstream editing tasks, including comment updating, bug fixing, and automated code review.
arXiv Detail & Related papers (2022-08-10T16:59:40Z)