OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction
- URL: http://arxiv.org/abs/2410.04932v1
- Date: Mon, 7 Oct 2024 11:26:13 GMT
- Title: OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction
- Authors: Leheng Li, Weichao Qiu, Xu Yan, Jing He, Kaiqiang Zhou, Yingjie Cai, Qing Lian, Bingbing Liu, Ying-Cong Chen,
- Abstract summary: We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization.
Our approach significantly expands the scope of text-to-image generation, and elevates it to a more versatile and practical dimension in controllability.
- Score: 32.08995899903304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization. For all instances, the multimodal instruction can be described through text prompts or image references. Given a set of user-defined masks and associated text or image guidance, our objective is to generate an image, where multiple objects are positioned at specified coordinates and their attributes are precisely aligned with the corresponding guidance. This approach significantly expands the scope of text-to-image generation, and elevates it to a more versatile and practical dimension in controllability. In this paper, our core contribution lies in the proposed latent control signals, a high-dimensional spatial feature that provides a unified representation to integrate the spatial, textual, and image conditions seamlessly. The text condition extends ControlNet to provide instance-level open-vocabulary generation. The image condition further enables fine-grained control with personalized identity. In practice, our method empowers users with more flexibility in controllable generation, as users can choose multi-modal conditions from text or images as needed. Furthermore, thorough experiments demonstrate our enhanced performance in image synthesis fidelity and alignment across different tasks and datasets. Project page: https://len-li.github.io/omnibooth-web/
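A minimal sketch of how such a latent control map could be assembled from per-instance masks plus per-instance condition vectors (e.g. CLIP text embeddings for prompts or image-encoder features for reference crops). The function name, tensor shapes, feature dimension, and overwrite rule for overlapping masks are illustrative assumptions, not the authors' implementation:

```python
import torch

def build_latent_control(masks, embeddings, dim=768):
    """
    masks:      (N, H, W) binary instance masks
    embeddings: (N, dim)  per-instance condition vectors (text or image derived)
    returns:    (dim, H, W) spatial feature map; later instances overwrite
                earlier ones where masks overlap
    """
    _, h, w = masks.shape
    control = torch.zeros(dim, h, w)
    for mask, emb in zip(masks, embeddings):
        # Broadcast the instance vector over every pixel covered by its mask.
        control[:, mask.bool()] = emb.unsqueeze(-1)
    return control

# Toy usage: two instances, stand-in embeddings for CLIP / image-encoder outputs.
masks = torch.zeros(2, 64, 64)
masks[0, 10:30, 10:30] = 1
masks[1, 35:60, 20:50] = 1
latent_control = build_latent_control(masks, torch.randn(2, 768))
print(latent_control.shape)  # torch.Size([768, 64, 64])
```

In OmniBooth, a spatial feature of this kind conditions a ControlNet-style branch of the diffusion model rather than being used on its own.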
Related papers
- Generating Compositional Scenes via Text-to-image RGBA Instance Generation [82.63805151691024]
Text-to-image diffusion generative models can generate high-quality images at the cost of tedious prompt engineering.
We propose a novel multi-stage generation paradigm that is designed for fine-grained control, flexibility and interactivity.
Our experiments show that our RGBA diffusion model is capable of generating diverse and high quality instances with precise control over object attributes.
arXiv Detail & Related papers (2024-11-16T23:44:14Z) - PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions [66.92809850624118]
PixWizard is an image-to-image visual assistant designed for image generation, manipulation, and translation based on free-form language instructions.
We unify a variety of vision tasks within a single image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning dataset.
Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions.
arXiv Detail & Related papers (2024-09-23T17:59:46Z) - AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation.
AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process.
This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z) - Caption Anything: Interactive Image Description with Diverse Multimodal Controls [14.628597750669275]
Controllable image captioning aims to describe an image in natural language according to human intent.
We present Caption AnyThing, a foundation-model-augmented image captioning framework.
Powered by the Segment Anything Model (SAM) and ChatGPT, it unifies visual and language prompts within a modularized framework.
arXiv Detail & Related papers (2023-05-04T09:48:22Z) - FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
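A rough sketch of that iterative CLIP-space edit, assuming OpenAI's `clip` package. FlexIT actually optimizes a VQGAN latent code and uses several regularization terms; this simplified version optimizes pixels directly with a single proximity penalty, and the mixing weight, step count, and `edit` function are illustrative:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

model, _ = clip.load("ViT-B/32", device="cpu")
model.eval()

def edit(image, text, steps=100, alpha=0.6, reg=10.0, lr=0.05):
    # image: (1, 3, 224, 224) tensor already preprocessed for CLIP
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(clip.tokenize([text]))
        target = (1 - alpha) * img_emb + alpha * txt_emb   # mixed image/text target point
        target = target / target.norm(dim=-1, keepdim=True)

    x = image.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        emb = model.encode_image(x)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        loss = 1 - (emb * target).sum()                    # cosine distance to the target
        loss = loss + reg * (x - image).pow(2).mean()      # crude stand-in for FlexIT's regularizers
        opt.zero_grad(); loss.backward(); opt.step()
    return x.detach()
```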
arXiv Detail & Related papers (2022-03-09T13:34:38Z) - Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z) - Learning Multimodal Affinities for Textual Editing in Images [18.7418059568887]
We devise a generic unsupervised technique to learn multimodal affinities between textual entities in a document-image.
We then use these learned affinities to automatically cluster the textual entities in the image into different semantic groups.
We show that our technique can operate on highly varying images spanning a wide range of documents and demonstrate its applicability for various editing operations.
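A loose sketch of the affinity-then-cluster idea, assuming each detected textual entity already comes with a visual feature of its crop and an embedding of its recognized text. The cosine affinity and the use of scikit-learn's SpectralClustering are stand-ins for the paper's learned multimodal affinities:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import normalize

def cluster_entities(visual_feats, text_feats, n_groups=4):
    # visual_feats: (N, Dv) appearance features of each textual entity's crop
    # text_feats:   (N, Dt) embeddings of the recognized words/phrases
    joint = np.concatenate([normalize(visual_feats), normalize(text_feats)], axis=1)
    joint = normalize(joint)
    affinity = (joint @ joint.T + 1.0) / 2.0   # pairwise cosine affinities mapped to [0, 1]
    labels = SpectralClustering(
        n_clusters=n_groups, affinity="precomputed", random_state=0
    ).fit_predict(affinity)
    return labels  # one semantic-group id per entity

# Toy usage with random stand-in features for 12 entities.
rng = np.random.default_rng(0)
print(cluster_entities(rng.normal(size=(12, 64)), rng.normal(size=(12, 32)), n_groups=3))
```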
arXiv Detail & Related papers (2021-03-18T10:09:57Z) - TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
A visual-linguistic similarity module learns text-image matching by projecting images and text into a common embedding space.
Instance-level optimization preserves identity during manipulation.
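A small sketch of the visual-linguistic similarity component, assuming generic image and text features; the projection dimensions and encoders are placeholders, not TediGAN's actual modules:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityHead(nn.Module):
    # Projects image and text features into one common space and scores the match.
    def __init__(self, img_dim=512, txt_dim=300, common_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, common_dim)
        self.txt_proj = nn.Linear(txt_dim, common_dim)

    def forward(self, img_feat, txt_feat):
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return (z_img * z_txt).sum(-1)   # cosine similarity per (image, text) pair

# During manipulation, a score like this can be maximized over the inverted
# StyleGAN latent code, with a penalty keeping the code close to the original
# (the instance-level identity term).
head = SimilarityHead()
print(head(torch.randn(4, 512), torch.randn(4, 300)).shape)  # torch.Size([4])
```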
arXiv Detail & Related papers (2020-12-06T16:20:19Z) - Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal, including (1) a reference image and (2) an instruction in natural language that describes the desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.