Generate Anything Anywhere in Any Scene
- URL: http://arxiv.org/abs/2306.17154v1
- Date: Thu, 29 Jun 2023 17:55:14 GMT
- Title: Generate Anything Anywhere in Any Scene
- Authors: Yuheng Li, Haotian Liu, Yangming Wen, Yong Jae Lee
- Abstract summary: We propose a controllable text-to-image diffusion model for personalized object generation.
Our approach demonstrates significant potential for various applications, such as those in art, entertainment, and advertising design.
- Score: 25.75076439397536
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text-to-image diffusion models have attracted considerable interest due to
their wide applicability across diverse fields. However, challenges persist in
creating controllable models for personalized object generation. In this paper,
we first identify the entanglement issues in existing personalized generative
models, and then propose a straightforward and efficient data augmentation
training strategy that guides the diffusion model to focus solely on object
identity. By inserting the plug-and-play adapter layers from a pre-trained
controllable diffusion model, our model obtains the ability to control the
location and size of each generated personalized object. During inference, we
propose a regionally-guided sampling technique to maintain the quality and
fidelity of the generated images. Our method achieves comparable or superior
fidelity for personalized objects, yielding a robust, versatile, and
controllable text-to-image diffusion model that is capable of generating
realistic and personalized images. Our approach demonstrates significant
potential for various applications, such as those in art, entertainment, and
advertising design.
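The regionally-guided sampling mentioned in the abstract can be illustrated with a minimal toy sketch (this is an assumption-laden illustration, not the paper's actual implementation): at each denoising step, a region-conditioned noise prediction is blended with a global prediction using a binary mask, so personalized-object guidance applies only inside its target box. All names, shapes, and the step size below are hypothetical.

```python
import numpy as np

def regionally_guided_step(x, eps_global, eps_region, mask):
    """One toy denoising update: use the region-conditioned noise
    prediction inside the mask and the global prediction elsewhere.
    `mask` is 1 inside the object's target box, 0 outside."""
    eps = mask * eps_region + (1.0 - mask) * eps_global
    # Toy fixed step size; a real sampler follows a DDPM/DDIM schedule.
    return x - 0.1 * eps

# Toy 8x8 "latent" with a 4x4 target box in the top-left corner.
x = np.zeros((8, 8))
mask = np.zeros((8, 8))
mask[:4, :4] = 1.0
eps_global = np.ones((8, 8))       # stand-in global noise prediction
eps_region = np.full((8, 8), 2.0)  # stand-in region-conditioned prediction

x_next = regionally_guided_step(x, eps_global, eps_region, mask)
```

The design point is that the blend keeps the background governed by the ordinary text-conditioned prediction, which is one plausible way to preserve overall image quality while controlling where the personalized object appears.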
Related papers
- Towards Reliable Verification of Unauthorized Data Usage in Personalized Text-to-Image Diffusion Models [23.09033991200197]
New personalization techniques have been proposed to customize the pre-trained base models for crafting images with specific themes or styles.
Such lightweight solutions raise a new concern: whether a personalized model was trained on unauthorized data.
We introduce SIREN, a novel methodology to proactively trace unauthorized data usage in black-box personalized text-to-image diffusion models.
arXiv Detail & Related papers (2024-10-14T12:29:23Z)
- Diffusion Models For Multi-Modal Generative Modeling [32.61765315067488]
We propose a principled way to define a diffusion model by constructing a unified multi-modal diffusion model in a common diffusion space.
We propose several multimodal generation settings to verify our framework, including image transition, masked-image training, joint image-label and joint image-representation generative modeling.
arXiv Detail & Related papers (2024-07-24T18:04:17Z)
- InsertDiffusion: Identity Preserving Visualization of Objects through a Training-Free Diffusion Architecture [0.0]
InsertDiffusion is a training-free diffusion architecture that efficiently embeds objects into images.
Our approach utilizes off-the-shelf generative models and eliminates the need for fine-tuning.
By decomposing the generation task into independent steps, InsertDiffusion offers a scalable solution.
arXiv Detail & Related papers (2024-07-15T10:15:58Z)
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation [49.997839600988875]
Existing personalization methods rely on finetuning a text-to-image foundation model on a user's custom dataset.
We propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model.
Our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
arXiv Detail & Related papers (2024-07-08T17:59:02Z)
- Regularized Training with Generated Datasets for Name-Only Transfer of Vision-Language Models [36.59260354292177]
Recent advancements in text-to-image generation have inspired researchers to generate datasets tailored for perception models using generative models.
We aim to fine-tune vision-language models to a specific classification model without access to any real images.
Despite the high fidelity of generated images, we observed a significant performance degradation when fine-tuning the model using the generated datasets.
arXiv Detail & Related papers (2024-06-08T10:43:49Z)
- YaART: Yet Another ART Rendering Technology [119.09155882164573]
This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences.
We analyze how these choices affect both the efficiency of the training process and the quality of the generated images.
We demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets.
arXiv Detail & Related papers (2024-04-08T16:51:19Z)
- Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond [87.1712108247199]
Our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP).
We develop a generic, personalized generative framework that can handle a wide range of personalized needs.
Our methodology enhances the capabilities of foundational language models for personalized tasks.
arXiv Detail & Related papers (2024-03-15T20:21:31Z)
- Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors [56.82596340418697]
We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors.
Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages.
The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
arXiv Detail & Related papers (2024-01-29T10:36:57Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- Cross-domain Compositing with Pretrained Diffusion Models [34.98199766006208]
We employ a localized, iterative refinement scheme which infuses the injected objects with contextual information derived from the background scene.
Our method produces higher quality and realistic results without requiring any annotations or training.
arXiv Detail & Related papers (2023-02-20T18:54:04Z)
- Extracting Training Data from Diffusion Models [77.11719063152027]
We show that diffusion models memorize individual images from their training data and emit them at generation time.
With a generate-and-filter pipeline, we extract over a thousand training examples from state-of-the-art models.
We train hundreds of diffusion models in various settings to analyze how different modeling and data decisions affect privacy.
arXiv Detail & Related papers (2023-01-30T18:53:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.