ObjectComposer: Consistent Generation of Multiple Objects Without Fine-tuning
- URL: http://arxiv.org/abs/2310.06968v1
- Date: Tue, 10 Oct 2023 19:46:58 GMT
- Title: ObjectComposer: Consistent Generation of Multiple Objects Without Fine-tuning
- Authors: Alec Helbling, Evan Montoya, Duen Horng Chau
- Abstract summary: We introduce ObjectComposer for generating compositions of multiple objects that resemble user-specified images.
Our approach is training-free, leveraging the abilities of preexisting models.
- Score: 25.033615513933192
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent text-to-image generative models can generate high-fidelity images from
text prompts. However, these models struggle to consistently generate the same
objects in different contexts with the same appearance. Consistent object
generation is important to many downstream tasks like generating comic book
illustrations with consistent characters and setting. Numerous approaches
attempt to solve this problem by extending the vocabulary of diffusion models
through fine-tuning. However, even lightweight fine-tuning approaches can be
prohibitively expensive to run at scale and in real-time. We introduce a method
called ObjectComposer for generating compositions of multiple objects that
resemble user-specified images. Our approach is training-free, leveraging the
abilities of preexisting models. We build upon the recent BLIP-Diffusion model,
which can generate images of single objects specified by reference images.
ObjectComposer enables the consistent generation of compositions containing
multiple specific objects simultaneously, all without modifying the weights of
the underlying models.
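The abstract does not detail ObjectComposer's composition mechanism, but the single-object building block it relies on, BLIP-Diffusion, is exposed through the Hugging Face diffusers library. The sketch below illustrates only that building block: generating one subject at a time from a reference image with frozen weights. The model identifier, subject categories, prompts, and image paths are illustrative assumptions, and the multi-object composition step itself is not shown.

```python
# Minimal sketch of the single-object building block (BLIP-Diffusion via the
# Hugging Face diffusers library). This is NOT ObjectComposer itself; the
# multi-object composition step is not described in the abstract.
import torch
from diffusers.pipelines import BlipDiffusionPipeline
from diffusers.utils import load_image

# Load the pretrained BLIP-Diffusion pipeline; its weights stay frozen.
pipe = BlipDiffusionPipeline.from_pretrained(
    "Salesforce/blipdiffusion", torch_dtype=torch.float16
).to("cuda")

# One reference image per subject (categories, prompts, and paths are
# illustrative placeholders).
subjects = [
    ("dog", "a dog sitting on a park bench", "dog.jpg"),
    ("backpack", "a backpack lying on a park bench", "backpack.jpg"),
]

for category, prompt, path in subjects:
    reference = load_image(path)  # accepts a local path or URL
    image = pipe(
        prompt,       # text prompt describing the target scene
        reference,    # reference image of the subject
        category,     # source subject category
        category,     # target subject category
        guidance_scale=7.5,
        num_inference_steps=25,
        height=512,
        width=512,
    ).images[0]
    image.save(f"{category}.png")
```

Each call produces a single-subject image conditioned on its reference; composing several such subjects into one scene is the part ObjectComposer adds on top, which this sketch does not attempt.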
Related papers
- SINGAPO: Single Image Controlled Generation of Articulated Parts in Objects [20.978091381109294]
We propose a method to generate articulated objects from a single image.
Our method generates an articulated object that is visually consistent with the input image.
Our experiments show that our method outperforms the state-of-the-art in articulated object creation.
arXiv Detail & Related papers (2024-10-21T20:41:32Z)
- Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation [10.416673784744281]
We propose a weighted-merge method to merge multiple reference image features into corresponding objects.
Our method achieves superior performance to state-of-the-art methods on the Concept101 and DreamBooth datasets for multi-object personalized image generation.
arXiv Detail & Related papers (2024-09-26T15:04:13Z)
- Iterative Object Count Optimization for Text-to-image Diffusion Models [59.03672816121209]
Current models, which learn from image-text pairs, inherently struggle with counting.
We propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential (a generic sketch of this kind of loss-guided optimization appears after the related-papers list).
We evaluate the generation of various objects and show significant improvements in accuracy.
arXiv Detail & Related papers (2024-08-21T15:51:46Z)
- Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding [7.893308498886083]
Our proposed method aims to address the challenges of generalizability and fidelity in an object-driven way.
A prototypical embedding is initialized from the object's appearance and its class before the diffusion model is fine-tuned.
Our method outperforms several existing works.
arXiv Detail & Related papers (2024-01-28T17:11:42Z)
- Unlocking Spatial Comprehension in Text-to-Image Diffusion Models [33.99474729408903]
CompFuser is an image generation pipeline that enhances spatial comprehension and attribute assignment in text-to-image generative models.
Our pipeline enables the interpretation of instructions defining spatial relationships between objects in a scene.
arXiv Detail & Related papers (2023-11-28T19:00:02Z)
- LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z)
- Diffusion Self-Guidance for Controllable Image Generation [106.59989386924136]
Self-guidance provides greater control over generated images by guiding the internal representations of diffusion models.
We show how a simple set of properties can be composed to perform challenging image manipulations.
We also show that self-guidance can be used to edit real images.
arXiv Detail & Related papers (2023-06-01T17:59:56Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- Context-Aware Layout to Image Generation with Enhanced Object Appearance [123.62597976732948]
A layout to image (L2I) generation model aims to generate a complicated image containing multiple objects (things) against a natural background (stuff).
Existing L2I models have made great progress, but object-to-object and object-to-stuff relations are often broken.
We argue that these are caused by the lack of context-aware object and stuff feature encoding in their generators, and location-sensitive appearance representation in their discriminators.
arXiv Detail & Related papers (2021-03-22T14:43:25Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric, which is better suited for multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
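The "Iterative Object Count Optimization" entry above steers generation with a counting loss, but its actual counting model and the point in the diffusion process where the loss is applied are not specified here. The sketch below is therefore only a generic illustration of loss-guided optimization under stated assumptions: a stand-in differentiable counter (a hypothetical placeholder) produces a soft count, and an image tensor is nudged toward a target count by gradient descent; in the paper this signal would guide diffusion sampling rather than raw pixels.

```python
# Generic sketch of counting-loss-guided optimization; StandInCounter is a
# hypothetical placeholder, not the counting model used in the paper.
import torch
import torch.nn as nn


class StandInCounter(nn.Module):
    """Toy differentiable counter: aggregates per-pixel objectness into a soft count."""

    def __init__(self):
        super().__init__()
        self.objectness = nn.Conv2d(3, 1, kernel_size=3, padding=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Sum a per-pixel "potential" into one soft count per image.
        return torch.sigmoid(self.objectness(image)).sum(dim=(1, 2, 3))


def optimize_for_count(image: torch.Tensor, target_count: float, steps: int = 50) -> torch.Tensor:
    """Nudge an image tensor toward a target object count via a counting loss."""
    counter = StandInCounter()
    image = image.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([image], lr=1e-2)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = (counter(image) - target_count).pow(2).mean()  # counting loss
        loss.backward()
        optimizer.step()
    return image.detach()


if __name__ == "__main__":
    generated = torch.rand(1, 3, 64, 64)  # stand-in for a generated image
    adjusted = optimize_for_count(generated, target_count=3.0)
```

Optimizing pixels directly is the simplest possible stand-in; applying the same loss to latents or noise during sampling, as the entry describes, follows the same gradient-descent pattern.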