AnyMaker: Zero-shot General Object Customization via Decoupled Dual-Level ID Injection
- URL: http://arxiv.org/abs/2406.11643v3
- Date: Fri, 5 Jul 2024 13:10:51 GMT
- Title: AnyMaker: Zero-shot General Object Customization via Decoupled Dual-Level ID Injection
- Authors: Lingjie Kong, Kai Wu, Xiaobin Hu, Wenhui Han, Jinlong Peng, Chengming Xu, Donghao Luo, Jiangning Zhang, Chengjie Wang, Yanwei Fu
- Abstract summary: We introduce AnyMaker, a framework capable of generating general objects with high ID fidelity and flexible text editability.
The efficacy of AnyMaker stems from its novel general ID extraction, dual-level ID injection, and ID-aware decoupling.
To validate our approach and boost the research of general object customization, we create the first large-scale general ID dataset.
- Score: 72.41427550339296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image based object customization, aiming to generate images with the same identity (ID) as objects of interest in accordance with text prompts and reference images, has made significant progress. However, recent customization research is dominated by specialized tasks, such as human customization or virtual try-on, leaving a gap in general object customization. To this end, we introduce AnyMaker, an innovative zero-shot object customization framework capable of generating general objects with high ID fidelity and flexible text editability. The efficacy of AnyMaker stems from its novel general ID extraction, dual-level ID injection, and ID-aware decoupling. Specifically, the general ID extraction module extracts sufficient ID information with an ensemble of self-supervised models to tackle the diverse customization tasks for general objects. Then, to supply the diffusion UNet with as much of the extracted ID information as possible without damaging text editability during generation, we design a global-local dual-level ID injection module, in which the global-level semantic ID is injected into the text descriptions while the local-level ID details are injected directly into the model through newly added cross-attention modules. In addition, we propose an ID-aware decoupling module to disentangle ID-related information from non-ID elements in the extracted representations, enabling high-fidelity generation with respect to both identity and text descriptions. To validate our approach and advance research on general object customization, we create the first large-scale general ID dataset, the Multi-Category ID-Consistent (MC-IDC) dataset, with 315k text-image samples and 10k categories. Experiments show that AnyMaker achieves remarkable performance in general object customization and outperforms specialized methods on their corresponding tasks. Code and dataset will be released soon.
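The dual-level design described in the abstract can be pictured as two small conditioning paths around a standard text-conditioned diffusion UNet: a global ID embedding is fused with the text-encoder output, while local ID features enter the network through an extra cross-attention layer. The following is a minimal PyTorch sketch of that idea only, not the authors' implementation; all module names, feature dimensions, and the token-concatenation fusion scheme are illustrative assumptions.

```python
# Illustrative sketch (assumptions, not the AnyMaker code) of dual-level ID injection:
# global semantic ID -> appended to the text-conditioning tokens;
# local ID details   -> injected via a newly added cross-attention block.
import torch
import torch.nn as nn


class GlobalIDInjection(nn.Module):
    """Projects a pooled (semantic-level) ID embedding into a few extra tokens
    appended to the text sequence, so the existing text cross-attention sees both."""

    def __init__(self, id_dim: int = 1024, text_dim: int = 768, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(id_dim, text_dim * num_tokens)

    def forward(self, text_emb: torch.Tensor, global_id: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, L, text_dim), global_id: (B, id_dim)
        b = global_id.shape[0]
        id_tokens = self.proj(global_id).view(b, self.num_tokens, -1)
        return torch.cat([text_emb, id_tokens], dim=1)  # (B, L + num_tokens, text_dim)


class LocalIDCrossAttention(nn.Module):
    """Extra cross-attention block: UNet hidden states attend to local
    (detail-level) ID features, added as a residual next to the original path."""

    def __init__(self, hidden_dim: int = 320, id_dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=hidden_dim, num_heads=num_heads,
            kdim=id_dim, vdim=id_dim, batch_first=True,
        )

    def forward(self, hidden: torch.Tensor, local_id: torch.Tensor) -> torch.Tensor:
        # hidden: (B, N, hidden_dim) flattened UNet tokens; local_id: (B, M, id_dim)
        attn_out, _ = self.attn(self.norm(hidden), local_id, local_id)
        return hidden + attn_out  # residual, so the base UNet path is preserved


if __name__ == "__main__":
    B = 2
    text_emb = torch.randn(B, 77, 768)      # e.g. CLIP text embeddings
    global_id = torch.randn(B, 1024)        # pooled ID feature (assumed shape)
    local_id = torch.randn(B, 256, 1024)    # patch-level ID features (assumed shape)
    hidden = torch.randn(B, 4096, 320)      # flattened UNet feature map

    cond = GlobalIDInjection()(text_emb, global_id)
    out = LocalIDCrossAttention()(hidden, local_id)
    print(cond.shape, out.shape)  # (2, 81, 768) and (2, 4096, 320)
```

The residual form of the local-attention block mirrors the abstract's claim that the text path is left intact: the global ID only extends the conditioning sequence, and the local ID enters through an added module rather than replacing existing attention.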
Related papers
- AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation [14.68987039472664]
We propose AnyStory, a unified approach for personalized subject generation.
AnyStory achieves high-fidelity personalization not only for single subjects but also for multiple subjects, without sacrificing subject fidelity.
arXiv Detail & Related papers (2025-01-16T12:28:39Z)
- DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting [63.01425442236011]
We present DreamMix, a diffusion-based generative model adept at inserting target objects into scenes at user-specified locations.
We propose an Attribute Decoupling Mechanism (ADM) and a Textual Attribute Substitution (TAS) module to improve the diversity and discriminative capability of the text-based attribute guidance.
arXiv Detail & Related papers (2024-11-26T08:44:47Z)
- UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization [10.760799194716922]
UniPortrait is an innovative human image personalization framework that unifies single- and multi-ID customization.
UniPortrait consists of only two plug-and-play modules: an ID embedding module and an ID routing module.
arXiv Detail & Related papers (2024-08-12T06:27:29Z)
- Customizing Text-to-Image Diffusion with Object Viewpoint Control [53.621518249820745]
We introduce a new task -- enabling explicit control of the object viewpoint in the customization of text-to-image diffusion models.
This allows us to modify the custom object's properties and generate it in various background scenes via text prompts.
We propose to condition the diffusion process on the 3D object features rendered from the target viewpoint.
arXiv Detail & Related papers (2024-04-18T16:59:51Z)
- LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z)
- Conditional Cross Attention Network for Multi-Space Embedding without Entanglement in Only a SINGLE Network [1.8899300124593648]
We propose a Conditional Cross-Attention Network that induces disentangled multi-space embeddings for various specific attributes with only a single backbone.
Our proposed method achieved consistent state-of-the-art performance on the FashionAI, DARN, DeepFashion, and Zappos50K benchmark datasets.
arXiv Detail & Related papers (2023-07-25T04:48:03Z)
- Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning [6.288699905490906]
We propose Subject-Diffusion, a novel open-domain personalized image generation model.
Our method outperforms other SOTA frameworks in single, multiple, and human customized image generation.
arXiv Detail & Related papers (2023-07-21T08:09:47Z)
- AnyDoor: Zero-shot Object-level Image Customization [63.44307304097742]
This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations.
Our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage.
arXiv Detail & Related papers (2023-07-18T17:59:02Z)
- High-Quality Entity Segmentation [110.55724145851725]
CropFormer is designed to tackle the intractability of instance-level segmentation on high-resolution images.
It improves mask prediction by fusing high-resolution image crops, which provide finer-grained detail, with the full image.
With CropFormer, we achieve a significant AP gain of 1.9 on the challenging entity segmentation task.
arXiv Detail & Related papers (2022-11-10T18:58:22Z)