RelationBooth: Towards Relation-Aware Customized Object Generation
- URL: http://arxiv.org/abs/2410.23280v2
- Date: Tue, 05 Nov 2024 05:28:46 GMT
- Title: RelationBooth: Towards Relation-Aware Customized Object Generation
- Authors: Qingyu Shi, Lu Qi, Jianzong Wu, Jinbin Bai, Jingbo Wang, Yunhai Tong, Xiangtai Li, Ming-Hsuan Yang
- Abstract summary: We introduce RelationBooth, a framework that disentangles identity and relation learning through a well-curated dataset.
Our training data consists of relation-specific images, independent object images containing identity information, and text prompts to guide relation generation.
First, we introduce a keypoint matching loss that effectively guides the model in adjusting object poses closely tied to their relationships.
Second, we incorporate local features from the image prompts to better distinguish between objects, preventing confusion in overlapping cases.
- Score: 32.762475563341525
- Abstract: Customized image generation is crucial for delivering personalized content based on user-provided image prompts, aligning large-scale text-to-image diffusion models with individual needs. However, existing models often overlook the relationships between customized objects in generated images. This work addresses that gap by focusing on relation-aware customized image generation, which aims to preserve the identities from image prompts while maintaining the predicate relations described in text prompts. Specifically, we introduce RelationBooth, a framework that disentangles identity and relation learning through a well-curated dataset. Our training data consists of relation-specific images, independent object images containing identity information, and text prompts to guide relation generation. Then, we propose two key modules to tackle the two main challenges: generating accurate and natural relations, especially when significant pose adjustments are required, and avoiding object confusion in cases of overlap. First, we introduce a keypoint matching loss that effectively guides the model in adjusting object poses closely tied to their relationships. Second, we incorporate local features from the image prompts to better distinguish between objects, preventing confusion in overlapping cases. Extensive results on three benchmarks demonstrate the superiority of RelationBooth in generating precise relations while preserving object identities across a diverse set of objects and relations. The source code and trained models will be made available to the public.
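The abstract does not spell out the exact form of the keypoint matching loss, so the following is only a minimal sketch of one plausible formulation: a visibility-masked squared distance between keypoints extracted from the generated output and from the relation-specific target image. The function name, tensor shapes, and masking scheme are assumptions for illustration, not the paper's implementation.

```python
import torch

def keypoint_matching_loss(pred_keypoints: torch.Tensor,
                           target_keypoints: torch.Tensor,
                           visibility: torch.Tensor) -> torch.Tensor:
    """Visibility-masked squared distance between predicted and target keypoints.

    pred_keypoints, target_keypoints: (B, K, 2) normalized (x, y) coordinates.
    visibility: (B, K) mask with 1 where a keypoint is annotated/visible.
    """
    dist = ((pred_keypoints - target_keypoints) ** 2).sum(dim=-1)  # (B, K) squared distances
    # Average only over visible keypoints; clamp avoids division by zero.
    return (dist * visibility).sum() / visibility.sum().clamp(min=1.0)
```

In practice such a term would typically be added to the standard diffusion denoising loss with a small weighting factor; how the keypoints are obtained from the generated image is not described in the abstract.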
Related papers
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
- Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN).
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
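As a rough illustration of the idea of adjusting text-encoder embeddings with a small graph module, here is a toy stand-in (not the paper's HGCN): the relation token embedding is refined by messages passed from the subject and object token embeddings. The class name, dimensions, and update rule are assumptions.

```python
import torch
import torch.nn as nn

class ToyRelationEmbeddingAdjuster(nn.Module):
    """Toy stand-in for a lightweight graph module that refines the relation
    token embedding using messages from the subject and object embeddings."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.msg = nn.Linear(dim, dim)          # subject/object -> relation messages
        self.update = nn.Linear(2 * dim, dim)   # fuse message with the relation embedding

    def forward(self, subj: torch.Tensor, rel: torch.Tensor, obj: torch.Tensor) -> torch.Tensor:
        # subj, rel, obj: (B, dim) embeddings of the corresponding prompt tokens.
        message = self.msg(subj) + self.msg(obj)
        # Residual adjustment; the result replaces the relation token embedding
        # in the prompt sequence fed to the diffusion model.
        return rel + self.update(torch.cat([rel, message], dim=-1))
```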
arXiv Detail & Related papers (2024-03-29T15:54:36Z)
- Counterfactual Image Editing [54.21104691749547]
Counterfactual image editing is an important task in generative AI, which asks how an image would look if certain features were different.
We formalize the counterfactual image editing task using formal language, modeling the causal relationships between latent generative factors and images.
We develop an efficient algorithm to generate counterfactual images by leveraging neural causal models.
arXiv Detail & Related papers (2024-02-07T20:55:39Z)
- ReVersion: Diffusion-Based Relation Inversion from Images [31.61407278439991]
We propose ReVersion for the Relation Inversion task, which aims to learn a specific relation from exemplar images.
We learn a relation prompt from a frozen pre-trained text-to-image diffusion model.
The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles.
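A minimal sketch of the general relation-inversion recipe, assuming a diffusers-style UNet and noise scheduler interface: a single learnable relation embedding <R> is optimized against the frozen model's denoising loss on exemplar images sharing the relation. The helper `encode_prompt_with_token` is a hypothetical placeholder for injecting <R> into the prompt embeddings, and any additional objectives the paper uses beyond the plain denoising loss are not shown.

```python
import torch
import torch.nn.functional as F

def invert_relation_prompt(unet, latents_loader, noise_scheduler,
                           encode_prompt_with_token, num_steps: int = 1000,
                           dim: int = 768) -> torch.Tensor:
    """Optimize one relation embedding <R> against a frozen denoiser.

    All arguments are placeholders for components of a Stable-Diffusion-like
    pipeline supplied by the caller; only the <R> embedding receives gradients.
    latents_loader yields (latents, prompt_template) pairs, e.g. "a cat <R> a box".
    """
    rel_embed = torch.randn(1, dim, requires_grad=True)
    opt = torch.optim.AdamW([rel_embed], lr=1e-4)
    for _, (latents, prompt) in zip(range(num_steps), latents_loader):
        noise = torch.randn_like(latents)
        t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                          (latents.size(0),), device=latents.device)
        noisy = noise_scheduler.add_noise(latents, noise, t)
        cond = encode_prompt_with_token(prompt, rel_embed)   # inject <R> into the prompt embeddings
        pred = unet(noisy, t, encoder_hidden_states=cond).sample
        loss = F.mse_loss(pred, noise)                       # plain denoising loss
        loss.backward(); opt.step(); opt.zero_grad()
    return rel_embed.detach()
```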
arXiv Detail & Related papers (2023-03-23T17:56:10Z)
- Relationformer: A Unified Framework for Image-to-Graph Generation [18.832626244362075]
This work proposes a unified one-stage transformer-based framework, namely Relationformer, that jointly predicts objects and their relations.
We leverage direct set-based object prediction and incorporate the interaction among the objects to learn an object-relation representation jointly.
We achieve state-of-the-art performance on multiple, diverse and multi-domain datasets.
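To make the set-based idea concrete, below is a toy DETR-style head that jointly predicts object classes, boxes, and pairwise relation logits from learned queries. It illustrates joint object-relation set prediction only; it is not Relationformer's actual architecture, and all dimensions and head designs are assumptions.

```python
import torch
import torch.nn as nn

class ToySetPredictionWithRelations(nn.Module):
    """Toy head: learned queries decode against image features, then objects
    and pairwise relations are classified jointly from the query tokens."""

    def __init__(self, dim=256, num_queries=20, num_obj_classes=80, num_rel_classes=10):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.obj_cls = nn.Linear(dim, num_obj_classes + 1)       # +1 for "no object"
        self.box_head = nn.Linear(dim, 4)                        # normalized cx, cy, w, h
        self.rel_cls = nn.Linear(2 * dim, num_rel_classes + 1)   # relation between query pairs

    def forward(self, image_features):                           # (B, HW, dim) flattened features
        q = self.queries.unsqueeze(0).expand(image_features.size(0), -1, -1)
        tokens = self.decoder(q, image_features)                 # (B, Q, dim) object tokens
        boxes = self.box_head(tokens).sigmoid()
        classes = self.obj_cls(tokens)
        # Pairwise relation logits from concatenated subject/object tokens.
        Q = tokens.size(1)
        subj = tokens.unsqueeze(2).expand(-1, Q, Q, -1)
        objt = tokens.unsqueeze(1).expand(-1, Q, Q, -1)
        rels = self.rel_cls(torch.cat([subj, objt], dim=-1))     # (B, Q, Q, num_rel + 1)
        return classes, boxes, rels
```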
arXiv Detail & Related papers (2022-03-19T00:36:59Z)
- Learning to Compose Visual Relations [100.45138490076866]
We propose to represent each relation as an unnormalized density (an energy-based model).
We show that such a factorized decomposition allows the model to both generate and edit scenes with multiple sets of relations more faithfully.
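A minimal sketch of the composition pattern: each relation contributes an energy term, the terms are summed, and samples are drawn with Langevin-style updates on the combined landscape. The energy callables, step sizes, and initialization are placeholders; this shows the generic EBM-composition idea rather than the paper's exact procedure.

```python
import torch

def compose_relation_energies(energies, x_init, steps: int = 60,
                              step_size: float = 0.01, noise_scale: float = 0.005):
    """Sample from the composed model E(x) = sum_i E_i(x) via Langevin-style updates.

    energies: list of callables E_i(x) -> per-sample scalar energies, each scoring
    how well x satisfies one relation. x_init: initial image/latent tensor.
    """
    x = x_init.clone().requires_grad_(True)
    for _ in range(steps):
        total = sum(e(x).sum() for e in energies)      # combined energy over all relations
        grad, = torch.autograd.grad(total, x)
        with torch.no_grad():
            x -= step_size * grad                      # descend the combined energy
            x += noise_scale * torch.randn_like(x)     # Langevin noise term
    return x.detach()
```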
arXiv Detail & Related papers (2021-11-17T18:51:29Z)
- Exploiting Relationship for Complex-scene Image Generation [43.022978211274065]
This work explores relationship-aware complex-scene image generation, where multiple objects are inter-related as a scene graph.
We propose three major updates in the generation framework. First, reasonable spatial layouts are inferred by jointly considering the semantics and relationships among objects.
Second, since the relations between objects significantly influence an object's appearance, we design a relation-guided generator to generate objects reflecting their relationships.
Third, a novel scene graph discriminator is proposed to guarantee the consistency between the generated image and the input scene graph.
arXiv Detail & Related papers (2021-04-01T09:21:39Z)
- Context-Aware Layout to Image Generation with Enhanced Object Appearance [123.62597976732948]
A layout-to-image (L2I) generation model aims to generate a complicated image containing multiple objects (things) against a natural background (stuff).
Existing L2I models have made great progress, but object-to-object and object-to-stuff relations are often broken.
We argue that these are caused by the lack of context-aware object and stuff feature encoding in their generators, and location-sensitive appearance representation in their discriminators.
arXiv Detail & Related papers (2021-03-22T14:43:25Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images.
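A short sketch of the object-centric idea, assuming the `torchmetrics` FrechetInceptionDistance implementation: FID is computed over per-object crops (taken upstream with the layout boxes) rather than over full images. The function name and crop pipeline are assumptions, and crops may need resizing to the metric's expected input size.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def scene_fid_from_crops(real_crops, fake_crops) -> torch.Tensor:
    """Object-centric FID: compare distributions of object crops, not full images.

    real_crops / fake_crops: iterables of uint8 tensors of shape (B, 3, H, W),
    each batch holding objects cropped from real / generated images using the
    layout boxes (cropping and resizing are assumed to happen upstream).
    """
    fid = FrechetInceptionDistance(feature=2048)
    for batch in real_crops:
        fid.update(batch, real=True)
    for batch in fake_crops:
        fid.update(batch, real=False)
    return fid.compute()
```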
arXiv Detail & Related papers (2020-03-16T21:40:09Z)