ReVersion: Diffusion-Based Relation Inversion from Images
- URL: http://arxiv.org/abs/2303.13495v1
- Date: Thu, 23 Mar 2023 17:56:10 GMT
- Title: ReVersion: Diffusion-Based Relation Inversion from Images
- Authors: Ziqi Huang, Tianxing Wu, Yuming Jiang, Kelvin C.K. Chan, Ziwei Liu
- Abstract summary: We propose ReVersion for the Relation Inversion task, which aims to learn a specific relation from exemplar images.
We learn a relation prompt from a frozen pre-trained text-to-image diffusion model.
The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles.
- Score: 31.61407278439991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have gained increasing popularity for their generative
capabilities. Recently, there has been a surge of demand for generating customized
images by inverting diffusion models from exemplar images. However, existing
inversion methods mainly focus on capturing object appearances. How to invert
object relations, another important pillar in the visual world, remains
unexplored. In this work, we propose ReVersion for the Relation Inversion task,
which aims to learn a specific relation (represented as "relation prompt") from
exemplar images. Specifically, we learn a relation prompt from a frozen
pre-trained text-to-image diffusion model. The learned relation prompt can then
be applied to generate relation-specific images with new objects, backgrounds,
and styles. Our key insight is the "preposition prior" - real-world relation
prompts can be sparsely activated upon a set of basis prepositional words.
Specifically, we propose a novel relation-steering contrastive learning scheme
to impose two critical properties of the relation prompt: 1) The relation
prompt should capture the interaction between objects, enforced by the
preposition prior. 2) The relation prompt should be disentangled away from
object appearances. We further devise relation-focal importance sampling to
emphasize high-level interactions over low-level appearances (e.g., texture,
color). To comprehensively evaluate this new task, we contribute ReVersion
Benchmark, which provides various exemplar images with diverse relations.
Extensive experiments validate the superiority of our approach over existing
methods across a wide range of visual relations.
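The two training-time ingredients described in the abstract (relation-steering contrastive learning with the preposition prior, and relation-focal importance sampling) can be summarized in a short sketch. The following is a minimal, illustrative PyTorch sketch, not the authors' released code: the function names (steer_loss, sample_focal_timesteps), the InfoNCE-style formulation, and the particular timestep-skewing scheme are assumptions made for clarity.

```python
import torch
import torch.nn.functional as F

def steer_loss(relation_emb, prep_embs, negative_embs, temperature=0.07):
    """Relation-steering contrastive objective (InfoNCE-style sketch).

    Pulls the learnable relation prompt embedding toward a basis of
    prepositional word embeddings (the "preposition prior") and pushes it
    away from embeddings of other words (e.g. nouns, adjectives), so the
    prompt encodes the interaction rather than object appearance.
    """
    rel = F.normalize(relation_emb, dim=-1)    # (d,) relation prompt embedding
    pos = F.normalize(prep_embs, dim=-1)       # (P, d) preposition basis words
    neg = F.normalize(negative_embs, dim=-1)   # (N, d) appearance-related words
    logits = torch.cat([pos @ rel, neg @ rel]) / temperature
    log_prob = logits - torch.logsumexp(logits, dim=0)
    # Treat every preposition as a positive and average their log-probabilities.
    return -log_prob[: pos.shape[0]].mean()

def sample_focal_timesteps(batch_size, num_train_steps=1000, skew=2.0):
    """Relation-focal importance sampling (illustrative skew, not the paper's
    exact density): bias diffusion timesteps toward large t, where denoising
    is governed by layout and interaction rather than texture and color."""
    u = torch.rand(batch_size) ** (1.0 / (1.0 + skew))  # concentrates near 1
    return (u * (num_train_steps - 1)).long()
```

In this reading, only the relation prompt's text embedding is optimized while the text-to-image diffusion model stays frozen: the steering term keeps the prompt close to a sparse set of prepositional basis embeddings, and the denoising loss is evaluated at timesteps skewed toward high noise levels.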
Related papers
- RelationBooth: Towards Relation-Aware Customized Object Generation [32.762475563341525]
We introduce RelationBooth, a framework that disentangles identity and relation learning through a well-curated dataset.
Our training data consists of relation-specific images, independent object images containing identity information, and text prompts to guide relation generation.
First, we introduce a keypoint matching loss that effectively guides the model in adjusting object poses closely tied to their relationships.
Second, we incorporate local features from the image prompts to better distinguish between objects, preventing confusion in overlapping cases.
arXiv Detail & Related papers (2024-10-30T17:57:21Z) - Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts that steer diffusion models to generate images depicting specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z) - ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding [42.10086029931937]
Visual grounding aims to localize the object referred to in an image based on a natural language query.
Existing methods demonstrate a significant performance drop when there are multiple distractions in an image.
We propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue.
arXiv Detail & Related papers (2024-08-29T07:32:01Z) - Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z) - Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN).
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z) - Visual Commonsense based Heterogeneous Graph Contrastive Learning [79.22206720896664]
We propose a heterogeneous graph contrastive learning method to better accomplish the visual reasoning task.
Our method is designed in a plug-and-play manner, so that it can be quickly and easily combined with a wide range of representative methods.
arXiv Detail & Related papers (2023-11-11T12:01:18Z) - Dual Relation Alignment for Composed Image Retrieval [24.812654620141778]
We argue for the existence of two types of relations in composed image retrieval.
The explicit relation holds between the reference image paired with its complementary text and the target image.
We propose a new framework for composed image retrieval, termed dual relation alignment.
arXiv Detail & Related papers (2023-09-05T12:16:14Z) - Object-aware Contrastive Learning for Debiased Scene Representation [74.30741492814327]
We develop a novel object-aware contrastive learning framework that localizes objects in a self-supervised manner.
We also introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases during contrastive self-supervised learning.
arXiv Detail & Related papers (2021-07-30T19:24:07Z) - ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refine the object-object connections globally.
Experiments show that the proposed method significantly improves the quality of dialogue by utilizing the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.