ReVersion: Diffusion-Based Relation Inversion from Images
- URL: http://arxiv.org/abs/2303.13495v1
- Date: Thu, 23 Mar 2023 17:56:10 GMT
- Title: ReVersion: Diffusion-Based Relation Inversion from Images
- Authors: Ziqi Huang, Tianxing Wu, Yuming Jiang, Kelvin C.K. Chan, Ziwei Liu
- Abstract summary: We propose ReVersion for the Relation Inversion task, which aims to learn a specific relation from exemplar images.
We learn a relation prompt from a frozen pre-trained text-to-image diffusion model.
The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles.
- Score: 31.61407278439991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models gain increasing popularity for their generative
capabilities. Recently, there have been surging needs to generate customized
images by inverting diffusion models from exemplar images. However, existing
inversion methods mainly focus on capturing object appearances. How to invert
object relations, another important pillar in the visual world, remains
unexplored. In this work, we propose ReVersion for the Relation Inversion task,
which aims to learn a specific relation (represented as "relation prompt") from
exemplar images. Specifically, we learn a relation prompt from a frozen
pre-trained text-to-image diffusion model. The learned relation prompt can then
be applied to generate relation-specific images with new objects, backgrounds,
and styles. Our key insight is the "preposition prior" - real-world relation
prompts can be sparsely activated upon a set of basis prepositional words.
Specifically, we propose a novel relation-steering contrastive learning scheme
to impose two critical properties of the relation prompt: 1) The relation
prompt should capture the interaction between objects, enforced by the
preposition prior. 2) The relation prompt should be disentangled away from
object appearances. We further devise relation-focal importance sampling to
emphasize high-level interactions over low-level appearances (e.g., texture,
color). To comprehensively evaluate this new task, we contribute ReVersion
Benchmark, which provides various exemplar images with diverse relations.
Extensive experiments validate the superiority of our approach over existing
methods across a wide range of visual relations.
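
To make the two mechanisms described above concrete, below is a minimal PyTorch-style sketch of how a relation-steering contrastive loss over a prepositional basis and relation-focal timestep sampling could look. This is not the authors' released implementation; the function names, the InfoNCE-style formulation, and the hyperparameters (temperature, skew exponent gamma) are illustrative assumptions.

```python
# Illustrative sketch only (not the official ReVersion code): a relation prompt
# <R> is optimized against a frozen text-to-image diffusion model, with
# (1) a contrastive loss steering <R> toward basis prepositional word
#     embeddings and away from appearance-related words, and
# (2) timestep sampling skewed toward larger (noisier) steps, where denoising
#     is governed by high-level layout/interaction rather than texture/color.
# All names and hyperparameters here are assumptions for illustration.

import torch
import torch.nn.functional as F


def relation_steering_loss(rel_embed, prep_embeds, neg_embeds, temperature=0.07):
    """InfoNCE-style loss: the relation embedding <R> should be close to the
    prepositional basis words (positives) and far from appearance words such
    as nouns/adjectives from the exemplar captions (negatives)."""
    rel = F.normalize(rel_embed, dim=-1)            # (d,)
    pos = F.normalize(prep_embeds, dim=-1)          # (P, d)
    neg = F.normalize(neg_embeds, dim=-1)           # (N, d)
    pos_sim = pos @ rel / temperature               # (P,)
    neg_sim = neg @ rel / temperature               # (N,)
    # Each preposition is treated as a positive against the shared negatives.
    logits = torch.cat(
        [pos_sim.unsqueeze(1), neg_sim.unsqueeze(0).expand(pos_sim.shape[0], -1)],
        dim=1,
    )                                               # (P, 1 + N)
    labels = torch.zeros(pos_sim.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, labels)


def relation_focal_timesteps(batch_size, num_train_steps=1000, gamma=3.0):
    """Importance sampling of diffusion timesteps with density proportional to
    t**gamma on [0, T), i.e. skewed toward larger t (high-level structure)."""
    u = torch.rand(batch_size)
    t = (u ** (1.0 / (gamma + 1.0))) * num_train_steps   # inverse-CDF sampling
    return t.long().clamp(0, num_train_steps - 1)
```

In a training loop, a loss of this kind would be added to the standard denoising objective evaluated at the sampled timesteps, with only the relation prompt's token embedding being updated while the pre-trained diffusion model stays frozen.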
Related papers
- Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation [70.95783968368124]
We introduce a novel multi-modal autoregressive model, dubbed InstaManip.
We propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages.
Our method surpasses previous few-shot image manipulation models by a notable margin.
arXiv Detail & Related papers (2024-12-02T01:19:21Z)
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding [42.10086029931937]
Visual grounding aims to localize the object referred to in an image based on a natural language query.
Existing methods demonstrate a significant performance drop when there are multiple distractions in an image.
We propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue.
arXiv Detail & Related papers (2024-08-29T07:32:01Z)
- Knowledge-augmented Few-shot Visual Relation Detection [25.457693302327637]
Visual Relation Detection (VRD) aims to detect relationships between objects for image understanding.
Most existing VRD methods rely on thousands of training samples of each relationship to achieve satisfactory performance.
We devise a knowledge-augmented, few-shot VRD framework leveraging both textual knowledge and visual relation knowledge.
arXiv Detail & Related papers (2023-03-09T15:38:40Z)
- Objects Matter: Learning Object Relation Graph for Robust Camera Relocalization [2.9005223064604078]
We propose to enhance the distinctiveness of the image features by extracting the deep relationship among objects.
In particular, we extract objects in the image and construct a deep object relation graph (ORG) to incorporate the semantic connections and relative spatial clues of the objects.
arXiv Detail & Related papers (2022-05-26T11:37:11Z)
- Object-aware Contrastive Learning for Debiased Scene Representation [74.30741492814327]
We develop a novel object-aware contrastive learning framework that localizes objects in a self-supervised manner.
We also introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases during contrastive self-supervised learning.
arXiv Detail & Related papers (2021-07-30T19:24:07Z)
- Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations [103.00383924074585]
Visual relationship detection aims to reason over relationships among salient objects in images.
We propose a novel approach named Relational Visual-Linguistic Bidirectional Encoder Representations from Transformers (RVL-BERT).
RVL-BERT performs spatial reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training.
arXiv Detail & Related papers (2020-09-10T16:15:09Z)
- ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refine the object-object connections globally.
Experiments show that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)