Learning to Compose Visual Relations
- URL: http://arxiv.org/abs/2111.09297v1
- Date: Wed, 17 Nov 2021 18:51:29 GMT
- Title: Learning to Compose Visual Relations
- Authors: Nan Liu, Shuang Li, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba
- Abstract summary: We propose to represent each relation as an unnormalized density (an energy-based model).
We show that such a factorized decomposition allows the model to both generate and edit scenes with multiple sets of relations more faithfully.
- Score: 100.45138490076866
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The visual world around us can be described as a structured set of objects
and their associated relations. An image of a room may be conjured given only
the description of the underlying objects and their associated relations. While
there has been significant work on designing deep neural networks which may
compose individual objects together, less work has been done on composing the
individual relations between objects. A principal difficulty is that while the
placement of objects is mutually independent, their relations are entangled and
dependent on each other. To circumvent this issue, existing works primarily
compose relations by utilizing a holistic encoder, in the form of text or
graphs. In this work, we instead propose to represent each relation as an
unnormalized density (an energy-based model), enabling us to compose separate
relations in a factorized manner. We show that such a factorized decomposition
allows the model to both generate and edit scenes that have multiple sets of
relations more faithfully. We further show that decomposition enables our model
to effectively understand the underlying relational scene structure. Project
page at: https://composevisualrelations.github.io/.
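To make the factorized composition concrete, below is a minimal sketch, assuming a hypothetical trained relation energy network (`relation_ebm`) and relation embeddings (`rel_a`, `rel_b`): each relation contributes one energy term, the composed distribution p(x | r_1, ..., r_n) ∝ exp(-Σ_i E(x, r_i)) is formed by summing energies, and a sample is drawn with a few steps of Langevin dynamics. This is an illustration of the idea, not the authors' released code.

```python
# Minimal sketch (not the authors' code): composing relation energies additively.
# The composed distribution is p(x | r_1..r_n) ∝ exp(-sum_i E(x, r_i)),
# sampled here with a few steps of Langevin dynamics.
import torch

def langevin_sample(energy_fns, shape, steps=60, step_size=10.0, noise=0.005):
    """Draw an approximate sample from exp(-sum of energies) via Langevin dynamics."""
    x = torch.rand(shape, requires_grad=True)            # start from noise
    for _ in range(steps):
        energy = sum(E(x) for E in energy_fns)           # factorized composition: sum of relation energies
        grad, = torch.autograd.grad(energy.sum(), x)
        with torch.no_grad():
            x = x - step_size * grad + noise * torch.randn_like(x)
        x = x.detach().requires_grad_(True)              # re-enable grads for the next step
    return x.detach()

# Usage sketch: relation_ebm, rel_a and rel_b are hypothetical stand-ins for a trained
# relation energy network and two relation embeddings (e.g. "cube left of sphere").
# sample = langevin_sample(
#     [lambda x: relation_ebm(x, rel_a), lambda x: relation_ebm(x, rel_b)],
#     shape=(1, 3, 64, 64))
```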
Related papers
- RelationBooth: Towards Relation-Aware Customized Object Generation [32.762475563341525]
We introduce RelationBooth, a framework that disentangles identity and relation learning through a well-curated dataset.
Our training data consists of relation-specific images, independent object images containing identity information, and text prompts to guide relation generation.
First, we introduce a keypoint matching loss that effectively guides the model in adjusting object poses closely tied to their relationships.
Second, we incorporate local features from the image prompts to better distinguish between objects, preventing confusion in overlapping cases.
arXiv Detail & Related papers (2024-10-30T17:57:21Z)
- Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose a solution based on a Heterogeneous Graph Convolutional Network (HGCN).
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z)
- STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning [4.676784872259775]
We propose a large-scale video dataset for understanding spatial relationships derived from prepositions of the English language.
The dataset contains 150K visual depictions (videos and images), consisting of 30 distinct spatial prepositional senses.
In addition to spatial relations, we also propose 50K visual depictions across 10 temporal relations, consisting of videos depicting event/time-point interactions.
arXiv Detail & Related papers (2023-09-13T02:35:59Z)
- Learning Attention Propagation for Compositional Zero-Shot Learning [71.55375561183523]
We propose a novel method called Compositional Attention Propagated Embedding (CAPE).
CAPE learns to identify the dependency structure among attributes, objects, and compositions, and propagates knowledge between them to learn class embeddings for all seen and unseen compositions.
We show that our method outperforms previous baselines to set a new state-of-the-art on three publicly available benchmarks.
arXiv Detail & Related papers (2022-10-20T19:44:11Z)
- ViRel: Unsupervised Visual Relations Discovery with Graph-level Analogy [65.5580334698777]
ViRel is a method for unsupervised discovery and learning of Visual Relations with graph-level analogy.
We show that our method achieves above 95% accuracy in relation classification.
It further generalizes to unseen tasks with more complicated relational structures.
arXiv Detail & Related papers (2022-07-04T16:56:45Z)
- Transformer-based Dual Relation Graph for Multi-label Image Recognition [56.12543717723385]
We propose a novel Transformer-based Dual Relation learning framework.
We explore two aspects of correlation: a structural relation graph and a semantic relation graph.
Our approach achieves new state-of-the-art on two popular multi-label recognition benchmarks.
arXiv Detail & Related papers (2021-10-10T07:14:52Z)
- Exploiting Relationship for Complex-scene Image Generation [43.022978211274065]
This work explores relationship-aware complex-scene image generation, where multiple objects are inter-related as a scene graph.
We propose three major updates in the generation framework. First, reasonable spatial layouts are inferred by jointly considering the semantics and relationships among objects.
Second, since the relations between objects significantly influence an object's appearance, we design a relation-guided generator to generate objects reflecting their relationships.
Third, a novel scene graph discriminator is proposed to guarantee the consistency between the generated image and the input scene graph.
arXiv Detail & Related papers (2021-04-01T09:21:39Z)
- RELATE: Physically Plausible Multi-Object Scene Synthesis Using Structured Latent Spaces [77.07767833443256]
We present RELATE, a model that learns to generate physically plausible scenes and videos of multiple interacting objects.
In contrast to state-of-the-art methods in object-centric generative modeling, RELATE also extends naturally to dynamic scenes and generates videos of high visual fidelity.
arXiv Detail & Related papers (2020-07-02T17:27:27Z)
- Structured Query-Based Image Retrieval Using Scene Graphs [10.475553340127394]
We present a method that uses scene graph embeddings as the basis for image retrieval.
We are able to achieve high recall even on low- to medium-frequency objects found in the long-tailed COCO-Stuff dataset.
arXiv Detail & Related papers (2020-05-13T22:40:32Z)
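As a hedged illustration of how scene-graph-embedding retrieval can work (not the paper's actual pipeline), the sketch below embeds the structured query and ranks database images by cosine similarity; `embed_graph` and the precomputed `image_embeddings` are hypothetical stand-ins for the paper's encoders.

```python
# Illustrative sketch only: nearest-neighbour retrieval in a shared embedding space.
# embed_graph is a hypothetical scene-graph encoder; image_embeddings is a hypothetical
# (N, d) tensor of precomputed image embeddings.
import torch
import torch.nn.functional as F

def retrieve(query_graph, image_embeddings, embed_graph, top_k=5):
    """Rank database images by cosine similarity to the query scene-graph embedding."""
    q = F.normalize(embed_graph(query_graph), dim=-1)   # (d,) query embedding
    db = F.normalize(image_embeddings, dim=-1)           # (N, d) image embeddings
    scores = db @ q                                       # cosine similarities
    return torch.topk(scores, k=top_k).indices            # indices of the best matches
```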