Transforming Image Generation from Scene Graphs
- URL: http://arxiv.org/abs/2207.00545v1
- Date: Fri, 1 Jul 2022 16:59:38 GMT
- Title: Transforming Image Generation from Scene Graphs
- Authors: Renato Sortino, Simone Palazzo, Concetto Spampinato
- Abstract summary: We propose a transformer-based approach conditioned on scene graphs that employs a decoder to autoregressively compose images.
The proposed architecture is composed of three modules: 1) a graph convolutional network, to encode the relationships of the input graph; 2) an encoder-decoder transformer, which autoregressively composes the output image; 3) an auto-encoder, employed to generate representations used as input/output of each generation step by the transformer.
- Score: 11.443097632746763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating images from semantic visual knowledge is a challenging
task that can be useful for conditioning the synthesis process in complex,
subtle, and unambiguous ways, compared to alternatives such as class labels
or text descriptions. Although generative methods conditioned on semantic
representations exist, they do not provide a way to control the generation
process aside from the specification of constraints between objects. For
example, the possibility of iteratively generating or modifying images by
manually adding specific items is a desirable property that, to our
knowledge, has not been fully investigated in the literature. In this work
we propose a transformer-based approach conditioned on scene graphs that,
unlike recent transformer-based methods, also employs a decoder to
autoregressively compose images, making the synthesis process more effective
and controllable. The proposed architecture is composed of three modules:
1) a graph convolutional network, to encode the relationships of the input
graph; 2) an encoder-decoder transformer, which autoregressively composes
the output image; 3) an auto-encoder, employed to generate the
representations used as input/output of each generation step by the
transformer. Results obtained on CIFAR10 and MNIST images show that our
model is able to satisfy the semantic constraints defined by a scene graph
and to model relations between visual objects in the scene, taking into
account a user-provided partial rendering of the desired target.
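
The three-module pipeline maps naturally onto standard PyTorch components. The sketch below is a minimal, hypothetical reading of the architecture: class names and tensor shapes are ours, and the assumption that the auto-encoder yields a discrete token sequence (VQ-style) is only an interpretation of "representations used as input/output of each generation step", not a detail confirmed by the abstract.

```python
import torch
import torch.nn as nn

class SceneGraphEncoder(nn.Module):
    """Module 1 (sketch): graph convolutions over (subject, predicate, object)
    triples, producing one conditioning vector per object node."""
    def __init__(self, num_objs, num_rels, dim, num_layers=3):
        super().__init__()
        self.obj_emb = nn.Embedding(num_objs, dim)
        self.rel_emb = nn.Embedding(num_rels, dim)
        self.layers = nn.ModuleList(
            [nn.Linear(3 * dim, 3 * dim) for _ in range(num_layers)]
        )

    def forward(self, objs, triples):
        # objs: (N,) object ids; triples: (E, 3) rows of (subj_idx, rel_id, obj_idx)
        x = self.obj_emb(objs)
        r = self.rel_emb(triples[:, 1])
        s, o = triples[:, 0], triples[:, 2]
        for layer in self.layers:
            h = torch.relu(layer(torch.cat([x[s], r, x[o]], dim=-1)))
            h_s, r, h_o = h.chunk(3, dim=-1)
            # scatter updated subject/object messages back onto the nodes
            x = x.index_add(0, s, h_s).index_add(0, o, h_o)
        return x  # (N, dim) node features

class Graph2ImageTransformer(nn.Module):
    """Modules 2+3 (sketch): an encoder-decoder transformer predicts the next
    latent image token; the tokens are assumed to come from a separately
    trained VQ-style auto-encoder that maps image patches to a discrete
    codebook and back."""
    def __init__(self, dim, vocab_size, max_len, heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Parameter(torch.zeros(max_len, dim))
        self.tf = nn.Transformer(d_model=dim, nhead=heads, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, node_feats, prev_tokens):
        # node_feats: (B, N, dim) graph conditioning; prev_tokens: (B, T) token ids
        T = prev_tokens.size(1)
        tgt = self.tok_emb(prev_tokens) + self.pos_emb[:T]
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        out = self.tf(src=node_feats, tgt=tgt, tgt_mask=causal)
        return self.head(out)  # (B, T, vocab_size) next-token logits
```

At sampling time, the graph encoding would be computed once, tokens decoded one step at a time (optionally seeded with tokens from a user-provided partial rendering), and the finished token grid mapped back to pixels by the auto-encoder's decoder.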
Related papers
- Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation [44.457347230146404]
We leverage the scene graph, a powerful structured representation, for complex image generation.
We employ the generative capabilities of variational autoencoders and diffusion models in a generalizable manner.
Our method outperforms recent competitors based on text, layout, or scene graphs.
arXiv Detail & Related papers (2024-10-01T07:02:46Z)
- Iterative Object Count Optimization for Text-to-image Diffusion Models [59.03672816121209]
Current models, which learn from image-text pairs, inherently struggle with counting.
We propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential.
We evaluate the generation of various objects and show significant improvements in accuracy.
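
As a rough, hedged illustration of this idea (our own sketch, not the paper's implementation; names are hypothetical): if the counting model exposes a per-pixel object potential, its spatial sum acts as a differentiable "soft count" that can be penalized against the requested count and backpropagated into the generation.

```python
import torch

def soft_count_loss(potential_map: torch.Tensor, target_count: float) -> torch.Tensor:
    # potential_map: (B, H, W) per-pixel object potential from a counting model;
    # the spatial sum serves as a differentiable object count.
    soft_count = potential_map.sum(dim=(-2, -1))
    return ((soft_count - target_count) ** 2).mean()
```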
arXiv Detail & Related papers (2024-08-21T15:51:46Z)
- Cross-Image Attention for Zero-Shot Appearance Transfer [68.43651329067393]
We introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images.
We harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process.
Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint.
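
At its core, this kind of mechanism can be sketched as scaled dot-product attention in which the queries of the image being denoised attend to the keys and values of a reference image. The snippet below is our hedged approximation, with tensor names of our choosing, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_image_attention(q_out, k_ref, v_ref):
    # q_out: (B, T, d) queries from the image being generated;
    # k_ref, v_ref: (B, T, d) keys/values taken from the reference image,
    # so appearance flows from the reference into the output.
    scale = q_out.size(-1) ** -0.5
    attn = F.softmax((q_out @ k_ref.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v_ref
```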
arXiv Detail & Related papers (2023-11-06T18:33:24Z)
- Object-Centric Relational Representations for Image Generation [18.069747511100132]
This paper explores a novel method to condition image generation based on object-centric relational representations.
We show that such architectural biases entail properties that facilitate the manipulation and conditioning of the generative process.
We also propose a novel benchmark for image generation consisting of a synthetic dataset of images paired with their relational representation.
arXiv Detail & Related papers (2023-03-26T11:17:17Z)
- Transformer-based Image Generation from Scene Graphs [11.443097632746763]
Graph-structured scene descriptions can be efficiently used in generative models to control the composition of the generated image.
Previous approaches are based on the combination of graph convolutional networks and adversarial methods for layout prediction and image generation.
We show how employing multi-head attention to encode the graph information can improve the quality of the sampled data.
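
One plausible reading, sketched below with hypothetical names: replace the graph-convolution stack with multi-head self-attention over node embeddings, masked by the graph's adjacency so each object attends only to objects it is related to.

```python
import torch
import torch.nn as nn

class GraphAttentionEncoder(nn.Module):
    def __init__(self, dim=256, heads=8, num_layers=2):
        super().__init__()
        self.heads = heads
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers)

    def forward(self, node_feats, adj):
        # node_feats: (B, N, dim); adj: (B, N, N) bool, True where an edge exists.
        eye = torch.eye(adj.size(1), dtype=torch.bool, device=adj.device)
        adj = adj | eye  # self-loops keep every attention row non-empty
        # PyTorch expects (B*heads, N, N) with True marking *blocked* positions.
        mask = (~adj).repeat_interleave(self.heads, dim=0)
        return self.enc(node_feats, mask=mask)
```

Compared with message passing, masked attention lets distant nodes exchange information in fewer layers, which is one common motivation for attention-based graph encoders.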
arXiv Detail & Related papers (2023-03-08T14:54:51Z)
- Iterative Scene Graph Generation [55.893695946885174]
Scene graph generation involves identifying object entities and their corresponding interaction predicates in a given image (or video).
Existing approaches to scene graph generation assume a certain factorization of the joint distribution to make the estimation feasible.
We propose a novel framework that addresses this limitation, as well as introduces dynamic conditioning on the image.
arXiv Detail & Related papers (2022-07-27T10:37:29Z)
- ReFormer: The Relational Transformer for Image Captioning [12.184772369145014]
Image captioning has been shown to achieve better performance when scene graphs are used to represent the relations between objects in the image.
We propose a novel architecture ReFormer to generate features with relation information embedded.
Our model significantly outperforms state-of-the-art methods on image captioning and scene graph generation.
arXiv Detail & Related papers (2021-07-29T17:03:36Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Semantic Image Manipulation Using Scene Graphs [105.03614132953285]
We introduce a spatio-semantic scene graph network that does not require direct supervision for constellation changes or image edits.
This makes it possible to train the system from existing real-world datasets with no additional annotation effort.
arXiv Detail & Related papers (2020-04-07T20:02:49Z)
- Fine-grained Image-to-Image Transformation towards Visual Recognition [102.51124181873101]
We aim at transforming an image with a fine-grained category to synthesize new images that preserve the identity of the input image.
We adopt a model based on generative adversarial networks to disentangle the identity related and unrelated factors of an image.
Experiments on the CompCars and Multi-PIE datasets demonstrate that our model preserves the identity of the generated images much better than the state-of-the-art image-to-image transformation models.
arXiv Detail & Related papers (2020-01-12T05:26:47Z)