Transformer-based Image Generation from Scene Graphs
- URL: http://arxiv.org/abs/2303.04634v1
- Date: Wed, 8 Mar 2023 14:54:51 GMT
- Title: Transformer-based Image Generation from Scene Graphs
- Authors: Renato Sortino, Simone Palazzo, Concetto Spampinato
- Abstract summary: Graph-structured scene descriptions can be efficiently used in generative models to control the composition of the generated image.
Previous approaches are based on the combination of graph convolutional networks and adversarial methods for layout prediction and image generation.
We show how employing multi-head attention to encode the graph information can improve the quality of the sampled data.
- Score: 11.443097632746763
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Graph-structured scene descriptions can be efficiently used in generative
models to control the composition of the generated image. Previous approaches
are based on the combination of graph convolutional networks and adversarial
methods for layout prediction and image generation, respectively. In this work,
we show how employing multi-head attention to encode the graph information, as
well as using a transformer-based model in the latent space for image
generation, can improve the quality of the sampled data without the need for
adversarial models, with a consequent advantage in terms of training
stability. Specifically, the proposed approach is entirely based on
transformer architectures both for encoding scene graphs into intermediate
object layouts and for decoding these layouts into images, passing through a
lower dimensional space learned by a vector-quantized variational autoencoder.
Our approach shows improved image quality with respect to state-of-the-art
methods as well as a higher degree of diversity among multiple generations from
the same scene graph. We evaluate our approach on three public datasets: Visual
Genome, COCO, and CLEVR. We achieve an Inception Score of 13.7 and 12.8, and an
FID of 52.3 and 60.3, on COCO and Visual Genome, respectively. We perform
ablation studies on our contributions to assess the impact of each component.
Code is available at https://github.com/perceivelab/trf-sg2im
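The abstract describes a two-stage, fully transformer-based pipeline: a graph encoder built on multi-head attention predicts intermediate object layouts, and an autoregressive transformer operating in a VQ-VAE latent space turns those layouts into images. The following is a minimal PyTorch sketch of that structure; module names, dimensions, and conditioning details are illustrative assumptions rather than the released implementation (see the repository linked above for the actual code).
```python
# Minimal sketch of the two-stage pipeline described in the abstract:
# (1) a transformer encodes the scene graph and predicts object layouts,
# (2) an autoregressive transformer predicts VQ-VAE code indices conditioned
#     on those layouts; a pretrained VQ-VAE decoder then maps codes to pixels.
# Module names, dimensions, and conditioning details are illustrative assumptions.
import torch
import torch.nn as nn

class SceneGraphToLayout(nn.Module):
    """Encodes object and relation tokens with multi-head attention and predicts
    a coarse layout (normalized bounding box) for each object."""
    def __init__(self, num_classes, num_predicates, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.obj_emb = nn.Embedding(num_classes, d_model)
        self.rel_emb = nn.Embedding(num_predicates, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.box_head = nn.Linear(d_model, 4)  # (x, y, w, h) per object

    def forward(self, obj_ids, rel_ids):
        # One sequence containing both object and relation tokens.
        tokens = torch.cat([self.obj_emb(obj_ids), self.rel_emb(rel_ids)], dim=1)
        hidden = self.encoder(tokens)
        obj_hidden = hidden[:, : obj_ids.size(1)]        # keep object positions only
        return torch.sigmoid(self.box_head(obj_hidden))  # normalized boxes

class LayoutToImageCodes(nn.Module):
    """Autoregressive transformer over VQ-VAE code indices, conditioned on
    layout embeddings prepended to the code sequence."""
    def __init__(self, codebook_size=1024, d_model=256, n_heads=8, n_layers=8, max_len=512):
        super().__init__()
        self.code_emb = nn.Embedding(codebook_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.to_logits = nn.Linear(d_model, codebook_size)

    def forward(self, layout_context, code_ids):
        x = torch.cat([layout_context, self.code_emb(code_ids)], dim=1)
        x = x + self.pos_emb[:, : x.size(1)]
        # Causal mask so each code position only attends to earlier positions.
        causal = torch.triu(
            torch.full((x.size(1), x.size(1)), float("-inf"), device=x.device), diagonal=1
        )
        hidden = self.transformer(x, mask=causal)
        # Logits for the next code at each code position.
        return self.to_logits(hidden[:, layout_context.size(1):])
```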
Related papers
- Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation [44.457347230146404]
We leverage the scene graph, a powerful structured representation, for complex image generation.
We employ the generative capabilities of variational autoencoders and diffusion models in a generalizable manner.
Our method outperforms recent competitors based on text, layout, or scene graph.
arXiv Detail & Related papers (2024-10-01T07:02:46Z)
- Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs [0.0]
We introduce a novel approach to generate images from scene graphs.
We leverage pre-trained text-to-image diffusion models and CLIP guidance to translate graph knowledge into images.
Extensive experiments show that our method outperforms existing methods on standard benchmarks.
arXiv Detail & Related papers (2024-01-25T11:46:31Z)
- Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks [53.67497327319569]
We introduce a novel neural rendering technique to solve image-to-3D from a single view.
Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks.
Our experiments show the advantages of our proposed approach with consistent results and rapid generation.
arXiv Detail & Related papers (2023-12-24T08:42:37Z)
- SPAN: Learning Similarity between Scene Graphs and Images with Transformers [29.582313604112336]
We propose a Scene graPh-imAge coNtrastive learning framework, SPAN, that can measure the similarity between scene graphs and images.
We introduce a novel graph serialization technique that transforms a scene graph into a sequence with structural encodings.
arXiv Detail & Related papers (2023-04-02T18:13:36Z)
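SPAN's graph serialization (turning a scene graph into a sequence with structural encodings) can be illustrated with a minimal sketch that flattens (subject, predicate, object) triples into tokens and tags each token with its triple index and role; the encoding scheme in the paper may differ, so the function and tag names below are assumptions.
```python
# Minimal sketch of serializing a scene graph into a token sequence with
# simple structural tags, in the spirit of SPAN's graph serialization.
# The tag scheme and token layout are illustrative assumptions.
def serialize_scene_graph(triples):
    """triples: list of (subject, predicate, object) strings.
    Returns parallel lists of tokens and structural role tags."""
    tokens, roles = [], []
    for i, (subj, pred, obj) in enumerate(triples):
        for tok, role in ((subj, "subject"), (pred, "predicate"), (obj, "object")):
            tokens.append(tok)
            roles.append((i, role))  # (triple index, role) acts as a structural encoding
    return tokens, roles

tokens, roles = serialize_scene_graph([
    ("man", "riding", "horse"),
    ("horse", "on", "beach"),
])
# tokens -> ['man', 'riding', 'horse', 'horse', 'on', 'beach']
# roles  -> [(0, 'subject'), (0, 'predicate'), (0, 'object'),
#            (1, 'subject'), (1, 'predicate'), (1, 'object')]
```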
- Iterative Scene Graph Generation with Generative Transformers [6.243995448840211]
Scene graphs provide a rich, structured representation of a scene by encoding the entities (objects) and their spatial relationships in a graphical format.
Current methods take a generation-by-classification approach, where the scene graph is produced by labeling all possible edges between objects in a scene.
This work introduces a generative transformer-based approach to generating scene graphs beyond link prediction.
arXiv Detail & Related papers (2022-11-30T00:05:44Z)
- Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training [112.94542676251133]
We propose to learn scene graph embeddings by directly optimizing their alignment with images.
Specifically, we pre-train an encoder to extract both global and local information from scene graphs.
The resulting method, called SGDiff, allows for the semantic manipulation of generated images by modifying scene graph nodes and connections.
arXiv Detail & Related papers (2022-11-21T01:11:19Z)
- Iterative Scene Graph Generation [55.893695946885174]
Scene graph generation involves identifying object entities and their corresponding interaction predicates in a given image (or video).
Existing approaches to scene graph generation assume a certain factorization of the joint distribution to make the estimation feasible.
We propose a novel framework that addresses this limitation, as well as introduces dynamic conditioning on the image.
arXiv Detail & Related papers (2022-07-27T10:37:29Z)
- Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) to model object-to-object, object-to-patch, and patch-to-patch dependencies.
arXiv Detail & Related papers (2022-06-02T08:34:25Z)
- Graph-Based 3D Multi-Person Pose Estimation Using Multi-View Images [79.70127290464514]
We decompose the task into two stages, i.e. person localization and pose estimation.
We then propose three task-specific graph neural networks for effective message passing.
Our approach achieves state-of-the-art performance on CMU Panoptic and Shelf datasets.
arXiv Detail & Related papers (2021-09-13T11:44:07Z)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
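The trade-off noted above between dual encoders and cross-attention retrieval can be made concrete: with dual encoders, candidate embeddings are precomputed once and scoring reduces to a single matrix product, whereas a cross-attention model must jointly process every query-candidate pair. The sketch below assumes generic encoder callables; names and shapes are illustrative.
```python
# Sketch of why dual encoders scale better for retrieval than cross-attention:
# image embeddings can be precomputed once, and scoring is one matrix product,
# while a cross-attention model needs a joint forward pass per (query, image) pair.
# The encoders and the joint model are generic callables; shapes are assumptions.
import torch

def dual_encoder_scores(text_encoder, image_encoder, queries, images):
    q = text_encoder(queries)   # (num_queries, d) -- computed per incoming query
    v = image_encoder(images)   # (num_images, d)  -- precomputable offline
    return q @ v.T              # (num_queries, num_images) similarity matrix

def cross_attention_scores(joint_model, queries, images):
    scores = torch.empty(len(queries), len(images))
    for i, query in enumerate(queries):
        for j, image in enumerate(images):
            # One joint transformer pass per pair: accurate but expensive at scale.
            scores[i, j] = joint_model(query, image)
    return scores
```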
- Relation Transformer Network [25.141472361426818]
We propose a novel transformer formulation for scene graph generation and relation prediction.
We leverage the encoder-decoder architecture of the transformer for rich feature embedding of nodes and edges.
Our relation prediction module classifies the directed relation from the learned node and edge embedding.
arXiv Detail & Related papers (2020-04-13T20:47:01Z)
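The relation prediction step described above (classifying a directed predicate from learned node and edge embeddings) can be sketched as a small classification head over concatenated subject, object, and edge embeddings; the concatenation scheme and dimensions are assumptions, not the paper's exact design.
```python
# Minimal sketch of a relation prediction head that classifies the directed
# predicate from learned node and edge embeddings. The concatenation scheme
# and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    def __init__(self, d_model=256, num_predicates=51):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(3 * d_model, d_model),  # subject + object + edge embeddings
            nn.ReLU(),
            nn.Linear(d_model, num_predicates),
        )

    def forward(self, subj_emb, obj_emb, edge_emb):
        # Concatenation order encodes direction: (subject, object) != (object, subject).
        return self.classifier(torch.cat([subj_emb, obj_emb, edge_emb], dim=-1))
```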
This list is automatically generated from the titles and abstracts of the papers on this site.