Diffusion-Based Scene Graph to Image Generation with Masked Contrastive
Pre-Training
- URL: http://arxiv.org/abs/2211.11138v1
- Date: Mon, 21 Nov 2022 01:11:19 GMT
- Title: Diffusion-Based Scene Graph to Image Generation with Masked Contrastive
Pre-Training
- Authors: Ling Yang, Zhilin Huang, Yang Song, Shenda Hong, Guohao Li, Wentao
Zhang, Bin Cui, Bernard Ghanem, Ming-Hsuan Yang
- Abstract summary: We propose to learn scene graph embeddings by directly optimizing their alignment with images.
Specifically, we pre-train an encoder to extract both global and local information from scene graphs.
The resulting method, called SGDiff, allows for the semantic manipulation of generated images by modifying scene graph nodes and connections.
- Score: 112.94542676251133
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating images from graph-structured inputs, such as scene graphs, is
uniquely challenging due to the difficulty of aligning nodes and connections in
graphs with objects and their relations in images. Most existing methods
address this challenge by using scene layouts, which are image-like
representations of scene graphs designed to capture the coarse structures of
scene images. Because scene layouts are manually crafted, the alignment with
images may not be fully optimized, causing suboptimal compliance between the
generated images and the original scene graphs. To tackle this issue, we
propose to learn scene graph embeddings by directly optimizing their alignment
with images. Specifically, we pre-train an encoder to extract both global and
local information from scene graphs that are predictive of the corresponding
images, relying on two loss functions: masked autoencoding loss and contrastive
loss. The former trains embeddings by reconstructing randomly masked image
regions, while the latter trains embeddings to discriminate between compliant
and non-compliant images according to the scene graph. Given these embeddings,
we build a latent diffusion model to generate images from scene graphs. The
resulting method, called SGDiff, allows for the semantic manipulation of
generated images by modifying scene graph nodes and connections. On the Visual
Genome and COCO-Stuff datasets, we demonstrate that SGDiff outperforms
state-of-the-art methods, as measured by both the Inception Score and Fréchet
Inception Distance (FID) metrics. We will release our source code and trained
models at https://github.com/YangLing0818/SGDiff.
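The abstract describes a two-objective pre-training stage (masked autoencoding over image regions plus graph-image contrastive alignment) before conditioning a latent diffusion model on the learned embeddings. Below is a minimal, hypothetical sketch of how those two losses could be combined; the encoder stubs, dimensions, mean-pooling, and equal loss weighting are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphImagePretrainer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Stand-ins for the real encoders: SGDiff would use a scene-graph encoder
        # (over nodes and edges) and an image patch encoder here.
        self.graph_encoder = nn.Linear(dim, dim)
        self.patch_encoder = nn.Linear(dim, dim)
        self.decoder = nn.Linear(dim, dim)   # predicts features of masked patches
        self.temperature = 0.07

    def forward(self, node_feats, patch_feats, mask_ratio=0.5):
        # Global scene-graph embedding (mean pooling over node features, for simplicity).
        g = self.graph_encoder(node_feats).mean(dim=1)                  # (B, D)
        patches = self.patch_encoder(patch_feats)                       # (B, P, D)
        B, P, D = patches.shape

        # Masked autoencoding loss: randomly mask image patches and reconstruct
        # their features from the scene-graph embedding.
        mask = torch.rand(B, P, device=patches.device) < mask_ratio     # True = masked
        pred = self.decoder(g).unsqueeze(1).expand(B, P, D)
        recon_loss = F.mse_loss(pred[mask], patches.detach()[mask])

        # Contrastive loss: the matched graph/image pair is the positive,
        # other images in the batch serve as negatives (InfoNCE).
        img_global = patches.mean(dim=1)                                 # (B, D)
        logits = F.normalize(g, dim=-1) @ F.normalize(img_global, dim=-1).T
        logits = logits / self.temperature
        targets = torch.arange(B, device=logits.device)
        contrastive_loss = F.cross_entropy(logits, targets)

        return recon_loss + contrastive_loss


# Toy usage with random features standing in for scene-graph nodes and image patches.
model = GraphImagePretrainer(dim=256)
loss = model(torch.randn(4, 10, 256), torch.randn(4, 64, 256))
loss.backward()
```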
Related papers
- Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation [44.457347230146404]
We leverage the scene graph, a powerful structured representation, for complex image generation.
We employ the generative capabilities of variational autoencoders and diffusion models in a generalizable manner.
Our method outperforms recent competitors based on text, layout, or scene graph.
arXiv Detail & Related papers (2024-10-01T07:02:46Z) - Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs [0.0]
We introduce a novel approach to generate images from scene graphs.
We leverage pre-trained text-to-image diffusion models and CLIP guidance to translate graph knowledge into images.
Extensive experiments show that our method outperforms existing methods on standard benchmarks.
arXiv Detail & Related papers (2024-01-25T11:46:31Z) - FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph
Parsing [66.70054075041487]
Existing methods that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z) - SPAN: Learning Similarity between Scene Graphs and Images with Transformers [29.582313604112336]
We propose a Scene graPh-imAge coNtrastive learning framework, SPAN, that can measure the similarity between scene graphs and images.
We introduce a novel graph serialization technique that transforms a scene graph into a sequence with structural encodings.
arXiv Detail & Related papers (2023-04-02T18:13:36Z) - MIGS: Meta Image Generation from Scene Graphs [48.82382997154196]
We propose MIGS (Meta Image Generation from Scene Graphs), a meta-learning based approach for few-shot image generation from graphs.
By sampling the data in a task-driven fashion, we train the generator using meta-learning on different sets of tasks that are categorized based on the scene attributes.
Our results show that this meta-learning approach for generating images from scene graphs achieves state-of-the-art performance in terms of image quality and capturing the semantic relationships in the scene.
arXiv Detail & Related papers (2021-10-22T17:02:44Z) - Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs.
arXiv Detail & Related papers (2021-09-06T03:38:52Z) - Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using
Scene Graphs [85.54212143154986]
Controllable scene synthesis consists of generating 3D information that satisfies underlying specifications.
Scene graphs are representations of a scene composed of objects (nodes) and inter-object relationships (edges).
We propose the first work that directly generates shapes from a scene graph in an end-to-end manner.
arXiv Detail & Related papers (2021-08-19T17:59:07Z) - Unconditional Scene Graph Generation [72.53624470737712]
We develop a deep auto-regressive model called SceneGraphGen which can learn the probability distribution over labelled and directed graphs.
We show that the scene graphs generated by SceneGraphGen are diverse and follow the semantic patterns of real-world scenes.
arXiv Detail & Related papers (2021-08-12T17:57:16Z)