Iterative Scene Graph Generation with Generative Transformers
- URL: http://arxiv.org/abs/2211.16636v1
- Date: Wed, 30 Nov 2022 00:05:44 GMT
- Title: Iterative Scene Graph Generation with Generative Transformers
- Authors: Sanjoy Kundu and Sathyanarayanan N. Aakur
- Abstract summary: Scene graphs provide a rich, structured representation of a scene by encoding the entities (objects) and their spatial relationships in a graphical format.
Current approaches take a generation-by-classification approach where the scene graph is generated through labeling of all possible edges between objects in a scene.
This work introduces a generative transformer-based approach to generating scene graphs beyond link prediction.
- Score: 6.243995448840211
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene graphs provide a rich, structured representation of a scene by encoding
the entities (objects) and their spatial relationships in a graphical format.
This representation has proven useful in several tasks, such as question
answering, captioning, and even object detection, to name a few. Current
approaches take a generation-by-classification approach where the scene graph
is generated through labeling of all possible edges between objects in a scene,
which adds computational overhead to the approach. This work introduces a
generative transformer-based approach to generating scene graphs beyond link
prediction. Using two transformer-based components, we first sample a possible
scene graph structure from detected objects and their visual features. We then
perform predicate classification on the sampled edges to generate the final
scene graph. This approach allows us to efficiently generate scene graphs from
images with minimal inference overhead. Extensive experiments on the Visual
Genome dataset demonstrate the efficiency of the proposed approach. Without
bells and whistles, we obtain, on average, 20.7% mean recall (mR@100) across
different settings for scene graph generation (SGG), outperforming
state-of-the-art SGG approaches while offering competitive performance to
unbiased SGG approaches.
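To make the two-stage design in the abstract concrete, below is a minimal sketch of a pipeline of that shape: a transformer over detected object features scores and samples a sparse edge structure, and a predicate classifier labels only the sampled edges. All names and hyperparameters here (TwoStageSGG, edges_per_image, etc.) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (an assumption, not the authors' code) of a two-stage
# generative SGG pipeline: sample a sparse edge structure from detected
# objects, then classify predicates only on the sampled edges.
import torch
import torch.nn as nn

class TwoStageSGG(nn.Module):
    def __init__(self, feat_dim=256, num_predicates=50, edges_per_image=32):
        super().__init__()
        self.edges_per_image = edges_per_image
        # Stage 1: transformer over object features, then a pairwise edge scorer.
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                               batch_first=True)
        self.structure_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.edge_scorer = nn.Linear(2 * feat_dim, 1)
        # Stage 2: predicate classifier on (subject, object) feature pairs.
        self.predicate_head = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_predicates))

    def forward(self, obj_feats):
        # obj_feats: (num_objects, feat_dim) visual features from a detector.
        h = self.structure_encoder(obj_feats.unsqueeze(0)).squeeze(0)
        n = h.size(0)
        # Score every directed pair once, then keep only the top-k edges, so
        # predicate classification never touches the full O(n^2) edge set.
        pairs = torch.cat([h.repeat_interleave(n, 0), h.repeat(n, 1)], dim=-1)
        scores = self.edge_scorer(pairs).squeeze(-1)
        k = min(self.edges_per_image, n * n)
        top = scores.topk(k).indices
        subj, obj = top // n, top % n
        logits = self.predicate_head(pairs[top])
        return subj, obj, logits  # sparse edges + predicate logits

# Usage: features for 10 detected objects.
model = TwoStageSGG()
s, o, p = model(torch.randn(10, 256))
print(s.shape, o.shape, p.shape)  # torch.Size([32]) ... torch.Size([32, 50])
```

Restricting predicate classification to the sampled top-k edges is what keeps inference overhead low compared to labeling all possible object pairs.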
Related papers
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing [66.70054075041487]
Existing parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs are inconsistent, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
- Location-Free Scene Graph Generation [45.366540803729386]
Scene Graph Generation (SGG) is a visual understanding task, aiming to describe a scene as a graph of entities and their relationships with each other.
Existing works rely on location labels in the form of bounding boxes or segmentation masks, increasing annotation costs and limiting dataset expansion.
We break this dependency and introduce location-free scene graph generation (LF-SGG).
This new task aims at predicting instances of entities, as well as their relationships, without explicitly computing their spatial localization.
arXiv Detail & Related papers (2023-03-20T08:57:45Z)
- Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training [112.94542676251133]
We propose to learn scene graph embeddings by directly optimizing their alignment with images.
Specifically, we pre-train an encoder to extract both global and local information from scene graphs.
The resulting method, called SGDiff, allows for the semantic manipulation of generated images by modifying scene graph nodes and connections.
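The alignment objective described here can be illustrated with a short, hedged sketch: an InfoNCE-style contrastive loss that pulls matching scene-graph and image embeddings together. The encoders producing the embeddings are stand-in stubs, and the function name and temperature are assumptions, not SGDiff's actual implementation.

```python
# Hedged sketch of contrastive scene-graph/image alignment; the embeddings
# would come from the graph and image encoders the summary mentions.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(graph_emb, image_emb, temperature=0.07):
    """InfoNCE-style loss: the i-th scene graph should match the i-th image."""
    g = F.normalize(graph_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = g @ v.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(g.size(0))     # positives lie on the diagonal
    # Symmetric loss over graph->image and image->graph retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with stand-in embeddings for a batch of 8 graph/image pairs.
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```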
arXiv Detail & Related papers (2022-11-21T01:11:19Z)
- Iterative Scene Graph Generation [55.893695946885174]
Scene graph generation involves identifying object entities and their corresponding interaction predicates in a given image (or video).
Existing approaches to scene graph generation assume a certain factorization of the joint distribution to make estimation feasible.
We propose a novel framework that addresses this limitation, as well as introduces dynamic conditioning on the image.
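As an illustration of the kind of factorization being referenced (an assumption for exposition, not necessarily the exact one the paper critiques), a typical two-step model predicts entity labels E from the image I first, then pairwise relations R conditioned on entity pairs:

```latex
% Illustrative factorization: entities E first, then pairwise relations R,
% both conditioned on the image I.
P(G \mid I) = P(E \mid I)\, P(R \mid E, I)
            = \prod_{i} P(e_i \mid I)\; \prod_{j \neq k} P(r_{jk} \mid e_j, e_k, I)
```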
arXiv Detail & Related papers (2022-07-27T10:37:29Z)
- Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learns from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs.
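The pseudo-labeling step can be sketched in a few lines: match caption-parsed (subject, predicate, object) triples to detector outputs by class name. The data structures and function name below are illustrative assumptions, not the paper's actual code.

```python
# Hedged sketch of the pseudo-labeling step the summary describes:
# ground caption triples in detected regions by matching class names.

def make_pseudo_labels(detections, caption_triples):
    """detections: list of (region_id, class_name) from an off-the-shelf detector.
    caption_triples: (subject, predicate, object) tuples parsed from captions.
    Returns (subject_region, predicate, object_region) pseudo labels."""
    regions_by_class = {}
    for region_id, cls in detections:
        regions_by_class.setdefault(cls, []).append(region_id)
    pseudo = []
    for subj, pred, obj in caption_triples:
        # Keep a triple only when both concepts are grounded in some region.
        for s in regions_by_class.get(subj, []):
            for o in regions_by_class.get(obj, []):
                pseudo.append((s, pred, o))
    return pseudo

dets = [(0, "man"), (1, "horse"), (2, "hat")]
triples = [("man", "riding", "horse"), ("man", "wearing", "hat")]
print(make_pseudo_labels(dets, triples))
# [(0, 'riding', 1), (0, 'wearing', 2)]
```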
arXiv Detail & Related papers (2021-09-06T03:38:52Z)
- Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs [85.54212143154986]
Controllable scene synthesis consists of generating 3D information that satisfies underlying specifications.
Scene graphs are representations of a scene composed of objects (nodes) and inter-object relationships (edges).
We propose the first work that directly generates shapes from a scene graph in an end-to-end manner.
arXiv Detail & Related papers (2021-08-19T17:59:07Z)
- Unconditional Scene Graph Generation [72.53624470737712]
We develop a deep auto-regressive model called SceneGraphGen which can learn the probability distribution over labelled and directed graphs.
We show that the scene graphs generated by SceneGraphGen are diverse and follow the semantic patterns of real-world scenes.
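The auto-regressive generation loop can be sketched in the style of GraphRNN-like models: emit one node at a time, then decide labelled, directed edges to every earlier node. This toy uses uniform random choices where SceneGraphGen would use learned conditionals; the vocabularies and names are assumptions for illustration.

```python
# Hedged, toy sketch of auto-regressive generation over labelled directed
# graphs; random choices stand in for the paper's learned distributions.
import random

OBJECTS = ["person", "tree", "car"]
PREDICATES = ["near", "behind", "on"]

def sample_graph(max_nodes=5, edge_prob=0.3, seed=0):
    rng = random.Random(seed)
    nodes, edges = [], []
    for i in range(max_nodes):
        # Step 1: emit the next node label (a learned conditional in the paper).
        nodes.append(rng.choice(OBJECTS))
        # Step 2: decide directed, labelled edges to every earlier node.
        for j in range(i):
            if rng.random() < edge_prob:
                edges.append((i, rng.choice(PREDICATES), j))
            if rng.random() < edge_prob:
                edges.append((j, rng.choice(PREDICATES), i))
    return nodes, edges

print(sample_graph())
```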
arXiv Detail & Related papers (2021-08-12T17:57:16Z)
- Segmentation-grounded Scene Graph Generation [47.34166260639392]
We propose a framework for pixel-level segmentation-grounded scene graph generation.
Our framework is agnostic to the underlying scene graph generation method.
It is learned in a multi-task manner with both target and auxiliary datasets.
arXiv Detail & Related papers (2021-04-29T08:54:08Z)
- Fully Convolutional Scene Graph Generation [30.194961716870186]
This paper presents a fully convolutional scene graph generation (FCSGG) model that detects objects and relations simultaneously.
FCSGG encodes objects as bounding box center points, and relationships as 2D vector fields called Relation Affinity Fields (RAFs).
FCSGG achieves highly competitive results on recall and zero-shot recall with significantly reduced inference time.
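To show how a 2D relation field can score an object pair, here is a hedged sketch in the spirit of RAFs (analogous to part affinity fields): sample the field along the segment between two center points and measure agreement with the subject-to-object direction. The sampling scheme and names are illustrative assumptions, not FCSGG's exact decoding.

```python
# Hedged sketch of scoring a subject->object pair against a 2D relation
# affinity field for one predicate class.
import numpy as np

def raf_pair_score(field, subj_center, obj_center, num_samples=10):
    """field: (H, W, 2) array of unit vectors in (dy, dx) order.
    Scores how well the field points from subject center to object center."""
    subj = np.asarray(subj_center, dtype=float)
    obj = np.asarray(obj_center, dtype=float)
    direction = obj - subj
    norm = np.linalg.norm(direction)
    if norm == 0:
        return 0.0
    direction /= norm
    # Average the field's agreement with the direction at points sampled
    # along the segment between the two centers.
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        y, x = np.round(subj + t * (obj - subj)).astype(int)
        score += field[y, x] @ direction
    return score / num_samples

# Toy field that points uniformly to the right: (dy, dx) = (0, 1).
field = np.zeros((32, 32, 2)); field[..., 1] = 1.0
print(raf_pair_score(field, (16, 4), (16, 28)))  # ~1.0 for a rightward pair
```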
arXiv Detail & Related papers (2021-03-30T05:25:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.