RelTR: Relation Transformer for Scene Graph Generation
- URL: http://arxiv.org/abs/2201.11460v3
- Date: Fri, 14 Apr 2023 21:44:13 GMT
- Title: RelTR: Relation Transformer for Scene Graph Generation
- Authors: Yuren Cong, Michael Ying Yang, Bodo Rosenhahn
- Abstract summary: We propose an end-to-end scene graph generation model RelTR with an encoder-decoder architecture.
The model infers a fixed-size set of subject-predicate-object triplets using different types of attention mechanisms.
Experiments on the Visual Genome and Open Images V6 datasets demonstrate the superior performance and fast inference of our model.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Different objects in the same scene are more or less related to each other,
but only a limited number of these relationships are noteworthy. Inspired by
DETR, which excels in object detection, we view scene graph generation as a set
prediction problem and propose an end-to-end scene graph generation model RelTR
which has an encoder-decoder architecture. The encoder reasons about the visual
feature context while the decoder infers a fixed-size set of subject-predicate-object
triplets using different types of attention mechanisms with coupled subject and
object queries. We design a set prediction loss that performs matching between
ground-truth and predicted triplets for end-to-end training. In contrast to most
existing scene graph generation methods, RelTR is a one-stage method that predicts
a set of relationships directly, using only visual appearance, without combining
entities and labeling all possible predicates. Extensive experiments on the Visual
Genome and Open Images V6
datasets demonstrate the superior performance and fast inference of our model.
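As a concrete illustration of the set-prediction view described in the abstract, below is a minimal PyTorch sketch of decoding a fixed-size set of triplets from coupled subject and object queries. This is not the authors' implementation: the shared decoder, module names, and all dimensions (including the Visual Genome-style 150 object classes and 50 predicates) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TripletDecoderSketch(nn.Module):
    """Minimal sketch of set-based triplet decoding with coupled
    subject/object queries (illustrative, not the RelTR code)."""

    def __init__(self, d_model=256, num_triplets=200,
                 num_classes=150, num_predicates=50):
        super().__init__()
        # One learned subject query and one learned object query per
        # triplet slot; the coupling is simply the shared slot index.
        self.subject_queries = nn.Embedding(num_triplets, d_model)
        self.object_queries = nn.Embedding(num_triplets, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # Classification heads; the extra class is "no object"/"no relation".
        self.subject_cls = nn.Linear(d_model, num_classes + 1)
        self.object_cls = nn.Linear(d_model, num_classes + 1)
        self.predicate_cls = nn.Linear(2 * d_model, num_predicates + 1)

    def forward(self, memory):
        # memory: encoder output over image features, shape (B, HW, d_model).
        b = memory.size(0)
        sub_q = self.subject_queries.weight.unsqueeze(0).expand(b, -1, -1)
        obj_q = self.object_queries.weight.unsqueeze(0).expand(b, -1, -1)
        sub = self.decoder(sub_q, memory)  # (B, num_triplets, d_model)
        obj = self.decoder(obj_q, memory)  # (B, num_triplets, d_model)
        return {
            "subject_logits": self.subject_cls(sub),
            "object_logits": self.object_cls(obj),
            "predicate_logits": self.predicate_cls(torch.cat([sub, obj], -1)),
        }
```

At training time, each of the `num_triplets` slots is matched one-to-one against a ground-truth triplet (or the "no relation" label) by the set prediction loss, as in DETR (see the matching sketch at the end of this list).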
Related papers
- Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge [7.28830964611216]
This work introduces an enhanced approach to generating scene graphs by incorporating both a relationship hierarchy and commonsense knowledge.
We implement a robust commonsense validation pipeline that harnesses foundation models to critique the results from the scene graph prediction system.
Experiments on Visual Genome and OpenImage V6 datasets demonstrate that the proposed modules can be seamlessly integrated as plug-and-play enhancements to existing scene graph generation algorithms.
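The validation step described in this entry could look roughly like the sketch below; `is_plausible` is a hypothetical stand-in for querying a foundation model, and the prompt wording is an assumption, not the paper's pipeline.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)

def filter_with_commonsense(
    triplets: List[Triplet],
    is_plausible: Callable[[str], bool],
) -> List[Triplet]:
    """Keep only the triplets that a commonsense critic accepts.

    `is_plausible` stands in for a foundation-model call, e.g. asking
    "Is '<subject> <predicate> <object>' plausible?" and parsing a
    yes/no answer; any validator with this signature works.
    """
    return [(s, p, o) for s, p, o in triplets
            if is_plausible(f"{s} {p} {o}")]

# Example with a trivial rule-based critic instead of a real model:
triplets = [("man", "riding", "horse"), ("horse", "riding", "man")]
print(filter_with_commonsense(triplets,
                              lambda t: t != "horse riding man"))
```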
arXiv Detail & Related papers (2023-11-21T06:03:20Z)
- Single-Stage Visual Relationship Learning using Conditional Queries [60.90880759475021]
TraCQ is a new formulation for scene graph generation that avoids the multi-task learning problem and the combinatorial entity pair distribution.
We employ a DETR-based encoder-decoder with conditional queries to significantly reduce the entity label space as well.
Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods but also beats many state-of-the-art two-stage methods on the Visual Genome dataset.
arXiv Detail & Related papers (2023-06-09T06:02:01Z)
- CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z)
- Self-Supervised Relation Alignment for Scene Graph Generation [44.3983804479146]
We introduce a self-supervised relational alignment regularization to improve scene graph generation performance.
The proposed alignment is general and can be combined with any existing scene graph generation framework.
We illustrate the effectiveness of this self-supervised relational alignment in conjunction with two scene graph generation architectures.
arXiv Detail & Related papers (2023-02-02T20:34:13Z)
- SrTR: Self-reasoning Transformer with Visual-linguistic Knowledge for Scene Graph Generation [12.977857322594206]
One-stage scene graph generation approaches infer the effective relation between entity pairs using sparse proposal sets and a few queries.
A Self-reasoning Transformer with Visual-linguistic Knowledge (SrTR) is proposed to add flexible self-reasoning ability to the model.
Inspired by large-scale pre-trained image-text foundation models, visual-linguistic prior knowledge is introduced.
arXiv Detail & Related papers (2022-12-19T09:47:27Z)
- Iterative Scene Graph Generation with Generative Transformers [6.243995448840211]
Scene graphs provide a rich, structured representation of a scene by encoding the entities (objects) and their spatial relationships in a graphical format.
Current approaches follow a generation-by-classification paradigm, where the scene graph is generated by labeling all possible edges between objects in a scene.
This work introduces a generative transformer-based approach to generating scene graphs beyond link prediction.
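For contrast with this generative formulation, the generation-by-classification baseline described above can be sketched as exhaustive predicate scoring over all ordered object pairs. The head, feature dimensions, and class counts below are illustrative assumptions, not any specific paper's model.

```python
import torch
import torch.nn as nn

def classify_all_edges(obj_feats: torch.Tensor,
                       predicate_head: nn.Module) -> torch.Tensor:
    """Generation-by-classification sketch: score every ordered object
    pair against all predicate classes plus a "no relation" class.

    obj_feats: (N, D) features of N detected objects.
    predicate_head: maps concatenated pair features (..., 2D) to
        (..., P + 1) logits.
    Returns (N, N, P + 1) logits for all N*N directed edges; in
    practice the diagonal (self-pairs) is masked out.
    """
    n, d = obj_feats.shape
    subjects = obj_feats.unsqueeze(1).expand(n, n, d)
    objects = obj_feats.unsqueeze(0).expand(n, n, d)
    pairs = torch.cat([subjects, objects], dim=-1)  # (N, N, 2D)
    return predicate_head(pairs)  # cost grows as O(N^2)

# Example: 10 objects, 256-d features, 50 predicates (VG-style).
head = nn.Linear(2 * 256, 50 + 1)
logits = classify_all_edges(torch.randn(10, 256), head)
print(logits.shape)  # torch.Size([10, 10, 51])
```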
arXiv Detail & Related papers (2022-11-30T00:05:44Z)
- Scene Graph Modification as Incremental Structure Expanding [61.84291817776118]
We focus on scene graph modification (SGM), where the system is required to learn how to update an existing scene graph based on a natural language query.
We frame SGM as a graph expansion task by introducing incremental structure expanding (ISE).
We construct a challenging dataset that contains more complicated queries and larger scene graphs than existing datasets.
arXiv Detail & Related papers (2022-09-15T16:26:14Z)
- Iterative Scene Graph Generation [55.893695946885174]
Scene graph generation involves identifying object entities and their corresponding interaction predicates in a given image (or video).
Existing approaches to scene graph generation assume certain factorization of the joint distribution to make the estimation feasible.
We propose a novel framework that addresses this limitation, as well as introduces dynamic conditioning on the image.
arXiv Detail & Related papers (2022-07-27T10:37:29Z)
- Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs [85.54212143154986]
Controllable scene synthesis consists of generating 3D information that satisfies underlying specifications.
Scene graphs are representations of a scene composed of objects (nodes) and inter-object relationships (edges).
We propose the first work that directly generates shapes from a scene graph in an end-to-end manner.
arXiv Detail & Related papers (2021-08-19T17:59:07Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
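The set-based global loss mentioned above relies on a one-to-one bipartite matching between prediction slots and ground-truth objects. Below is a minimal sketch of that matching using SciPy's Hungarian solver with a classification-only cost; DETR's full matching cost also includes L1 and generalized-IoU box terms, omitted here.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits: torch.Tensor, tgt_labels: torch.Tensor):
    """One-to-one matching of prediction slots to ground-truth objects.

    pred_logits: (num_queries, num_classes + 1) class logits; the extra
        class is "no object", toward which unmatched slots are trained.
    tgt_labels: (num_targets,) ground-truth class indices.
    Returns (pred_idx, tgt_idx) index arrays of the optimal assignment.
    """
    probs = pred_logits.softmax(-1)       # (Q, C + 1)
    cost = -probs[:, tgt_labels]          # (Q, T): -p(target class)
    pred_idx, tgt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, tgt_idx

# Example: 100 query slots matched against 3 ground-truth labels.
pred_idx, tgt_idx = hungarian_match(torch.randn(100, 92),
                                    torch.tensor([17, 17, 54]))
print(list(zip(pred_idx, tgt_idx)))  # 3 matched (slot, target) pairs
```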
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.