Structured Sparse R-CNN for Direct Scene Graph Generation
- URL: http://arxiv.org/abs/2106.10815v1
- Date: Mon, 21 Jun 2021 02:24:20 GMT
- Title: Structured Sparse R-CNN for Direct Scene Graph Generation
- Authors: Yao Teng, Limin Wang
- Abstract summary: This paper presents a simple, sparse, and unified framework for relation detection, termed Structured Sparse R-CNN.
The key to our method is a set of learnable triplet queries and structured triplet detectors which could be optimized jointly from the training set in an end-to-end manner.
We perform experiments on two benchmarks, Visual Genome and Open Images, and the results demonstrate that our method achieves state-of-the-art performance.
- Score: 16.646937866282922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene graph generation (SGG) aims to detect entity pairs and their relations
in an image. Existing SGG approaches often use multi-stage pipelines to
decompose this task into object detection, relation graph construction, and
dense or dense-to-sparse relation prediction. Instead, treating SGG as a
direct set prediction problem, this paper presents a simple, sparse, and
unified framework for relation detection, termed Structured Sparse R-CNN.
The key to our method is a set of learnable triplet queries and structured
triplet detectors which could be jointly optimized from the training set in an
end-to-end manner. Specifically, the triplet queries encode the general prior
for entity pair locations, categories, and their relations, and provide an
initial guess of relation detection for subsequent refinement. The triplet
detector presents a cascaded dynamic head design to progressively refine the
results of relation detection. In addition, to relieve the training difficulty
of Structured Sparse R-CNN, we propose a relaxed and enhanced training strategy
based on knowledge distillation from a Siamese Sparse R-CNN. We also propose
an adaptive focusing parameter and an average logit approach to handle the
imbalanced data distribution. We perform experiments on two benchmarks, Visual
Genome and Open Images, and the results demonstrate that our method achieves
state-of-the-art performance. Meanwhile, we perform in-depth ablation studies
to provide insights on our structured modeling in triplet detector design and
training strategies.
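As a rough illustration of what a "focusing parameter" controls, the sketch below implements the standard focal loss, in which an exponent gamma down-weights samples the model already classifies confidently. It assumes that the abstract's focusing parameter refers to this exponent. The abstract does not say how the parameter is adapted, so gamma is simply passed in here, and the function name `focal_loss` is a hypothetical choice for this example.

```python
import math


def focal_loss(p_true: float, gamma: float) -> float:
    """Standard focal loss for one sample.

    p_true is the probability the model assigns to the correct class.
    gamma is the focusing parameter; gamma = 0 gives plain cross-entropy.
    How the paper adapts gamma is not stated in the abstract (assumption:
    it is supplied by the caller).
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must be in (0, 1]")
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


if __name__ == "__main__":
    easy, hard = 0.9, 0.1  # confident vs. poorly classified sample
    for gamma in (0.0, 1.0, 2.0):
        # The hard/easy loss ratio grows with gamma, so training
        # concentrates on the poorly classified samples.
        ratio = focal_loss(hard, gamma) / focal_loss(easy, gamma)
        print(f"gamma={gamma}: hard/easy loss ratio = {ratio:.1f}")
```

A larger gamma shifts more of the total loss onto poorly classified samples, which is why such a parameter is a natural knob for long-tailed relation categories.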
Related papers
- GraphRelate3D: Context-Dependent 3D Object Detection with Inter-Object Relationship Graphs [13.071451453118783]
We introduce an object relation module, consisting of a graph generator and a graph neural network (GNN) to learn the spatial information from certain patterns to improve 3D object detection.
Our approach improves upon the baseline PV-RCNN on the KITTI validation set for the car class across easy, moderate, and hard difficulty levels by 0.82%, 0.74%, and 0.58%, respectively.
arXiv Detail & Related papers (2024-05-10T19:18:02Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S^2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- Iterative Graph Filtering Network for 3D Human Pose Estimation [5.177947445379688]
Graph convolutional networks (GCNs) have proven to be an effective approach for 3D human pose estimation.
In this paper, we introduce an iterative graph filtering framework for 3D human pose estimation.
Our approach builds upon the idea of iteratively solving graph filtering with Laplacian regularization.
arXiv Detail & Related papers (2023-07-29T20:46:44Z)
- Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation [62.96628432641806]
Scene Graph Generation aims to first encode the visual contents within the given image and then parse them into a compact summary graph.
We first present a novel Stacked Hybrid-Attention network, which facilitates the intra-modal refinement as well as the inter-modal interaction.
We then devise an innovative Group Collaborative Learning strategy to optimize the decoder.
arXiv Detail & Related papers (2022-03-18T09:14:13Z)
- Relation Regularized Scene Graph Generation [206.76762860019065]
Scene graph generation (SGG) is built on top of detected objects to predict object pairwise visual relations.
We propose a relation regularized network (R2-Net) which can predict whether there is a relationship between two objects.
Our R2-Net can effectively refine object labels and generate scene graphs.
arXiv Detail & Related papers (2022-02-22T11:36:49Z)
- Target Adaptive Context Aggregation for Video Scene Graph Generation [36.669700084337045]
This paper deals with the challenging task of video scene graph generation (VidSGG).
We present a new detect-to-track paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking.
arXiv Detail & Related papers (2021-08-18T12:46:28Z)
- PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning [109.84770951839289]
We present PredRNN, a new recurrent network for learning visual dynamics from historical context.
We show that our approach obtains highly competitive results on three standard datasets.
arXiv Detail & Related papers (2021-03-17T08:28:30Z)
- Tensor Composition Net for Visual Relationship Prediction [115.14829858763399]
We present a novel Tensor Composition Network (TCN) to predict visual relationships in images.
The key idea of our TCN is to exploit the low rank property of the visual relationship tensor.
We show our TCN's image-level visual relationship prediction provides a simple and efficient mechanism for relation-based image retrieval.
arXiv Detail & Related papers (2020-12-10T06:27:20Z)
- Adaptive Graph Convolutional Network with Attention Graph Clustering for Co-saliency Detection [35.23956785670788]
We present a novel adaptive graph convolutional network with attention graph clustering (GCAGC).
We develop an attention graph clustering algorithm to discriminate the common objects from all the salient foreground objects in an unsupervised fashion.
We evaluate our proposed GCAGC method on three co-saliency detection benchmark datasets.
arXiv Detail & Related papers (2020-03-13T09:35:59Z)
- Learning to Hash with Graph Neural Networks for Recommender Systems [103.82479899868191]
Graph representation learning has attracted much attention in supporting high quality candidate search at scale.
Despite its effectiveness in learning embedding vectors for objects in the user-item interaction network, the computational costs to infer users' preferences in continuous embedding space are tremendous.
We propose a simple yet effective discrete representation learning framework to jointly learn continuous and discrete codes.
arXiv Detail & Related papers (2020-03-04T06:59:56Z)