Relation Transformer Network
- URL: http://arxiv.org/abs/2004.06193v2
- Date: Tue, 20 Jul 2021 21:10:56 GMT
- Title: Relation Transformer Network
- Authors: Rajat Koner, Suprosanna Shit and Volker Tresp
- Abstract summary: We propose a novel transformer formulation for scene graph generation and relation prediction.
We leverage the encoder-decoder architecture of the transformer for rich feature embedding of nodes and edges.
Our relation prediction module classifies the directed relation from the learned node and edge embeddings.
- Score: 25.141472361426818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The extraction of a scene graph with objects as nodes and mutual
relationships as edges is the basis for a deep understanding of image content.
Despite recent advances, such as message passing and joint classification, the
detection of visual relationships remains a challenging task due to sub-optimal
exploration of the mutual interaction among the visual objects. In this work,
we propose a novel transformer formulation for scene graph generation and
relation prediction. We leverage the encoder-decoder architecture of the
transformer for rich feature embedding of nodes and edges. Specifically, we
model the node-to-node interaction with the self-attention of the transformer
encoder and the edge-to-node interaction with the cross-attention of the
transformer decoder. Further, we introduce a novel positional embedding
suitable to handle edges in the decoder. Finally, our relation prediction
module classifies the directed relation from the learned node and edge
embeddings. We name this architecture the Relation Transformer Network (RTN).
On the Visual Genome and GQA datasets, we achieve an overall mean improvement
of 4.85 and 3.1 percentage points, respectively, over state-of-the-art methods.
Our experiments show that the Relation Transformer can efficiently model
context across datasets with small-, medium-, and large-scale relation
classification.
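As a reading aid, below is a minimal PyTorch sketch of the flow the abstract describes: transformer-encoder self-attention contextualizes object nodes, decoder cross-attention builds edge embeddings for candidate subject-object pairs, and a relation head classifies the directed predicate from the node and edge embeddings. The module names, dimensions, the way edge queries and their positional embeddings are formed, and the relation count are illustrative assumptions, not the authors' implementation.

# Minimal sketch of an RTN-style flow as described in the abstract.
# Assumptions (not from the paper): edge queries are built from subject/object
# node features, the edge positional embedding has two slots, and num_relations=51.
import torch
import torch.nn as nn

class RelationTransformerSketch(nn.Module):
    def __init__(self, d_model=256, num_heads=8, num_layers=3, num_relations=51):
        super().__init__()
        # Node-to-node interaction: self-attention in a transformer encoder.
        enc_layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.node_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Edge-to-node interaction: cross-attention in a transformer decoder.
        dec_layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.edge_decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # Hypothetical positional embedding for edges (subject slot / object slot).
        self.edge_pos = nn.Embedding(2, d_model)
        # Relation head: classify the directed relation from node and edge embeddings.
        self.rel_head = nn.Linear(3 * d_model, num_relations)

    def forward(self, node_feats, pair_idx):
        """node_feats: (B, N, d) object features; pair_idx: (B, E, 2) subject/object indices."""
        nodes = self.node_encoder(node_feats)  # contextualized node embeddings (B, N, d)
        d = nodes.size(-1)
        subj = torch.gather(nodes, 1, pair_idx[..., 0:1].expand(-1, -1, d))
        obj = torch.gather(nodes, 1, pair_idx[..., 1:2].expand(-1, -1, d))
        # Edge queries from subject/object features plus the edge positional embedding.
        edge_q = subj + self.edge_pos.weight[0] + obj + self.edge_pos.weight[1]
        edges = self.edge_decoder(edge_q, nodes)  # edges attend to all nodes (B, E, d)
        # Directed relation logits from the learned node and edge embeddings.
        return self.rel_head(torch.cat([subj, edges, obj], dim=-1))

# Example: 2 images, 5 detected objects each, 4 candidate subject-object pairs.
x = torch.randn(2, 5, 256)
pairs = torch.randint(0, 5, (2, 4, 2))
logits = RelationTransformerSketch()(x, pairs)  # (2, 4, num_relations)

In this sketch the cross-attention lets each candidate edge query attend to all contextualized nodes, which corresponds to the edge-to-node interaction the abstract refers to; the actual edge positional embedding and training setup of RTN are detailed in the paper.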
Related papers
- Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers [59.0181939916084]
Traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries.
We propose a novel Relational Priors Distillation (RPD) method to extract priors from the well-trained transformers on massive images.
Experiments on the PointDA-10 and the Sim-to-Real datasets verify that the proposed method consistently achieves the state-of-the-art performance of UDA for point cloud classification.
arXiv Detail & Related papers (2024-07-26T06:29:09Z)
- Graph as Point Set [31.448841287258116]
This paper introduces a novel graph-to-set conversion method that transforms interconnected nodes into a set of independent points.
It enables using set encoders to learn from graphs, thereby significantly expanding the design space of Graph Neural Networks.
To demonstrate the effectiveness of our approach, we introduce Point Set Transformer (PST), a transformer architecture that accepts a point set converted from a graph as input.
arXiv Detail & Related papers (2024-05-05T02:29:41Z)
- Graph Transformer GANs with Graph Masked Modeling for Architectural Layout Generation [153.92387500677023]
We present a novel graph Transformer generative adversarial network (GTGAN) to learn effective graph node relations.
The proposed graph Transformer encoder combines graph convolutions and self-attentions in a Transformer to model both local and global interactions.
We also propose a novel self-guided pre-training method for graph representation learning.
arXiv Detail & Related papers (2024-01-15T14:36:38Z)
- Transformer-based Image Generation from Scene Graphs [11.443097632746763]
Graph-structured scene descriptions can be efficiently used in generative models to control the composition of the generated image.
Previous approaches are based on the combination of graph convolutional networks and adversarial methods for layout prediction and image generation.
We show how employing multi-head attention to encode the graph information can improve the quality of the sampled data.
arXiv Detail & Related papers (2023-03-08T14:54:51Z)
- Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
- Graph Reasoning Transformer for Image Parsing [67.76633142645284]
We propose a novel Graph Reasoning Transformer (GReaT) for image parsing to enable image patches to interact following a relation reasoning pattern.
Compared to the conventional transformer, GReaT has higher interaction efficiency and a more purposeful interaction pattern.
Results show that GReaT achieves consistent performance gains with slight computational overheads on the state-of-the-art transformer baselines.
arXiv Detail & Related papers (2022-09-20T08:21:37Z)
- BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation [0.15469452301122172]
Scene graph generation (SGG) aims to identify the objects and their relationships.
We propose a bidirectional GRU (BiGRU) transformer network (BGT-Net) for scene graph generation for images.
This model implements novel object-object communication to enhance the object information using a BiGRU layer.
arXiv Detail & Related papers (2021-09-11T19:14:40Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges.
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
arXiv Detail & Related papers (2020-01-19T10:45:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.