Graph Relation Transformer: Incorporating pairwise object features into
the Transformer architecture
- URL: http://arxiv.org/abs/2111.06075v1
- Date: Thu, 11 Nov 2021 06:55:28 GMT
- Title: Graph Relation Transformer: Incorporating pairwise object features into
the Transformer architecture
- Authors: Michael Yang, Aditya Anantharaman, Zachary Kitowski and Derik Clive
Robert
- Abstract summary: TextVQA is a dataset geared towards answering questions about visual objects and text objects in images.
One key challenge in TextVQA is the design of a system that effectively reasons not only about visual and text objects individually, but also about the spatial relationships between these objects.
We propose a Graph Relation Transformer (GRT) which uses edge information in addition to node information for graph attention computation in the Transformer.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous studies such as VizWiz find that Visual Question Answering (VQA)
systems that can read and reason about text in images are useful in application
areas such as assisting visually-impaired people. TextVQA is a VQA dataset
geared towards this problem, where the questions require answering systems to
read and reason about visual objects and text objects in images. One key
challenge in TextVQA is the design of a system that effectively reasons not
only about visual and text objects individually, but also about the spatial
relationships between these objects. This motivates the use of 'edge features',
that is, information about the relationship between each pair of objects. Some
current TextVQA models address this problem but either only use categories of
relations (rather than edge feature vectors) or do not use edge features within
the Transformer architectures. In order to overcome these shortcomings, we
propose a Graph Relation Transformer (GRT), which uses edge information in
addition to node information for graph attention computation in the
Transformer. We find that, without using any other optimizations, the proposed
GRT method improves accuracy over the M4C baseline model by 0.65% on the
val set and 0.57% on the test set. Qualitatively, we observe that the GRT has
superior spatial reasoning ability to M4C.
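The abstract describes GRT only at a high level: pairwise edge features are used alongside node features when computing graph attention inside the Transformer. As a minimal sketch of how such edge-aware attention could look, the snippet below adds a learned scalar bias, derived from each pairwise edge feature, to the standard scaled dot-product attention logits. The function name, the projection `We`, and the exact injection point are illustrative assumptions, not the paper's formulation, which may differ.

```python
import numpy as np

def edge_aware_attention(X, E, Wq, Wk, Wv, We):
    """Single-head attention over N objects that also uses pairwise edge features.

    X  : (N, d)       node features (visual/text object embeddings)
    E  : (N, N, d_e)  edge features for each ordered object pair (e.g. spatial relations)
    Wq, Wk, Wv : (d, d_h)  node projection matrices
    We : (d_e, 1)     projects an edge feature to a scalar attention bias (assumed design)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_h = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_h)          # node-only attention logits, shape (N, N)
    scores = scores + (E @ We).squeeze(-1)   # add edge-derived bias, shape (N, N)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # row-wise softmax
    return attn @ V                                # updated node features, (N, d_h)
```

In a TextVQA setting, the edge feature for a pair of objects could, for example, encode their relative bounding-box offsets, so that the attention weights depend on where one object sits with respect to another rather than on node content alone.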
Related papers
- Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN).
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z)
- DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation [13.058196732927135]
Scene graph generation aims to capture detailed spatial and semantic relationships between objects in an image.
Existing Transformer-based methods either employ distinct queries for objects and predicates or utilize holistic queries for relation triplets.
We present a new Transformer-based method, called DSGG, that views scene graph detection as a direct graph prediction problem.
arXiv Detail & Related papers (2024-03-21T23:43:30Z)
- Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image.
The models trained on this dataset predict biased answers due to the lack of understanding of visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z)
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and adds negligible extra computational cost.
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
- Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering [23.083935053799145]
Text-based Visual Question Answering (TextVQA) aims to produce correct answers to questions about images that contain multiple scene texts.
We introduce 3D geometric information into a human-like spatial reasoning process to capture key objects' contextual knowledge.
Our method achieves state-of-the-art performance on TextVQA and ST-VQA datasets.
arXiv Detail & Related papers (2022-09-21T12:49:14Z)
- Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z)
- Question-Driven Graph Fusion Network For Visual Question Answering [15.098694655795168]
We propose a Question-Driven Graph Fusion Network (QD-GFN).
It first models semantic, spatial, and implicit visual relations in images with three graph attention networks; question information is then used to guide the aggregation of the three graphs.
Experiment results demonstrate that our QD-GFN outperforms the prior state-of-the-art on both VQA 2.0 and VQA-CP v2 datasets.
arXiv Detail & Related papers (2022-04-03T03:02:03Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- LaTr: Layout-Aware Transformer for Scene-Text VQA [8.390314291424263]
We propose a novel architecture for Scene Text Visual Question Answering (STVQA).
We show that applying this pre-training scheme on scanned documents has certain advantages over using natural images.
Compared to existing approaches, our method performs vocabulary-free decoding and, as shown, generalizes well beyond the training vocabulary.
arXiv Detail & Related papers (2021-12-23T12:41:26Z)
- Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms the SoTA models on the TextVQA dataset and on two tasks of the ST-VQA dataset, among all models except the pre-training-based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)
- GPS-Net: Graph Property Sensing Network for Scene Graph Generation [91.60326359082408]
Scene graph generation (SGG) aims to detect objects in an image along with their pairwise relationships.
GPS-Net fully explores three properties for SGG: edge direction information, the difference in priority between nodes, and the long-tailed distribution of relationships.
GPS-Net achieves state-of-the-art performance on three popular databases (VG, OI, and VRD) with significant gains under various settings and metrics.
arXiv Detail & Related papers (2020-03-29T07:22:31Z)