Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA
- URL: http://arxiv.org/abs/2310.09147v1
- Date: Fri, 13 Oct 2023 14:39:34 GMT
- Title: Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA
- Authors: Sheng Zhou, Dan Guo, Jia Li, Xun Yang, Meng Wang
- Abstract summary: We propose a sparse spatial graph network (SSGN) that introduces a spatially aware relation pruning technique to this task.
Experimental results on the TextVQA and ST-VQA datasets demonstrate that SSGN achieves promising performance.
- Score: 45.98167752508643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based visual question answering (TextVQA) faces the significant
challenge of avoiding redundant relational inference. To be specific, a large
number of detected objects and optical character recognition (OCR) tokens
result in rich visual relationships. Existing works take all visual
relationships into account for answer prediction. However, there are three
observations: (1) a single subject in the images can be easily detected as
multiple objects with distinct bounding boxes (considered repetitive objects).
The associations between these repetitive objects are superfluous for answer
reasoning; (2) two spatially distant OCR tokens detected in the image
frequently have weak semantic dependencies for answer reasoning; and (3) the
co-existence of nearby objects and tokens may be indicative of important visual
cues for predicting answers. Rather than utilizing all of them for answer
prediction, we make an effort to identify the most important connections or
eliminate redundant ones. We propose a sparse spatial graph network (SSGN) that
introduces a spatially aware relation pruning technique to this task. As
spatial factors for relation measurement, we employ spatial distance, geometric
dimension, overlap area, and DIoU for spatially aware pruning. We consider
three visual relationships for graph learning: object-object, OCR-OCR tokens,
and object-OCR token relationships. SSGN is a progressive graph learning
architecture that verifies the pivotal relations in the correlated object-token
sparse graph, and then in the respective object-based sparse graph and
token-based sparse graph. Experimental results on the TextVQA and ST-VQA
datasets demonstrate that SSGN achieves promising performance, and
visualization results further demonstrate the interpretability of our method.
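
As spatial factors the abstract names spatial distance, geometric dimension, overlap area, and DIoU. The sketch below is a minimal, illustrative take on how a DIoU-based edge filter could realize observations (1) and (2); the `keep_edge` rule and its thresholds are assumptions made here for illustration, not the authors' implementation, which combines all four factors.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def diou(a: Box, b: Box) -> float:
    """Distance-IoU: IoU minus the squared center distance normalized by
    the squared diagonal of the smallest enclosing box (range (-1, 1])."""
    # Intersection area of the two axis-aligned boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter + 1e-9)
    # Squared distance between box centers
    cax, cay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cbx, cby = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    center_dist2 = (cax - cbx) ** 2 + (cay - cby) ** 2
    # Squared diagonal of the smallest box enclosing both
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
    return iou - center_dist2 / diag2

def keep_edge(a: Box, b: Box, dup_thresh: float = 0.9,
              far_thresh: float = -0.5) -> bool:
    """Hypothetical edge filter: drop near-duplicate boxes (observation 1,
    DIoU close to 1) and spatially distant pairs (observation 2, strongly
    negative DIoU). Both thresholds are illustrative, not from the paper."""
    d = diou(a, b)
    return far_thresh <= d <= dup_thresh

# Usage: a near-duplicate detection is pruned, a nearby distinct box is kept.
print(keep_edge((10, 10, 50, 50), (11, 10, 50, 51)))  # False: repetitive
print(keep_edge((10, 10, 50, 50), (40, 30, 90, 70)))  # True: nearby, kept
```

In SSGN itself such a filter would apply to object-object, OCR-OCR, and object-OCR pairs before the progressive graph learning stage; the single scalar rule here stands in for whatever learned combination of the four spatial factors the paper uses.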
Related papers
- Generalized Visual Relation Detection with Diffusion Models [94.62313788626128]
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image.
We propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner.
Our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets.
arXiv Detail & Related papers (2025-04-16T14:03:24Z)
- EGTR: Extracting Graph from Transformer for Scene Graph Generation [5.935927309154952]
Scene Graph Generation (SGG) is a challenging task of detecting objects and predicting relationships between objects.
We propose a lightweight one-stage SGG model that extracts the relation graph from the various relationships learned in the multi-head self-attention layers of the DETR decoder.
We demonstrate the effectiveness and efficiency of our method on the Visual Genome and Open Images V6 datasets.
arXiv Detail & Related papers (2024-04-02T16:20:02Z)
- Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN).
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z)
- Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z)
- Relation Regularized Scene Graph Generation [206.76762860019065]
Scene graph generation (SGG) is built on top of detected objects to predict object pairwise visual relations.
We propose a relation regularized network (R2-Net) which can predict whether there is a relationship between two objects.
Our R2-Net can effectively refine object labels and generate scene graphs.
arXiv Detail & Related papers (2022-02-22T11:36:49Z)
- Interactive Visual Pattern Search on Graph Data via Graph Representation Learning [20.795511688640296]
We propose a visual analytics system GraphQ to support human-in-the-loop, example-based, subgraph pattern search.
To support fast, interactive queries, we use graph neural networks (GNNs) to encode a graph as a fixed-length latent vector representation.
We also propose a novel GNN for node-alignment called NeuroAlign to facilitate easy validation and interpretation of the query results.
arXiv Detail & Related papers (2022-02-18T22:30:28Z)
- Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for Visual Question Answering (VQA) task.
It not only builds a graph for the image but also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z)
- Fully Convolutional Scene Graph Generation [30.194961716870186]
This paper presents a fully convolutional scene graph generation (FCSGG) model that detects objects and relations simultaneously.
FCSGG encodes objects as bounding box center points, and relationships as 2D vector fields called Relation Affinity Fields (RAFs).
FCSGG achieves highly competitive results on recall and zero-shot recall with significantly reduced inference time.
arXiv Detail & Related papers (2021-03-30T05:25:38Z)
- Iterative Context-Aware Graph Inference for Visual Dialog [126.016187323249]
We propose a novel Context-Aware Graph (CAG) neural network.
Each node in the graph corresponds to a joint semantic feature, including both object-based (visual) and history-related (textual) context representations.
arXiv Detail & Related papers (2020-04-05T13:09:37Z)
- Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves state-of-the-art performance on the Flickr30K dataset and competitive performance on the MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)