Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA
- URL: http://arxiv.org/abs/2310.09147v1
- Date: Fri, 13 Oct 2023 14:39:34 GMT
- Title: Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA
- Authors: Sheng Zhou, Dan Guo, Jia Li, Xun Yang, Meng Wang
- Abstract summary: We propose a sparse spatial graph network (SSGN) that introduces a spatially aware relation pruning technique to this task.
Experimental results on the TextVQA and ST-VQA datasets demonstrate that SSGN achieves promising performance.
- Score: 45.98167752508643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based visual question answering (TextVQA) faces the significant
challenge of avoiding redundant relational inference. To be specific, a large
number of detected objects and optical character recognition (OCR) tokens
result in rich visual relationships. Existing works take all visual
relationships into account for answer prediction. However, there are three
observations: (1) a single subject in the images can be easily detected as
multiple objects with distinct bounding boxes (considered repetitive objects).
The associations between these repetitive objects are superfluous for answer
reasoning; (2) two spatially distant OCR tokens detected in the image
frequently have weak semantic dependencies for answer reasoning; and (3) the
co-existence of nearby objects and tokens may be indicative of important visual
cues for predicting answers. Rather than utilizing all of them for answer
prediction, we make an effort to identify the most important connections or
eliminate redundant ones. We propose a sparse spatial graph network (SSGN) that
introduces a spatially aware relation pruning technique to this task. As
spatial factors for relation measurement, we employ spatial distance, geometric
dimension, overlap area, and DIoU for spatially aware pruning. We consider
three visual relationships for graph learning: object-object, OCR-OCR tokens,
and object-OCR token relationships. SSGN is a progressive graph learning
architecture that verifies the pivotal relations in the correlated object-token
sparse graph, and then in the respective object-based sparse graph and
token-based sparse graph. Experimental results on the TextVQA and ST-VQA
datasets demonstrate that SSGN achieves promising performance, and
visualization results further demonstrate the interpretability of our method.
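
As spatial factors the abstract names spatial distance, geometric dimension, overlap area, and DIoU. The sketch below is a minimal, illustrative take on how a DIoU-based edge filter could realize observations (1) and (2); the `keep_edge` rule and its thresholds are assumptions made here for illustration, not the authors' implementation, which combines all four factors.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def diou(a: Box, b: Box) -> float:
    """Distance-IoU: IoU minus the squared center distance normalized by
    the squared diagonal of the smallest enclosing box (range (-1, 1])."""
    # Intersection area of the two axis-aligned boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter + 1e-9)
    # Squared distance between box centers
    cax, cay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cbx, cby = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    center_dist2 = (cax - cbx) ** 2 + (cay - cby) ** 2
    # Squared diagonal of the smallest box enclosing both
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
    return iou - center_dist2 / diag2

def keep_edge(a: Box, b: Box, dup_thresh: float = 0.9,
              far_thresh: float = -0.5) -> bool:
    """Hypothetical edge filter: drop near-duplicate boxes (observation 1,
    DIoU close to 1) and spatially distant pairs (observation 2, strongly
    negative DIoU). Both thresholds are illustrative, not from the paper."""
    d = diou(a, b)
    return far_thresh <= d <= dup_thresh

# Usage: a near-duplicate detection is pruned, a nearby distinct box is kept.
print(keep_edge((10, 10, 50, 50), (11, 10, 50, 51)))  # False: repetitive
print(keep_edge((10, 10, 50, 50), (40, 30, 90, 70)))  # True: nearby, kept
```

In SSGN itself such a filter would apply to object-object, OCR-OCR, and object-OCR pairs before the progressive graph learning stage; the single scalar rule here stands in for whatever learned combination of the four spatial factors the paper uses.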
Related papers
- Generalized Visual Relation Detection with Diffusion Models [94.62313788626128]
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image.
We propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner.
Our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets.
arXiv Detail & Related papers (2025-04-16T14:03:24Z)
- EGTR: Extracting Graph from Transformer for Scene Graph Generation [5.935927309154952]
Scene Graph Generation (SGG) is a challenging task of detecting objects and predicting relationships between objects.
We propose a lightweight one-stage SGG model that extracts the relation graph from the various relationships learned in the multi-head self-attention layers of the DETR decoder.
We demonstrate the effectiveness and efficiency of our method on the Visual Genome and Open Images V6 datasets.
arXiv Detail & Related papers (2024-04-02T16:20:02Z)
- Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN).
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z)
- Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z)
- Relation Regularized Scene Graph Generation [206.76762860019065]
Scene graph generation (SGG) is built on top of detected objects to predict object pairwise visual relations.
We propose a relation regularized network (R2-Net) which can predict whether there is a relationship between two objects.
Our R2-Net can effectively refine object labels and generate scene graphs.
arXiv Detail & Related papers (2022-02-22T11:36:49Z)
- Interactive Visual Pattern Search on Graph Data via Graph Representation Learning [20.795511688640296]
We propose a visual analytics system GraphQ to support human-in-the-loop, example-based, subgraph pattern search.
To support fast, interactive queries, we use graph neural networks (GNNs) to encode a graph as a fixed-length latent vector representation.
We also propose a novel GNN for node-alignment called NeuroAlign to facilitate easy validation and interpretation of the query results.
arXiv Detail & Related papers (2022-02-18T22:30:28Z)
- Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for Visual Question Answering (VQA) task.
It not only builds a graph for the image but also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z)
- Fully Convolutional Scene Graph Generation [30.194961716870186]
This paper presents a fully convolutional scene graph generation (FCSGG) model that detects objects and relations simultaneously.
FCSGG encodes objects as bounding box center points, and relationships as 2D vector fields called Relation Affinity Fields (RAFs).
FCSGG achieves highly competitive results on recall and zero-shot recall with significantly reduced inference time.
arXiv Detail & Related papers (2021-03-30T05:25:38Z)
- Iterative Context-Aware Graph Inference for Visual Dialog [126.016187323249]
We propose a novel Context-Aware Graph (CAG) neural network.
Each node in the graph corresponds to a joint semantic feature, including both object-based (visual) and history-related (textual) context representations.
arXiv Detail & Related papers (2020-04-05T13:09:37Z)
- Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves state-of-the-art performance on the Flickr30K dataset and competitive performance on the MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)