Graph Neural Networks in Vision-Language Image Understanding: A Survey
- URL: http://arxiv.org/abs/2303.03761v2
- Date: Fri, 12 Apr 2024 06:42:47 GMT
- Title: Graph Neural Networks in Vision-Language Image Understanding: A Survey
- Authors: Henry Senior, Gregory Slabaugh, Shanxin Yuan, Luca Rossi
- Abstract summary: 2D image understanding is a complex problem within computer vision.
It holds the key to providing human-level scene comprehension.
In recent years, graph neural networks (GNNs) have become a standard component of many 2D image understanding pipelines.
- Score: 6.813036707969848
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 2D image understanding is a complex problem within computer vision, but it holds the key to providing human-level scene comprehension. It goes further than identifying the objects in an image; instead, it attempts to understand the scene. Solutions to this problem form the underpinning of a range of tasks, including image captioning, visual question answering (VQA), and image retrieval. Graphs provide a natural way to represent the relational arrangement between objects in an image, and thus, in recent years, graph neural networks (GNNs) have become a standard component of many 2D image understanding pipelines, serving as a core architectural element, especially in the VQA group of tasks. In this survey, we review this rapidly evolving field and provide a taxonomy of graph types used in 2D image understanding approaches, a comprehensive list of the GNN models used in this domain, and a roadmap of potential future developments. To the best of our knowledge, this is the first comprehensive survey covering image captioning, visual question answering, and image retrieval techniques that use GNNs as the main part of their architecture.
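To make the survey's central idea concrete, here is a minimal, hypothetical sketch (not any specific model covered by the survey) of how an image's objects and relations can be stored as a scene graph and refined with one GCN-style message-passing step; the node labels, feature sizes, and weights are all illustrative:

```python
import numpy as np

# A toy scene graph: detected objects become nodes, pairwise relations
# ("riding", "standing on", ...) become edges between them.
nodes = ["person", "horse", "field"]
edges = [(0, 1, "riding"), (1, 2, "standing on")]  # (subject, object, predicate)

rng = np.random.default_rng(0)
H = rng.normal(size=(len(nodes), 8))    # node features standing in for detector embeddings
W_self = rng.normal(size=(8, 8)) * 0.1  # transform applied to the node itself
W_nbr = rng.normal(size=(8, 8)) * 0.1   # transform applied to aggregated neighbours

def gnn_layer(H, edges):
    """One GCN-style round: mean-aggregate neighbours, transform, apply ReLU."""
    agg = np.zeros_like(H)
    deg = np.ones(len(H))               # start at 1 to avoid division by zero
    for s, o, _ in edges:               # predicates ignored in this toy version
        agg[s] += H[o]; agg[o] += H[s]  # treat edges as undirected here
        deg[s] += 1; deg[o] += 1
    msg = agg / deg[:, None]            # mean aggregation
    return np.maximum(0.0, H @ W_self + msg @ W_nbr)

H = gnn_layer(H, edges)
print(H.shape)  # (3, 8): relation-aware node embeddings
```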
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective [71.03621840455754]
Graph Neural Networks (GNNs) have gained momentum in graph representation learning.
Graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation.
This paper presents a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective.
arXiv Detail & Related papers (2022-09-27T08:10:14Z)
- Vision GNN: An Image is Worth Graph of Nodes [49.3335689216822]
We propose to represent the image as a graph structure and introduce a new Vision GNN (ViG) architecture to extract graph-level features for visual tasks.
Based on the graph representation of images, we build our ViG model to transform and exchange information among all the nodes.
Extensive experiments on image recognition and object detection tasks demonstrate the superiority of our ViG architecture.
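As a rough illustration of the image-as-graph idea, the sketch below embeds image patches, links each patch to its k nearest neighbours in feature space, and applies one max-relative-style graph update. The image size, patch size, k, and weights are assumptions made for brevity, not ViG's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))                   # stand-in for a real image

# Split into non-overlapping patches and flatten each into a feature vector.
P = 16                                          # patch size (assumed)
patches = img.reshape(64 // P, P, 64 // P, P, 3).transpose(0, 2, 1, 3, 4)
X = patches.reshape(-1, P * P * 3)              # (16, 768) patch vectors

# Connect every patch to its k nearest neighbours in feature space.
k = 4
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
np.fill_diagonal(d2, np.inf)                    # no self-loops
nbrs = np.argsort(d2, axis=1)[:, :k]            # (16, 4) neighbour indices

# One max-relative-style graph convolution: each node is updated with the
# largest per-dimension offset among its neighbours.
W = rng.normal(size=(2 * X.shape[1], X.shape[1])) * 0.01
diff = X[nbrs] - X[:, None, :]                  # (N, k, d) neighbour offsets
msg = diff.max(axis=1)                          # max aggregation
X = np.maximum(0.0, np.concatenate([X, msg], axis=1) @ W)
print(X.shape)  # (16, 768): graph-refined patch features
```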
arXiv Detail & Related papers (2022-06-01T07:01:04Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for the Visual Question Answering (VQA) task.
First, it builds a graph for the image and also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships with a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
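For intuition, here is a minimal single-head sketch of cross-modality matching attention between an image graph and a question graph. The feature sizes, shared weights, and residual form are assumptions for illustration, not the GMA network's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 16))   # 5 image-graph nodes (e.g. detected objects)
Q = rng.normal(size=(7, 16))   # 7 question-graph nodes (e.g. parsed words)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(A, B, Wq, Wk, Wv):
    """Refine the nodes of graph A with content attended from graph B."""
    scores = (A @ Wq) @ (B @ Wk).T / np.sqrt(16)  # (|A|, |B|) matching matrix
    return A + softmax(scores) @ (B @ Wv)         # residual cross-graph update

Wq, Wk, Wv = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
V2 = cross_attend(V, Q, Wq, Wk, Wv)   # image nodes conditioned on the question
Q2 = cross_attend(Q, V, Wq, Wk, Wv)   # question nodes grounded in the image
print(V2.shape, Q2.shape)             # (5, 16) (7, 16)
```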
arXiv Detail & Related papers (2021-12-14T10:01:26Z)
- Scene Graph Generation with Geometric Context [12.074766935042586]
A scene graph, a visually grounded graphical structure of an image, greatly helps to simplify image understanding tasks.
We introduce a post-processing algorithm called Geometric Context to better understand visual scenes geometrically.
We exploit this context by calculating the direction and distance between object pairs.
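A small sketch of that computation, assuming axis-aligned bounding boxes and an illustrative 8-way direction binning (the paper's exact scheme may differ):

```python
import math

def geometric_context(box_a, box_b, img_w, img_h):
    """Direction and normalised distance from box_a to box_b, given (x1, y1, x2, y2) boxes."""
    acx, acy = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bcx, bcy = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    dx, dy = bcx - acx, bcy - acy
    dist = math.hypot(dx, dy) / math.hypot(img_w, img_h)  # scale-invariant distance
    angle = math.degrees(math.atan2(dy, dx)) % 360        # 0 deg = rightwards, y points down
    directions = ["E", "SE", "S", "SW", "W", "NW", "N", "NE"]
    bin_idx = int(((angle + 22.5) % 360) // 45)           # coarse 8-way compass bin
    return dist, directions[bin_idx]

# A pair of boxes in a 640x480 image: the second lies to the right of the first.
print(geometric_context((10, 10, 50, 50), (200, 40, 260, 120), 640, 480))
```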
arXiv Detail & Related papers (2021-11-25T15:42:21Z)
- Understanding the Role of Scene Graphs in Visual Question Answering [26.02889386248289]
We conduct experiments on the GQA dataset, which presents a challenging set of questions requiring counting, compositionality, and advanced reasoning capability.
We adopt image + question architectures for use with scene graphs, evaluate various scene graph generation techniques for unseen images, and propose a training curriculum to leverage human-annotated and auto-generated scene graphs.
We present a multi-faceted study into the use of scene graphs for Visual Question Answering, making this work the first of its kind.
arXiv Detail & Related papers (2021-01-14T07:27:37Z)
- Multi-Modal Retrieval using Graph Neural Networks [1.8911962184174562]
We learn a joint vision and concept embedding in the same high-dimensional space.
We model the visual and concept relationships as a graph structure.
We also introduce a novel inference-time control based on selective neighborhood connectivity.
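The abstract gives few details, but one plausible reading of such a control is sketched below: the graph is fixed, and at query time a similarity threshold decides which neighbours contribute to the aggregated embedding, so a single knob trades breadth against precision. The cosine similarity, threshold values, and adjacency are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 8))                  # joint image/concept embeddings
E /= np.linalg.norm(E, axis=1, keepdims=True)
adj = {0: [1, 2, 3], 1: [0, 4], 2: [0, 5], 3: [0], 4: [1], 5: [2]}

def aggregate(node, tau):
    """Mean of the node and its sufficiently similar neighbours (cosine >= tau)."""
    keep = [n for n in adj[node] if E[node] @ E[n] >= tau]
    return E[[node] + keep].mean(axis=0)

# Lower tau widens the neighbourhood (broader retrieval); higher tau narrows it.
for tau in (-1.0, 0.0, 0.5):
    q = aggregate(0, tau)
    scores = E @ q                           # similarity of every item to the query
    print(tau, np.argsort(-scores)[:3])      # top-3 retrieved nodes
```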
arXiv Detail & Related papers (2020-10-04T19:34:20Z)
- Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text [93.08109196909763]
We propose a novel VQA approach, the Multi-Modal Graph Neural Network (MM-GNN).
It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities, respectively.
It then introduces three aggregators that guide the message passing from one graph to another to utilize the contexts in the various modalities.
arXiv Detail & Related papers (2020-03-31T05:56:59Z)
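To illustrate the three-sub-graph design, here is a toy sketch in which simple dot-product-attention aggregators pass messages between visual, semantic, and numeric node sets. The feature sizes, the attention form, and the update order are assumptions, not MM-GNN's exact aggregators:

```python
import numpy as np

rng = np.random.default_rng(0)
graphs = {
    "visual":   rng.normal(size=(4, 16)),   # detected objects
    "semantic": rng.normal(size=(3, 16)),   # OCR tokens found in the image
    "numeric":  rng.normal(size=(2, 16)),   # numbers parsed from the scene text
}

def aggregator(dst, src):
    """Update dst nodes with attention-weighted messages from src nodes."""
    scores = dst @ src.T / np.sqrt(dst.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)       # row-wise softmax over src nodes
    return dst + w @ src                    # residual cross-graph update

# e.g. ground the scene text in its visual context, then the numbers in the text.
graphs["semantic"] = aggregator(graphs["semantic"], graphs["visual"])
graphs["numeric"] = aggregator(graphs["numeric"], graphs["semantic"])
for name, H in graphs.items():
    print(name, H.shape)
```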