A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective
- URL: http://arxiv.org/abs/2209.13232v4
- Date: Wed, 14 Aug 2024 09:05:15 GMT
- Title: A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective
- Authors: Chaoqi Chen, Yushuang Wu, Qiyuan Dai, Hong-Yu Zhou, Mutian Xu, Sibei Yang, Xiaoguang Han, Yizhou Yu
- Abstract summary: Graph Neural Networks (GNNs) have gained momentum in graph representation learning.
Graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation.
This paper presents a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective.
- Score: 71.03621840455754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Graph Neural Networks (GNNs) have gained momentum in graph representation learning and boosted the state of the art in a variety of areas, such as data mining (e.g., social network analysis and recommender systems), computer vision (e.g., object detection and point cloud learning), and natural language processing (e.g., relation extraction and sequence learning), to name a few. With the emergence of Transformers in natural language processing and computer vision, graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation while avoiding strict structural inductive biases. In this paper, we present a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective. Specifically, we divide their applications in computer vision into five categories according to the modality of input data, i.e., 2D natural images, videos, 3D data, vision + language, and medical images. In each category, we further divide the applications according to a set of vision tasks. Such a task-oriented taxonomy allows us to examine how each task is tackled by different GNN-based approaches and how well these approaches perform. Based on the necessary preliminaries, we provide the definitions and challenges of the tasks, in-depth coverage of the representative approaches, as well as discussions regarding insights, limitations, and future directions.
Related papers
- A Review of Transformer-Based Models for Computer Vision Tasks: Capturing Global Context and Spatial Relationships [0.5639904484784127]
Transformer-based models have transformed the landscape of natural language processing (NLP).
These models are renowned for their ability to capture long-range dependencies and contextual information.
We discuss potential research directions and applications of transformer-based models in computer vision.
arXiv Detail & Related papers (2024-08-27T16:22:18Z) - Graph Transformers: A Survey [15.68583521879617]
Graph transformers are a recent advancement in machine learning, offering a new class of neural network models for graph-structured data.
This survey provides an in-depth review of recent progress and challenges in graph transformer research.
arXiv Detail & Related papers (2024-07-13T05:15:24Z) - A Survey on Structure-Preserving Graph Transformers [2.5252594834159643]
We provide a comprehensive overview of structure-preserving graph transformers and generalize these methods from the perspective of their design objective.
We also discuss challenges and future directions for graph transformer models to preserve the graph structure and understand the nature of graphs.
arXiv Detail & Related papers (2024-01-29T14:18:09Z) - Graph Neural Networks in Vision-Language Image Understanding: A Survey [6.813036707969848]
2D image understanding is a complex problem within computer vision.
It holds the key to providing human-level scene comprehension.
In recent years graph neural networks (GNNs) have become a standard component of many 2D image understanding pipelines.
arXiv Detail & Related papers (2023-03-07T09:56:23Z) - Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z) - SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z) - Graph Neural Networks: Methods, Applications, and Opportunities [1.2183405753834562]
This article provides a comprehensive survey of graph neural networks (GNNs) in each learning setting.
The approaches for each learning task are analyzed from both theoretical as well as empirical standpoints.
Various applications and benchmark datasets are also provided, along with open challenges still plaguing the general applicability of GNNs.
arXiv Detail & Related papers (2021-08-24T13:46:19Z) - Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z) - Learning Physical Graph Representations from Visual Scenes [56.7938395379406]
Physical Scene Graphs (PSGs) represent scenes as hierarchical graphs with nodes corresponding intuitively to object parts at different scales, and edges to physical connections between parts.
PSGNet augments standard CNNs by including: recurrent feedback connections to combine low and high-level image information; graph pooling and vectorization operations that convert spatially-uniform feature maps into object-centric graph structures.
We show that PSGNet outperforms alternative self-supervised scene representation algorithms at scene segmentation tasks.
arXiv Detail & Related papers (2020-06-22T16:10:26Z) - GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training [62.73470368851127]
Graph representation learning has emerged as a powerful technique for addressing real-world problems.
We design Graph Contrastive Coding -- a self-supervised graph neural network pre-training framework.
We conduct experiments on three graph learning tasks and ten graph datasets.
arXiv Detail & Related papers (2020-06-17T16:18:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.