Related papers: GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets

GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets

URL: http://arxiv.org/abs/2404.04924v1
Date: Sun, 7 Apr 2024 11:48:07 GMT
Title: GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets
Authors: Dongjing Shan, guiqiang chen,
Abstract summary: We propose a Graph-based Vision Transformer (GvT) that utilizes graph convolutional projection and graph-pooling. GvT produces comparable or superior outcomes to deep convolutional networks without pre-training on large datasets.
Score: 1.1586742546971471
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision Transformers (ViTs) have achieved impressive results in large-scale image classification. However, when training from scratch on small datasets, there is still a significant performance gap between ViTs and Convolutional Neural Networks (CNNs), which is attributed to the lack of inductive bias. To address this issue, we propose a Graph-based Vision Transformer (GvT) that utilizes graph convolutional projection and graph-pooling. In each block, queries and keys are calculated through graph convolutional projection based on the spatial adjacency matrix, while dot-product attention is used in another graph convolution to generate values. When using more attention heads, the queries and keys become lower-dimensional, making their dot product an uninformative matching function. To overcome this low-rank bottleneck in attention heads, we employ talking-heads technology based on bilinear pooled features and sparse selection of attention tensors. This allows interaction among filtered attention scores and enables each attention mechanism to depend on all queries and keys. Additionally, we apply graph-pooling between two intermediate blocks to reduce the number of tokens and aggregate semantic information more effectively. Our experimental results show that GvT produces comparable or superior outcomes to deep convolutional networks and surpasses vision transformers without pre-training on large datasets. The code for our proposed model is publicly available on the website.

Related papers

TANGNN: a Concise, Scalable and Effective Graph Neural Networks with Top-m Attention Mechanism for Graph Representation Learning [7.879217146851148]
We propose an innovative Graph Neural Network (GNN) architecture that integrates a Top-m attention mechanism aggregation component and a neighborhood aggregation component. To assess the effectiveness of our proposed model, we have applied it to citation sentiment prediction, a novel task previously unexplored in the GNN field.
arXiv Detail & Related papers (2024-11-23T05:31:25Z)
SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers [0.0]
Vision Transformers (ViTs) have redefined image classification by leveraging self-attention to capture complex patterns and long-range dependencies between image patches. A key challenge for ViTs is efficiently incorporating multi-scale feature representations, which is inherent in convolutional neural networks (CNNs) through their hierarchical structure. We propose SAG-ViT, a Scale-Aware Graph Attention ViT that integrates multi-scale feature capabilities of CNNs, representational power of ViTs, graph-attended patching to enable richer contextual representation.
arXiv Detail & Related papers (2024-11-14T13:15:27Z)
Cell Graph Transformer for Nuclei Classification [78.47566396839628]
We develop a cell graph transformer (CGT) that treats nodes and edges as input tokens to enable learnable adjacency and information exchange among all nodes. Poorly features can lead to noisy self-attention scores and inferior convergence. We propose a novel topology-aware pretraining method that leverages a graph convolutional network (GCN) to learn a feature extractor.
arXiv Detail & Related papers (2024-02-20T12:01:30Z)
GraphGPT: Generative Pre-trained Graph Eulerian Transformer [8.675197550607358]
We introduce a novel generative pre-trained model for graph learning based on the Graph Eulerian Transformer (GET) GraphGPT achieves performance comparable to or surpassing state-of-the-art methods on multiple large-scale Open Graph Benchmark (OGB) datasets. Notably, generative pre-training enables scaling GraphGPT to 2 billion parameters while maintaining performance gains.
arXiv Detail & Related papers (2023-12-31T16:19:30Z)
Deep Prompt Tuning for Graph Transformers [55.2480439325792]
Fine-tuning is resource-intensive and requires storing multiple copies of large models. We propose a novel approach called deep graph prompt tuning as an alternative to fine-tuning. By freezing the pre-trained parameters and only updating the added tokens, our approach reduces the number of free parameters and eliminates the need for multiple model copies.
arXiv Detail & Related papers (2023-09-18T20:12:17Z)
Multimodal Graph Transformer for Multimodal Question Answering [9.292566397511763]
We propose a novel Multimodal Graph Transformer for question answering tasks that requires performing reasoning across multiple modalities. We introduce a graph-involved plug-and-play quasi-attention mechanism to incorporate multimodal graph information. We validate the effectiveness of Multimodal Graph Transformer over its Transformer baselines on GQA, VQAv2, and MultiModalQA datasets.
arXiv Detail & Related papers (2023-04-30T21:22:35Z)
Graph Mixture of Experts: Learning on Large-Scale Graphs with Explicit Diversity Modeling [60.0185734837814]
Graph neural networks (GNNs) have found extensive applications in learning from graph data. To bolster the generalization capacity of GNNs, it has become customary to augment training graph structures with techniques like graph augmentations. This study introduces the concept of Mixture-of-Experts (MoE) to GNNs, with the aim of augmenting their capacity to adapt to a diverse range of training graph structures.
arXiv Detail & Related papers (2023-04-06T01:09:36Z)
Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision. A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive. We propose a dynamic graph message passing network, that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
Neural Graph Matching for Pre-training Graph Neural Networks [72.32801428070749]
Graph neural networks (GNNs) have been shown powerful capacity at modeling structural data. We present a novel Graph Matching based GNN Pre-Training framework, called GMPT. The proposed method can be applied to fully self-supervised pre-training and coarse-grained supervised pre-training.
arXiv Detail & Related papers (2022-03-03T09:53:53Z)
Vision Transformer with Deformable Attention [29.935891419574602]
Large, sometimes even global, receptive field endows Transformer models with higher representation power over their CNN counterparts. We propose a novel deformable self-attention module, where the positions of key and value pairs in self-attention are selected in a data-dependent way. We present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks.
arXiv Detail & Related papers (2022-01-03T08:29:01Z)
Graph Contrastive Learning with Augmentations [109.23158429991298]
We propose a graph contrastive learning (GraphCL) framework for learning unsupervised representations of graph data. We show that our framework can produce graph representations of similar or better generalizability, transferrability, and robustness compared to state-of-the-art methods.
arXiv Detail & Related papers (2020-10-22T20:13:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.