Related papers: AttentionViG: Cross-Attention-Based Dynamic Neighbor Aggregation in Vision GNNs

URL: http://arxiv.org/abs/2509.25570v1
Date: Mon, 29 Sep 2025 22:47:48 GMT
Title: AttentionViG: Cross-Attention-Based Dynamic Neighbor Aggregation in Vision GNNs
Authors: Hakan Emre Gedik, Andrew Martin, Mustafa Munir, Oguzhan Baser, Radu Marculescu, Sandeep P. Chinchali, Alan C. Bovik
Abstract summary: Vision Graph Neural Networks (ViGs) have demonstrated promising performance in image recognition tasks against CNNs and Vision Transformers. An essential part of the ViG framework is the node-neighbor feature aggregation method. We propose a cross-attention-based aggregation method in which the query projections come from the node, while the key projections come from its neighbors.
Score: 40.43076513538705
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision Graph Neural Networks (ViGs) have demonstrated promising performance in image recognition tasks against Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). An essential part of the ViG framework is the node-neighbor feature aggregation method. Although various graph convolution methods, such as Max-Relative, EdgeConv, GIN, and GraphSAGE, have been explored, a versatile aggregation method that effectively captures complex node-neighbor relationships without requiring architecture-specific refinements is needed. To address this gap, we propose a cross-attention-based aggregation method in which the query projections come from the node, while the key projections come from its neighbors. Additionally, we introduce a novel architecture called AttentionViG that uses the proposed cross-attention aggregation scheme to conduct non-local message passing. We evaluated the image recognition performance of AttentionViG on the ImageNet-1K benchmark, where it achieved SOTA performance. Additionally, we assessed its transferability to downstream tasks, including object detection and instance segmentation on MS COCO 2017, as well as semantic segmentation on ADE20K. Our results demonstrate that the proposed method not only achieves strong performance, but also maintains efficiency, delivering competitive accuracy with comparable FLOPs to prior vision GNN architectures.
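The aggregation idea in the abstract can be illustrated with a minimal sketch. The abstract states only that query projections come from the node and key projections come from its neighbors; everything else here is an assumption made for illustration: the k-nearest-neighbor graph in feature space, the scaled dot-product scores, the softmax normalization, and the use of projected neighbor features as values. The names `knn`, `matvec`, and `cross_attention_aggregate` are hypothetical and are not taken from the paper.

```python
import math
import random


def matvec(w, v):
    """Multiply matrix w (a list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def knn(features, k):
    """For each node, return the indices of its k nearest other nodes.

    Assumption: neighbors are chosen by Euclidean distance in feature space.
    """
    result = []
    for i, fi in enumerate(features):
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        result.append([j for _, j in dists[:k]])
    return result


def cross_attention_aggregate(features, neighbors, w_q, w_k, w_v):
    """Aggregate neighbor features with attention weights per node.

    The query comes from the node and the keys come from its neighbors,
    as in the abstract. Values, scaling, and softmax are assumptions.
    """
    outputs, all_weights = [], []
    for i, nbrs in enumerate(neighbors):
        q = matvec(w_q, features[i])                     # query: the node itself
        keys = [matvec(w_k, features[j]) for j in nbrs]  # keys: its neighbors
        vals = [matvec(w_v, features[j]) for j in nbrs]  # values: assumed
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in keys]
        top = max(scores)                                # stabilize the softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out = [
            sum(w * val[d] for w, val in zip(weights, vals))
            for d in range(len(vals[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy usage: 12 nodes (e.g. image patches) with 6-dimensional features.
random.seed(0)
n, dim, k = 12, 6, 3
features = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]


def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]


w_q, w_k, w_v = rand_matrix(), rand_matrix(), rand_matrix()
neighbors = knn(features, k)
out, weights = cross_attention_aggregate(features, neighbors, w_q, w_k, w_v)
print(len(out), len(out[0]), len(weights[0]))
```

Because the weights depend on the node's own query, each node can weight the same neighbor differently, which is one way to read "dynamic neighbor aggregation" in the title. A fixed reduction such as an element-wise max over neighbors has no such per-node weighting.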

Related papers

GCRPNet: Graph-Enhanced Contextual and Regional Perception Network for Salient Object Detection in Optical Remote Sensing Images [68.33481681452675]
We propose a graph-enhanced contextual and regional perception network (GCRPNet). It builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information.
arXiv Detail & Related papers (2025-08-14T11:31:43Z)
Vision Graph Prompting via Semantic Low-Rank Decomposition [10.223578525761617]
Vision GNN (ViG) demonstrates superior performance by representing images as graph structures. To efficiently adapt ViG to downstream tasks, parameter-efficient fine-tuning techniques like visual prompting become increasingly essential. We propose Vision Graph Prompting (VGP), a novel framework tailored for vision graph structures.
arXiv Detail & Related papers (2025-05-07T04:29:29Z)
DVHGNN: Multi-Scale Dilated Vision HGNN for Efficient Vision Recognition [7.762533819978473]
We propose a novel vision architecture, termed Dilated Vision HyperGraph Neural Network (DVHGNN). DVHGNN is designed to leverage multi-scale hypergraph to efficiently capture high-order correlations among objects. Our DVHGNN-S achieves an impressive top-1 accuracy of 83.1% on ImageNet-1K, surpassing ViG-S by +1.0% and ViHGNN-S by +0.6%.
arXiv Detail & Related papers (2025-03-19T03:45:23Z)
ClusterViG: Efficient Globally Aware Vision GNNs via Image Partitioning [7.325055402812975]
Convolutional Neural Networks (CNN) and Vision Transformers (ViT) have dominated the field of Computer Vision (CV). Recent works addressing this bottleneck impose constraints on the flexibility of GNNs to build unstructured graphs. We propose a novel method called Dynamic Efficient Graph Convolution (DEGC) for designing efficient and globally aware ViGs.
arXiv Detail & Related papers (2025-01-18T02:59:10Z)
UnSeGArmaNet: Unsupervised Image Segmentation using Graph Neural Networks with Convolutional ARMA Filters [10.940349832919699]
We propose an unsupervised segmentation framework with a pre-trained ViT. By harnessing the graph structure inherent within the image, the proposed method achieves notable segmentation performance. The proposed method provides state-of-the-art performance (even comparable to supervised methods) on benchmark image segmentation datasets.
arXiv Detail & Related papers (2024-10-08T15:10:09Z)
Hierarchical Graph Interaction Transformer with Dynamic Token Clustering for Camouflaged Object Detection [57.883265488038134]
We propose a hierarchical graph interaction network termed HGINet for camouflaged object detection. The network is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features. Our experiments demonstrate the superior performance of HGINet compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2024-08-27T12:53:25Z)
BOURNE: Bootstrapped Self-supervised Learning Framework for Unified Graph Anomaly Detection [50.26074811655596]
We propose a novel unified graph anomaly detection framework based on bootstrapped self-supervised learning (named BOURNE). By swapping the context embeddings between nodes and edges, we enable the mutual detection of node and edge anomalies. BOURNE can eliminate the need for negative sampling, thereby enhancing its efficiency in handling large graphs.
arXiv Detail & Related papers (2023-07-28T00:44:57Z)
ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network [8.395400675921515]
ViGAT is a pure-attention bottom-up approach to derive object and frame features. A head network is proposed to process these features for the task of event recognition and explanation in video. A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three large, publicly available video datasets.
arXiv Detail & Related papers (2022-07-20T14:12:05Z)
A Variational Edge Partition Model for Supervised Graph Representation Learning [51.30365677476971]
This paper introduces a graph generative process to model how the observed edges are generated by aggregating the node interactions over a set of overlapping node communities. We partition each edge into the summation of multiple community-specific weighted edges and use them to define community-specific GNNs. A variational inference framework is proposed to jointly learn a GNN based inference network that partitions the edges into different communities, these community-specific GNNs, and a GNN based predictor that combines community-specific GNNs for the end classification task.
arXiv Detail & Related papers (2022-02-07T14:37:50Z)
Node Similarity Preserving Graph Convolutional Networks [51.520749924844054]
Graph Neural Networks (GNNs) explore the graph structure and node features by aggregating and transforming information within node neighborhoods. We propose SimP-GCN that can effectively and efficiently preserve node similarity while exploiting graph structure. We validate the effectiveness of SimP-GCN on seven benchmark datasets including three assortative and four disassortative graphs.
arXiv Detail & Related papers (2020-11-19T04:18:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.