AdaptViG: Adaptive Vision GNN with Exponential Decay Gating
- URL: http://arxiv.org/abs/2511.09942v1
- Date: Fri, 14 Nov 2025 01:20:18 GMT
- Title: AdaptViG: Adaptive Vision GNN with Exponential Decay Gating
- Authors: Mustafa Munir, Md Mostafijur Rahman, Radu Marculescu
- Abstract summary: AdaptViG is an efficient and powerful hybrid Vision GNN that introduces a novel graph construction mechanism called Adaptive Graph Convolution. Our method achieves a new state-of-the-art trade-off between accuracy and efficiency among Vision GNNs.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Graph Neural Networks (ViGs) offer a new direction for advancements in vision architectures. While powerful, ViGs often face substantial computational challenges stemming from their graph construction phase, which can hinder their efficiency. To address this issue, we propose AdaptViG, an efficient and powerful hybrid Vision GNN that introduces a novel graph construction mechanism called Adaptive Graph Convolution. This mechanism builds upon a highly efficient static axial scaffold and a dynamic, content-aware gating strategy called Exponential Decay Gating. This gating mechanism selectively weighs long-range connections based on feature similarity. Furthermore, AdaptViG employs a hybrid strategy, utilizing our efficient gating mechanism in the early stages and a full Global Attention block in the final stage for maximum feature aggregation. Our method achieves a new state-of-the-art trade-off between accuracy and efficiency among Vision GNNs. For instance, our AdaptViG-M achieves 82.6% top-1 accuracy, outperforming ViG-B by 0.3% while using 80% fewer parameters and 84% fewer GMACs. On downstream tasks, AdaptViG-M obtains 45.8 mIoU, 44.8 APbox, and 41.1 APmask, surpassing the much larger EfficientFormer-L7 by 0.7 mIoU, 2.2 APbox, and 2.1 APmask, respectively, with 78% fewer parameters.
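The abstract names the ingredients of Adaptive Graph Convolution but not its exact equations, so the sketch below is a minimal PyTorch-style reading of it: a static axial scaffold (every patch connects along its row and column) whose edges are gated by an exponential decay in feature distance. The kernel form exp(-alpha * ||x_i - x_j||^2), the `alpha` parameter, and the normalized aggregation are assumptions for illustration, not the paper's verified formulation.

```python
import torch

def axial_scaffold(H, W):
    """Static axial mask: patch (i, j) connects to every patch in its
    row or column of the H x W grid. Returns a bool (HW, HW) matrix."""
    idx = torch.arange(H * W)
    rows, cols = idx // W, idx % W
    return (rows[:, None] == rows[None, :]) | (cols[:, None] == cols[None, :])

def adaptive_graph_conv(x, H, W, alpha=1.0):
    """Hypothetical Adaptive Graph Convolution: gate each axial edge by
    exp(-alpha * squared feature distance), then aggregate neighbors.

    x: (HW, D) patch features for one image; alpha: assumed decay rate.
    """
    gate = torch.exp(-alpha * torch.cdist(x, x).pow(2))  # exponential decay gating
    gate = gate * axial_scaffold(H, W).to(x.device)      # keep only axial edges
    return gate @ x / gate.sum(-1, keepdim=True)         # normalized aggregation
```

In this reading, similar patches keep strong long-range links while dissimilar ones decay toward zero, which matches the abstract's "selectively weighs long-range connections based on feature similarity."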
Related papers
- Multi-Scale High-Resolution Logarithmic Grapher Module for Efficient Vision GNNs
Vision graph neural networks (ViGs) have demonstrated promise in vision tasks as a competitive alternative to conventional convolutional neural networks (CNNs) and vision transformers (ViTs). We propose a new graph construction method, Logarithmic Scalable Graph Construction (LSGC), to enhance performance by limiting the number of long-range links. Our smallest model, Ti-LogViG, achieves an average top-1 accuracy on ImageNet-1K of 71.7% with a standard deviation of 0.2%.
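The summary above says LSGC limits long-range links but not how. One common logarithmic pattern, used here purely as an illustration and not taken from the paper, connects each node to neighbors at exponentially growing offsets so its degree scales as O(log n):

```python
def log_offsets(n, i):
    """Hypothetical logarithmic link pattern along one axis: node i links
    to nodes at offsets +/- 2^j, giving O(log n) long-range edges."""
    nbrs, j = [], 0
    while (1 << j) < n:
        for cand in (i - (1 << j), i + (1 << j)):
            if 0 <= cand < n:
                nbrs.append(cand)
        j += 1
    return nbrs

# e.g. log_offsets(16, 5) -> [4, 6, 3, 7, 1, 9, 13]: seven links instead of fifteen
```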
arXiv Detail & Related papers (2025-10-15T16:47:09Z)
- ClusterViG: Efficient Globally Aware Vision GNNs via Image Partitioning
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have dominated the field of Computer Vision (CV). Recent works addressing the graph construction bottleneck in ViGs impose constraints on the flexibility of GNNs to build unstructured graphs. We propose a novel method called Dynamic Efficient Graph Convolution (DEGC) for designing efficient and globally aware ViGs.
arXiv Detail & Related papers (2025-01-18T02:59:10Z)
- GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs
Vision graph neural networks (ViGs) offer a new avenue for exploration in computer vision.
A major bottleneck in ViGs is the inefficient k-nearest neighbor (KNN) operation used for graph construction.
We propose a new method for designing ViGs, Dynamic Axial Graph Construction (DAGC), which is more efficient than KNN.
We also propose a novel CNN-GNN architecture, GreedyViG, which uses DAGC.
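DAGC is only named in the summary above, not specified. As a rough sketch assumed from that summary rather than the paper itself, a dynamic axial rule can keep an axial edge only when its endpoints' feature distance falls below the mean distance, avoiding KNN's per-node sort entirely:

```python
import torch

def dynamic_axial_edges(x, axial_mask):
    """Assumed dynamic axial rule: keep an axial edge only if its
    endpoints are closer in feature space than the mean distance.

    x: (N, D) node features; axial_mask: (N, N) bool row/column mask.
    """
    d = torch.cdist(x, x)                # (N, N) pairwise feature distances
    return axial_mask & (d <= d.mean())  # prune dissimilar axial links, no sort
```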
arXiv Detail & Related papers (2024-05-10T23:21:16Z)
- HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs
Hybrid deep models of Vision Transformer (ViT) and Convolutional Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks.
We present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT) that upgrades the prevalent four-stage ViT to a five-stage ViT tailored for high-resolution inputs.
HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448$\times$448 inputs, absolutely improving on the 83.4% of iFormer-S by 0.9% with 224$\times$224 inputs.
arXiv Detail & Related papers (2024-03-18T17:34:29Z)
- HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation
This paper proposes a High-Efficiency Vision Transformer for Human Pose Estimation (HEViTPose).
In HEViTPose, a Cascaded Group Spatial Reduction Multi-Head Attention Module (CGSR-MHA) is proposed, which reduces the computational cost.
Comprehensive experiments on two benchmark datasets (MPII and COCO) demonstrate that the small and large HEViTPose models are on par with state-of-the-art models.
arXiv Detail & Related papers (2023-11-22T06:45:16Z) - T-GAE: Transferable Graph Autoencoder for Network Alignment [79.89704126746204]
T-GAE is a graph autoencoder framework that leverages the transferability and stability of GNNs to achieve efficient network alignment without retraining.
Our experiments demonstrate that T-GAE outperforms the state-of-the-art optimization method and the best GNN approach by up to 38.7% and 50.8%, respectively.
arXiv Detail & Related papers (2023-10-05T02:58:29Z)
- PVG: Progressive Vision Graph for Vision Recognition
We propose a Progressive Vision Graph (PVG) architecture for vision recognition tasks. PVG contains three main components: 1) Progressively Separated Graph Construction (PSGC), 2) a neighbor-node information aggregation and update module, and 3) Graph error Linear Unit (GraphLU).
arXiv Detail & Related papers (2023-08-01T14:35:29Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% computation reduction, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, addresses these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- Global Context Vision Transformers
We propose the global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z)
- EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves a global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
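As a companion to the summary above, here is a generic linear-attention sketch with a ReLU feature map; reordering the matmuls gives every query a global receptive field at O(N·D^2) cost rather than softmax attention's O(N^2·D). Whether EfficientViT uses exactly this kernel and normalization is an assumption.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """ReLU-kernel linear attention (a common formulation; the exact
    EfficientViT variant is an assumption).

    q, k, v: (B, N, D) tensors; returns (B, N, D).
    """
    q, k = torch.relu(q), torch.relu(k)
    kv = torch.einsum('bnd,bne->bde', k, v)                # sum_n of k_n v_n^T
    z = torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps  # per-query normalizer
    return torch.einsum('bnd,bde->bne', q, kv) / z.unsqueeze(-1)
```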
arXiv Detail & Related papers (2022-05-29T20:07:23Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
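AdaViT's halting mechanism is learned end to end; the sketch below shows only the mechanical part of token reduction, keeping the top-scoring fraction of tokens at a layer boundary. The score source and `keep_ratio` are placeholders, not AdaViT's actual policy.

```python
import torch

def prune_tokens(x, scores, keep_ratio=0.7):
    """Keep the highest-scoring tokens so later layers process fewer.

    x: (B, N, D) tokens; scores: (B, N) importance (placeholder here).
    """
    k = max(1, int(x.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices  # (B, k) indices of kept tokens
    return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
```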
arXiv Detail & Related papers (2021-12-14T18:56:07Z)