MobileViG: Graph-Based Sparse Attention for Mobile Vision Applications
- URL: http://arxiv.org/abs/2307.00395v1
- Date: Sat, 1 Jul 2023 17:49:12 GMT
- Title: MobileViG: Graph-Based Sparse Attention for Mobile Vision Applications
- Authors: Mustafa Munir, William Avery, Radu Marculescu
- Abstract summary: Vision graph neural networks (ViGs) provide a new avenue for exploration.
ViGs are computationally expensive due to the overhead of representing images as graph structures.
We propose a new graph-based sparse attention mechanism, Sparse Vision Graph Attention (SVGA), that is designed for ViGs running on mobile devices.
- Score: 7.2210216531805695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditionally, convolutional neural networks (CNN) and vision transformers
(ViT) have dominated computer vision. However, recently proposed vision graph
neural networks (ViG) provide a new avenue for exploration. Unfortunately, for
mobile applications, ViGs are computationally expensive due to the overhead of
representing images as graph structures. In this work, we propose a new
graph-based sparse attention mechanism, Sparse Vision Graph Attention (SVGA),
that is designed for ViGs running on mobile devices. Additionally, we propose
the first hybrid CNN-GNN architecture for vision tasks on mobile devices,
MobileViG, which uses SVGA. Extensive experiments show that MobileViG beats
existing ViG models and existing mobile CNN and ViT architectures in terms of
accuracy and/or speed on image classification, object detection, and instance
segmentation tasks. Our fastest model, MobileViG-Ti, achieves 75.7% top-1
accuracy on ImageNet-1K with 0.78 ms inference latency on iPhone 13 Mini NPU
(compiled with CoreML), which is faster than MobileNetV2x1.4 (1.02 ms, 74.7%
top-1) and MobileNetV2x1.0 (0.81 ms, 71.8% top-1). Our largest model,
MobileViG-B, obtains 82.6% top-1 accuracy with only 2.30 ms latency, which is
faster and more accurate than the similarly sized EfficientFormer-L3 model
(2.77 ms, 82.4%). Our work proves that well-designed hybrid CNN-GNN
architectures can be a new avenue of exploration for designing models that are
extremely fast and accurate on mobile devices. Our code is publicly available
at https://github.com/SLDGroup/MobileViG.
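As a rough illustration of the idea behind SVGA, the sketch below builds a static, input-independent graph that links each token to every Kth token along its row and column and aggregates neighbors with a max-relative update, as in common ViG formulations. The hop distance, channel width, and the `mr_conv` projection are illustrative assumptions, not the authors' exact implementation (see the official repository for that).

```python
import torch
import torch.nn as nn


class SVGASketch(nn.Module):
    """Minimal sketch of a Sparse Vision Graph Attention (SVGA)-style block.

    Instead of building a KNN graph, each token is connected to every
    K-th token along its own row and column (a static, input-independent
    graph), and neighbors are aggregated with a max-relative update as in
    common ViG formulations. Hop size and projection width are assumptions.
    """

    def __init__(self, dim: int, hop: int = 8):
        super().__init__()
        self.hop = hop
        # Project the concatenation of the token and its max-relative
        # neighbor feature back to the original channel width.
        self.mr_conv = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map of image tokens.
        B, C, H, W = x.shape
        # Zero init acts as a floor, keeping only non-negative relative responses.
        max_rel = torch.zeros_like(x)
        # Gather neighbors by rolling the feature map along rows and
        # columns at multiples of the hop distance (static sparse graph).
        for shift in range(self.hop, max(H, W), self.hop):
            if shift < W:  # neighbors along the row (left/right)
                max_rel = torch.maximum(max_rel, torch.roll(x, shifts=shift, dims=3) - x)
                max_rel = torch.maximum(max_rel, torch.roll(x, shifts=-shift, dims=3) - x)
            if shift < H:  # neighbors along the column (up/down)
                max_rel = torch.maximum(max_rel, torch.roll(x, shifts=shift, dims=2) - x)
                max_rel = torch.maximum(max_rel, torch.roll(x, shifts=-shift, dims=2) - x)
        # Max-relative graph convolution: fuse each token with its most
        # salient relative neighbor feature.
        return self.mr_conv(torch.cat([x, max_rel], dim=1))


if __name__ == "__main__":
    # Example: a 14x14 token grid with 192 channels, typical of a small ViG stage.
    feats = torch.randn(1, 192, 14, 14)
    out = SVGASketch(dim=192, hop=8)(feats)
    print(out.shape)  # torch.Size([1, 192, 14, 14])
```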
Related papers
- Scaling Graph Convolutions for Mobile Vision [6.4399181389092]
This paper introduces Mobile Graph Convolution (MGC), a new vision graph neural network (ViG) module that addresses the poor scaling of graph construction in existing ViGs.
Our proposed mobile vision architecture, MobileViGv2, uses MGC to demonstrate the effectiveness of our approach.
Our largest model, MobileViGv2-B, achieves an 83.4% top-1 accuracy, 0.8% higher than MobileViG-B, with 2.7 ms inference latency.
arXiv Detail & Related papers (2024-06-09T16:49:19Z)
- GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs [5.895049552752008]
Vision graph neural networks (ViG) offer a new avenue for exploration in computer vision.
A major bottleneck in ViGs is the inefficient k-nearest neighbor (KNN) operation used for graph construction.
We propose a new method for designing ViGs, Dynamic Axial Graph Construction (DAGC), which is more efficient than KNN.
We also propose a novel CNN-GNN architecture, GreedyViG, which uses DAGC.
arXiv Detail & Related papers (2024-05-10T23:21:16Z)
- MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices [73.46317110474064]
MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices.
It comprises a set of language models at the scale of 1.4B and 2.7B parameters trained from scratch, and a multimodal vision model pre-trained in the CLIP fashion.
arXiv Detail & Related papers (2023-12-28T08:21:24Z)
- RepViT: Revisiting Mobile CNN From ViT Perspective [67.05569159984691]
Lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs).
In this study, we revisit the efficient design of lightweight CNNs from ViT perspective and emphasize their promising prospect for mobile devices.
arXiv Detail & Related papers (2023-07-18T14:24:33Z)
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
- GhostNetV2: Enhance Cheap Operation with Long-Range Attention [59.65543143580889]
We propose a hardware-friendly attention mechanism (dubbed DFC attention) and then present a new GhostNetV2 architecture for mobile applications.
The proposed DFC attention is built from fully-connected layers, which not only execute fast on common hardware but also capture dependencies between long-range pixels.
We further revisit the bottleneck in previous GhostNet and propose to enhance expanded features produced by cheap operations with DFC attention.
arXiv Detail & Related papers (2022-11-23T12:16:59Z)
- EfficientFormer: Vision Transformers at MobileNet Speed [43.93223983817965]
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks.
ViT-based models are generally several times slower than lightweight convolutional networks.
Recent efforts try to reduce the complexity of ViTs through network architecture search or hybrid design with MobileNet blocks, yet the inference speed is still unsatisfactory.
arXiv Detail & Related papers (2022-06-02T17:51:03Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer [24.47196590256829]
We introduce MobileViT, a light-weight vision transformer for mobile devices.
Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets.
arXiv Detail & Related papers (2021-10-05T17:07:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.