Related papers: iFormer: Integrating ConvNet and Transformer for Mobile Application

iFormer: Integrating ConvNet and Transformer for Mobile Application

URL: http://arxiv.org/abs/2501.15369v2
Date: Mon, 17 Feb 2025 15:09:31 GMT
Title: iFormer: Integrating ConvNet and Transformer for Mobile Application
Authors: Chuanyang Zheng,
Abstract summary: iFormer integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention.<n>We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks.
Score: 0.6798775532273751
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are derived from transforming a standard convolutional network, \textit{i.e.}, ConvNeXt, to design a more lightweight mobile network. Our newly introduced mobile modulation attention removes memory-intensive operations in MHA and employs an efficient modulation mechanism to boost dynamic global representational capacity. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves an impressive Top-1 accuracy of 80.4\% on ImageNet-1k with a latency of only 1.10 ms on an iPhone 13, surpassing the recently proposed MobileNetV4 under similar latency constraints. Additionally, our method shows significant improvements in downstream tasks, including COCO object detection, instance segmentation, and ADE20k semantic segmentation, while still maintaining low latency on mobile devices for high-resolution inputs in these scenarios.

Related papers

CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks.<n>In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks.<n>To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z)
MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. We propose the MobileMamba framework, which balances efficiency and performance. MobileMamba achieves up to 83.6% on Top-1, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z)
CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [73.80247057590519]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability.<n>We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications.<n>Our model achieves 83.0%/84.1% top-1 with only 12M/21M parameters on ImageNet-1K.
arXiv Detail & Related papers (2024-08-07T11:33:46Z)
Lightweight Vision Transformer with Bidirectional Interaction [59.39874544410419]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformer to model the local and global information.<n>Based on FASA, we develop a family of lightweight vision backbones, Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)
Dynamic Mobile-Former: Strengthening Dynamic Convolution with Attention and Residual Connection in Kernel Space [4.111899441919165]
Dynamic Mobile-Former maximizes the capabilities of dynamic convolution by harmonizing it with efficient operators. PVT.A Transformer in Dynamic Mobile-Former only requires a few randomly calculate global features. Bridge between Dynamic MobileNet and Transformer allows for bidirectional integration of local and global features.
arXiv Detail & Related papers (2023-04-13T05:22:24Z)
SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications [98.90623605283564]
We introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. We build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2.
arXiv Detail & Related papers (2023-03-27T17:59:58Z)
Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency. We also introduce a novel fine-grained joint search strategy for transformer models. This work demonstrate that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
MobileOne: An Improved One millisecond Mobile Backbone [14.041480018494394]
We analyze different metrics by deploying several mobile-friendly networks on a mobile device. We design an efficient backbone MobileOne, with variants achieving an inference time under 1 ms on an iPhone12. We show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile.
arXiv Detail & Related papers (2022-06-08T17:55:11Z)
EfficientFormer: Vision Transformers at MobileNet Speed [43.93223983817965]
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. ViT-based models are generally times slower than lightweight convolutional networks. Recent efforts try to reduce the complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory.
arXiv Detail & Related papers (2022-06-02T17:51:03Z)
EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision. We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
MicroNet: Improving Image Recognition with Extremely Low FLOPs [82.54764264255505]
We find two factors, sparse connectivity and dynamic activation function, are effective to improve the accuracy. We present a new dynamic activation function, named Dynamic Shift Max, to improve the non-linearity. We arrive at a family of networks, named MicroNet, that achieves significant performance gains over the state of the art in the low FLOP regime.
arXiv Detail & Related papers (2021-08-12T17:59:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.