Dynamic Mobile-Former: Strengthening Dynamic Convolution with Attention
and Residual Connection in Kernel Space
- URL: http://arxiv.org/abs/2304.07254v1
- Date: Thu, 13 Apr 2023 05:22:24 GMT
- Title: Dynamic Mobile-Former: Strengthening Dynamic Convolution with Attention
and Residual Connection in Kernel Space
- Authors: Seokju Yun, Youngmin Ro
- Abstract summary: Dynamic Mobile-Former maximizes the capabilities of dynamic convolution by harmonizing it with efficient operators.
A Transformer in Dynamic Mobile-Former only requires a few randomly initialized tokens to calculate global features.
The bridge between Dynamic MobileNet and the Transformer allows for bidirectional integration of local and global features.
- Score: 4.111899441919165
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce Dynamic Mobile-Former (DMF), which maximizes the capabilities of
dynamic convolution by harmonizing it with efficient operators. Our Dynamic
Mobile-Former effectively utilizes the advantages of Dynamic MobileNet
(MobileNet equipped with dynamic convolution) using global information from
light-weight attention. A Transformer in Dynamic Mobile-Former only requires a
few randomly initialized tokens to calculate global features, making it
computationally efficient. A bridge between Dynamic MobileNet and the
Transformer allows for bidirectional integration of local and global
features. We also simplify the optimization process of vanilla dynamic
convolution by splitting the convolution kernel into an input-agnostic kernel
and an input-dependent kernel. This allows for optimization in a wider kernel
space, resulting in enhanced capacity. By integrating lightweight attention and
enhanced dynamic convolution, our Dynamic Mobile-Former achieves not only high
efficiency but also strong performance. We benchmark Dynamic Mobile-Former
on a series of vision tasks and show that it achieves impressive
performance on image classification, COCO detection, and instance
segmentation. For example, our DMF reaches a top-1 accuracy of 79.4% on
ImageNet-1K, 4.3% higher than PVT-Tiny with only 1/4 of the
FLOPs. Additionally, our proposed DMF-S model performs well on challenging
vision datasets such as COCO, achieving 39.0% mAP, which is 1% higher than
that of the Mobile-Former 508M model despite using 3 GFLOPs less
computation. Code and models are available at https://github.com/ysj9909/DMF
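The kernel-space split described in the abstract (an input-agnostic kernel plus an input-dependent kernel) can be illustrated with a short sketch. The PyTorch-style module below is a minimal illustration under assumed design choices, not the authors' implementation (see the linked repository for that); all class, parameter, and hyper-parameter names are hypothetical. A shared static kernel acts as a residual in kernel space, while a lightweight routing head predicts per-sample mixture coefficients for a small bank of dynamic kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitDynamicConv2d(nn.Module):
    """Sketch of a dynamic convolution whose effective kernel is
    W(x) = W_static + sum_k pi_k(x) * W_k, i.e. an input-agnostic kernel
    plus an input-dependent kernel (a residual connection in kernel space).
    Names are illustrative, not taken from the DMF code base."""

    def __init__(self, in_ch, out_ch, kernel_size=3, num_experts=4, reduction=4):
        super().__init__()
        self.out_ch, self.kernel_size = out_ch, kernel_size
        # Input-agnostic kernel, shared by every input.
        self.static_weight = nn.Parameter(
            torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        # Bank of kernels combined with input-dependent coefficients.
        self.expert_weights = nn.Parameter(
            torch.randn(num_experts, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        # Lightweight routing head: global average pool -> MLP -> softmax over the bank.
        self.router = nn.Sequential(
            nn.Linear(in_ch, in_ch // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(in_ch // reduction, num_experts))

    def forward(self, x):
        b, c, h, w = x.shape
        # Per-sample mixture coefficients pi(x).
        pi = F.softmax(self.router(x.mean(dim=(2, 3))), dim=-1)           # (b, K)
        dyn = torch.einsum('bk,koicd->boicd', pi, self.expert_weights)    # input-dependent kernel
        weight = self.static_weight.unsqueeze(0) + dyn                    # residual in kernel space
        # Apply a different kernel to each sample via the grouped-conv trick.
        weight = weight.reshape(b * self.out_ch, c, self.kernel_size, self.kernel_size)
        out = F.conv2d(x.reshape(1, b * c, h, w), weight,
                       padding=self.kernel_size // 2, groups=b)
        return out.reshape(b, self.out_ch, h, w)
```

Decoupling the two terms lets the static part carry the input-agnostic component while the dynamic part only has to model the per-sample deviation, which is how the abstract motivates optimizing in a wider kernel space.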
Related papers
- KernelWarehouse: Rethinking the Design of Dynamic Convolution [16.101179962553385]
KernelWarehouse redefines the basic concepts of "kernels", "assembling kernels" and "attention function".
We testify the effectiveness of KernelWarehouse on ImageNet and MS-COCO datasets using various ConvNet architectures.
arXiv Detail & Related papers (2024-06-12T05:16:26Z)
- Efficient Modulation for Vision Networks [122.1051910402034]
We propose efficient modulation, a novel design for efficient vision networks.
We demonstrate that the modulation mechanism is particularly well suited for efficient networks.
Our network achieves a better trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-29T03:48:35Z)
- SGDM: Static-Guided Dynamic Module Make Stronger Visual Models [0.9012198585960443]
The spatial attention mechanism has been widely used to improve object detection performance.
We propose Razor Dynamic Convolution (RDConv) to address the two flaws in dynamic weight convolution.
We introduce the mechanism of shared weights in static convolution to solve the problem of dynamic convolution being sensitive to high-frequency noise.
arXiv Detail & Related papers (2024-03-27T06:18:40Z)
- Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications [108.44482683870888]
We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications.
DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements.
It demonstrates exceptional performance across various tasks, including image classification, instance and semantic segmentation, and notably, image generation.
arXiv Detail & Related papers (2024-01-11T14:53:24Z)
- DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z)
- Vision Transformer Computation and Resilience for Dynamic Inference [3.6929360462568077]
We leverage the resilience of vision transformers to pruning and switch between different scaled versions of a model.
Most FLOPs are generated by convolutions, not attention.
Some models are fairly resilient and their model execution can be adapted without retraining.
arXiv Detail & Related papers (2022-12-06T01:10:31Z)
- PAD-Net: An Efficient Framework for Dynamic Networks [72.85480289152719]
Common practice in implementing dynamic networks is to convert the given static layers into fully dynamic ones.
We propose a partially dynamic network, namely PAD-Net, to transform the redundant dynamic parameters into static ones.
Our method is comprehensively supported by large-scale experiments with two typical advanced dynamic architectures.
arXiv Detail & Related papers (2022-11-10T12:42:43Z)
- SD-Conv: Towards the Parameter-Efficiency of Dynamic Convolution [16.56592303409295]
Dynamic convolution achieves better performance for efficient CNNs at the cost of negligible FLOPs increase.
We propose a new framework, Sparse Dynamic Convolution (SD-Conv), to naturally integrate these two paths.
arXiv Detail & Related papers (2022-04-05T14:03:54Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs with diverse difficulty levels.
We present dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++) by input-dependently adjusting filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- Revisiting Dynamic Convolution via Matrix Decomposition [81.89967403872147]
We propose dynamic channel fusion to replace dynamic attention over channel groups.
Our method is easier to train and requires significantly fewer parameters without sacrificing accuracy.
arXiv Detail & Related papers (2021-03-15T23:03:18Z)
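The last entry above ("Revisiting Dynamic Convolution via Matrix Decomposition") takes a related route, replacing per-kernel attention with dynamic channel fusion in a small latent space. The sketch below is a rough, hypothetical illustration of that general idea for a 1x1 convolution, with assumed names and dimensions; it is not the paper's code.

```python
import torch
import torch.nn as nn

class DynamicChannelFusion1x1(nn.Module):
    """Sketch of a 1x1 convolution with a low-rank dynamic term:
    W(x) = W0 + P @ Phi(x) @ Q^T, where Phi(x) is a small L x L
    channel-fusion matrix predicted per sample. Names are illustrative."""

    def __init__(self, in_ch, out_ch, latent=8, reduction=4):
        super().__init__()
        self.latent = latent
        self.W0 = nn.Parameter(torch.randn(out_ch, in_ch) * 0.02)   # static 1x1 kernel
        self.P = nn.Parameter(torch.randn(out_ch, latent) * 0.02)   # latent -> output channels
        self.Q = nn.Parameter(torch.randn(in_ch, latent) * 0.02)    # input channels -> latent
        # Predict the L*L fusion matrix Phi(x) from globally pooled features.
        self.phi_head = nn.Sequential(
            nn.Linear(in_ch, in_ch // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(in_ch // reduction, latent * latent))

    def forward(self, x):
        b, c, h, w = x.shape
        phi = self.phi_head(x.mean(dim=(2, 3))).view(b, self.latent, self.latent)
        # Per-sample 1x1 kernel: W(x) = W0 + P @ Phi(x) @ Q^T, shape (b, out_ch, in_ch).
        dyn = torch.einsum('ol,bls,cs->boc', self.P, phi, self.Q)
        weight = self.W0.unsqueeze(0) + dyn
        # Apply each sample's kernel as a channel-mixing matmul (equivalent to a 1x1 conv).
        return torch.einsum('boc,bchw->bohw', weight, x)
```

Because the dynamic term lives in a small L-dimensional latent space rather than over full kernels, the number of per-sample parameters stays low, which matches the parameter-efficiency claim summarized in the entry above.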