Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator
for Vision Applications
- URL: http://arxiv.org/abs/2401.06197v1
- Date: Thu, 11 Jan 2024 14:53:24 GMT
- Title: Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator
for Vision Applications
- Authors: Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng
Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie Zhou, Jifeng
Dai
- Abstract summary: We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications.
DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements.
It demonstrates exceptional performance across various tasks, including image classification, instance and semantic segmentation, and notably, image generation.
- Score: 108.44482683870888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Deformable Convolution v4 (DCNv4), a highly efficient and
effective operator designed for a broad spectrum of vision applications. DCNv4
addresses the limitations of its predecessor, DCNv3, with two key enhancements:
(1) removing softmax normalization in spatial aggregation to enhance its dynamic
property and expressive power, and (2) optimizing memory access to minimize
redundant operations for speedup. These improvements result in a significantly
faster convergence compared to DCNv3 and a substantial increase in processing
speed, with DCNv4 achieving more than three times the forward speed. DCNv4
demonstrates exceptional performance across various tasks, including image
classification, instance and semantic segmentation, and notably, image
generation. When integrated into generative models like U-Net in the latent
diffusion model, DCNv4 outperforms its baseline, underscoring its potential
to enhance generative models. In practical applications, replacing DCNv3 with
DCNv4 in the InternImage model to create FlashInternImage results in up to an
80% speed increase and further performance gains without any other
modifications. The advancements in speed and efficiency of DCNv4, combined with
its robust performance across diverse vision tasks, show its potential as a
foundational building block for future vision models.
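To make the first enhancement concrete, here is a minimal PyTorch sketch of the aggregation step (an illustration only, not the official fused CUDA kernel): each output location gathers K bilinearly sampled values at learned offsets and combines them with raw, unnormalized modulation scalars, whereas DCNv3 would first softmax-normalize the K weights. Grouped channels and the memory-access optimizations are omitted.

import torch
import torch.nn.functional as F

def deformable_aggregate(x, offsets, weights, use_softmax=False):
    # x: (B, C, H, W) input features
    # offsets: (B, K, H, W, 2) learned sampling offsets in normalized [-1, 1] units
    # weights: (B, K, H, W) raw modulation scalars predicted per location
    B, C, H, W = x.shape
    K = offsets.shape[1]
    # Base sampling grid in normalized coordinates; grid_sample expects (x, y).
    ys = torch.linspace(-1, 1, H, device=x.device)
    xs = torch.linspace(-1, 1, W, device=x.device)
    base = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1).flip(-1)
    if use_softmax:
        # DCNv3-style: the K weights are normalized to sum to 1, bounding
        # the aggregation and limiting its dynamic range.
        weights = weights.softmax(dim=1)
    out = torch.zeros_like(x)
    for k in range(K):
        grid = base.unsqueeze(0) + offsets[:, k]              # (B, H, W, 2)
        sampled = F.grid_sample(x, grid, align_corners=True)  # bilinear gather
        out = out + sampled * weights[:, k].unsqueeze(1)      # unbounded in DCNv4
    return out

With use_softmax=False the modulation weights behave like unnormalized convolution weights, which is the dynamic-range argument the abstract makes for DCNv4's faster convergence.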
Related papers
- iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency [0.0]
We introduce iiANET (Inception Inspired Attention Network), an efficient hybrid model designed to capture long-range dependencies in complex images.
The fundamental building block, iiABlock, integrates global 2D-MHSA (Multi-Head Self-Attention) with Registers, MBConv2 (MobileNetV2-based convolution), and dilated convolution in parallel.
We serially integrate an ECANET (Efficient Channel Attention Network) at the end of each iiABlock to calibrate channel-wise attention for enhanced model performance (a sketch of this channel-attention step appears after this list).
arXiv Detail & Related papers (2024-07-10T12:39:02Z)
- BDC-Occ: Binarized Deep Convolution Unit For Binarized Occupancy Network [55.21288428359509]
Existing 3D occupancy networks demand significant hardware resources, hindering deployment on edge devices.
We propose a novel binarized deep convolution (BDC) unit that effectively enhances performance while increasing the number of binarized convolutional layers.
Our BDC-Occ model is created by applying the proposed BDC unit to binarize the existing 3D occupancy networks.
arXiv Detail & Related papers (2024-05-27T10:44:05Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method that improves computational efficiency by pruning redundant visual tokens after the early layers (see the token-pruning sketch after this list).
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
- Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures [99.20299078655376]
This paper introduces Vision-RWKV, a model adapted from the RWKV model used in the NLP field.
Our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities.
Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage.
arXiv Detail & Related papers (2024-03-04T18:46:20Z)
- ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
- Dynamic Mobile-Former: Strengthening Dynamic Convolution with Attention and Residual Connection in Kernel Space [4.111899441919165]
Dynamic Mobile-Former maximizes the capabilities of dynamic convolution by harmonizing it with efficient operators.
A Transformer in Dynamic Mobile-Former requires only a few randomly initialized tokens to calculate global features.
A bridge between the Dynamic MobileNet and the Transformer allows for bidirectional integration of local and global features.
arXiv Detail & Related papers (2023-04-13T05:22:24Z)
- Dual Complementary Dynamic Convolution for Image Recognition [13.864357201410648]
We propose a novel two-branch dual complementary dynamic convolution (DCDC) operator for convolutional neural networks (CNNs).
The DCDC operator overcomes the limitations of vanilla convolution and most existing dynamic convolutions, which capture only spatial-adaptive features.
Experiments show that the DCDC operator based ResNets (DCDC-ResNets) significantly outperform vanilla ResNets and most state-of-the-art dynamic convolutional networks on image classification.
arXiv Detail & Related papers (2022-11-11T12:32:12Z)
- SD-Conv: Towards the Parameter-Efficiency of Dynamic Convolution [16.56592303409295]
Dynamic convolution achieves better performance for efficient CNNs at the cost of a negligible FLOPs increase.
We propose a new framework, Sparse Dynamic Convolution (SD-Conv), to naturally integrate these two paths (an illustrative sketch appears after this list).
arXiv Detail & Related papers (2022-04-05T14:03:54Z)
- Incremental Training and Group Convolution Pruning for Runtime DNN Performance Scaling on Heterogeneous Embedded Platforms [23.00896228073755]
Inference for Deep Neural Networks is increasingly being executed locally on mobile and embedded platforms.
In this paper, we present a dynamic DNN using incremental training and group convolution pruning.
It achieved a 10.6x (energy) and 41.6x (time) wider dynamic range when combined with task mapping and DVFS.
arXiv Detail & Related papers (2021-05-08T05:38:01Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements over state-of-the-art real-time methods on the UCF101 action recognition benchmark: 5.4% higher accuracy and 2x faster inference, with a model requiring less than 5 MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
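The ECA step referenced in the iiANET entry above is a standard, very cheap channel-attention module. A minimal sketch assuming the usual ECA-Net formulation (the kernel size is a tunable hyperparameter; iiANET's exact integration may differ):

import torch
import torch.nn as nn

class ECA(nn.Module):
    # Efficient Channel Attention: global average pooling, a 1D convolution
    # across the channel dimension, and a sigmoid gate that rescales channels.
    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                      # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                 # (B, C) global average pool
        w = self.conv(w.unsqueeze(1))          # (B, 1, C) local cross-channel interaction
        w = torch.sigmoid(w).squeeze(1)        # (B, C) per-channel gates
        return x * w[:, :, None, None]         # recalibrate the feature map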
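The FastV entry above cuts LVLM inference cost by dropping visual tokens that receive little attention. A hypothetical sketch of that idea (function and parameter names are illustrative, not the paper's API): rank visual tokens by the average attention they receive in one layer and keep only the top fraction.

import torch

def prune_visual_tokens(hidden, attn, vis_start, vis_end, keep_ratio=0.5):
    # hidden: (B, N, D) hidden states; attn: (B, heads, N, N) attention weights.
    # Average attention each visual token receives, over heads and query positions.
    scores = attn.mean(dim=1).mean(dim=1)[:, vis_start:vis_end]   # (B, N_vis)
    n_keep = max(1, int(scores.shape[1] * keep_ratio))
    keep = scores.topk(n_keep, dim=1).indices.sort(dim=1).values + vis_start
    batch = torch.arange(hidden.shape[0]).unsqueeze(1)
    kept_vis = hidden[batch, keep]                                # (B, n_keep, D)
    # Reassemble the sequence: text before, surviving visual tokens, text after.
    return torch.cat([hidden[:, :vis_start], kept_vis, hidden[:, vis_end:]], dim=1)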
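For the SD-Conv entry, the two paths being integrated are dynamic convolution (an input-conditioned mixture of expert kernels) and weight sparsity. The sketch below is an assumption-laden illustration of that combination, not the paper's algorithm: it applies a fixed binary mask to DY-Conv-style expert kernels, where a real method would learn or prune the mask.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseDynamicConv2d(nn.Module):
    def __init__(self, cin, cout, k=3, num_experts=4, sparsity=0.5):
        super().__init__()
        self.cout, self.k = cout, k
        self.weight = nn.Parameter(torch.randn(num_experts, cout, cin, k, k) * 0.02)
        # Fixed random binary mask standing in for a learned/pruned one.
        self.register_buffer("mask", (torch.rand_like(self.weight) > sparsity).float())
        self.router = nn.Linear(cin, num_experts)   # input-dependent mixing weights

    def forward(self, x):                            # x: (B, Cin, H, W)
        B, Cin, H, W = x.shape
        gates = F.softmax(self.router(x.mean(dim=(2, 3))), dim=-1)  # (B, experts)
        sparse_experts = self.weight * self.mask
        # Per-sample kernel as a convex combination of the sparse experts.
        w = torch.einsum("be,eoihw->boihw", gates, sparse_experts)
        # Batched per-sample convolution via the grouped-conv trick.
        out = F.conv2d(x.reshape(1, B * Cin, H, W),
                       w.reshape(B * self.cout, Cin, self.k, self.k),
                       padding=self.k // 2, groups=B)
        return out.view(B, self.cout, H, W)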
This list is automatically generated from the titles and abstracts of the papers on this site.